Cleaning data
Drop missing values, remove duplicates, and fix column types in pandas — the three moves that fix most data quality problems.
- Drop rows with missing values using .dropna()
- Remove duplicate rows using .drop_duplicates()
- Convert a column to the correct type using .astype()
Knowing the four data quality problems is not enough — you need to fix them. This
lesson covers the three pandas methods that handle the most common cases: .dropna(),
.drop_duplicates(), and .astype(). The code below creates a deliberately messy
DataFrame and cleans it step by step.
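A minimal sketch of such a messy DataFrame and the three cleaning steps might look like this. The column names and values are assumptions chosen for illustration, arranged so the row counts match the walkthrough that follows: 6 rows to start, 4 after .dropna(), and 3 after .drop_duplicates().

```python
import pandas as pd

# Hypothetical messy order data (names and values are illustrative assumptions).
raw = pd.DataFrame({
    "order_id": [1, 2, 3, 3, 4, 5],
    "customer": ["Ana", "Ben", "Cara", "Cara", None, "Eli"],
    "amount": ["10.50", "20.00", "15.25", "15.25", "8.00", None],
})

# Step 1: drop rows with any missing value (order 4 has no customer,
# order 5 has no amount).
no_missing = raw.dropna()

# Step 2: drop exact duplicate rows (the second copy of order 3).
deduped = no_missing.drop_duplicates()

# Step 3: convert amount from strings to floats and assign the result back.
clean = deduped.copy()
clean["amount"] = clean["amount"].astype(float)

print(len(raw), len(no_missing), len(deduped))  # 6 4 3
print(clean.dtypes)
```

Each step is assigned to a new name, so raw stays untouched and any intermediate result can be inspected.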
What each step does
.dropna() removes rows where any value is NaN or None. By default it
checks all columns; pass subset=["column_name"] to check only specific ones.
Here it drops two rows: the one with no customer name and the one where amount
is None. That leaves 4 rows, including both copies of order 3, because
.dropna() does not look at duplicates. Then .drop_duplicates() removes the
second occurrence of order 3 (both rows were identical), leaving 3 rows.
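The difference between the default behaviour and subset can be sketched as follows. The orders DataFrame and its values are hypothetical, made up only for this comparison.

```python
import pandas as pd

# Hypothetical data: order 2 has no customer, order 3 has no amount.
orders = pd.DataFrame({
    "order_id": [1, 2, 3],
    "customer": ["Ana", None, "Cara"],
    "amount": [10.5, 20.0, None],
})

# Default: a missing value in any column removes the row.
all_columns = orders.dropna()

# subset: only a missing customer removes the row.
customer_only = orders.dropna(subset=["customer"])

print(all_columns["order_id"].tolist())    # [1]
print(customer_only["order_id"].tolist())  # [1, 3]
```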
.drop_duplicates() compares entire rows by default. Pass subset=["order_id"]
if you want to deduplicate by a specific key column (keeping the first occurrence).
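As a rough illustration of whole-row versus key-based deduplication, the sketch below uses a hypothetical orders DataFrame in which order 2 appears twice with different amounts.

```python
import pandas as pd

# Hypothetical data: the two rows for order 2 differ in amount.
orders = pd.DataFrame({
    "order_id": [1, 2, 2, 3],
    "customer": ["Ana", "Ben", "Ben", "Cara"],
    "amount": [10.5, 20.0, 25.0, 15.25],
})

# Whole-row comparison: the rows are not identical, so nothing is removed.
whole_row = orders.drop_duplicates()

# Key-based comparison: only the first row per order_id is kept.
by_key = orders.drop_duplicates(subset=["order_id"])

print(len(whole_row))             # 4
print(by_key["amount"].tolist())  # [10.5, 20.0, 15.25]
```

Deduplicating by key discards the later row (amount 25.0) without checking which one is correct, so it is worth inspecting the conflicting rows first.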
.astype(float) returns a converted copy of the column rather than changing it
in place, so assign the result back to the column. If any value cannot be
converted, say the string "N/A", pandas raises a ValueError. That is usually
what you want: the error tells you there is another problem to fix, rather
than silently producing NaN.
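The failure mode can be sketched with a small hypothetical Series. For contrast, the example also shows pd.to_numeric with errors="coerce", which is the kind of silent NaN conversion that .astype() avoids. The sample values are assumptions.

```python
import pandas as pd

# Hypothetical amounts read in as strings, with one unparseable value.
amounts = pd.Series(["10.50", "N/A", "15.25"])

try:
    amounts.astype(float)
    failed = False
except ValueError as err:
    # The error points at the bad value, so it can be fixed deliberately.
    failed = True
    print(f"Conversion failed: {err}")

# Contrast: coercion hides the problem by turning "N/A" into NaN.
coerced = pd.to_numeric(amounts, errors="coerce")
print(coerced.isna().sum())  # 1
```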
pandas methods like .dropna() and .drop_duplicates() return a new DataFrame
by default — they do not modify the original. Assign the result to a new name
(as done above) or use inplace=True. Keeping the original around while you
experiment is a good habit.
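A quick way to see this for yourself, using a small hypothetical DataFrame:

```python
import pandas as pd

# Hypothetical data with one missing customer.
raw = pd.DataFrame({"customer": ["Ana", None, "Cara"]})

# dropna() returns a new DataFrame; raw itself is not modified.
cleaned = raw.dropna()

print(len(raw))      # 3
print(len(cleaned))  # 2
```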
Where to go next
Now you can inspect and clean a dataset. The lab is next: an end-to-end practice session where you apply all of these skills to a new dataset without step-by-step instructions.
Data quality
Missing values, duplicates, wrong types, and outliers contaminate almost every real dataset — learn to recognise them before they corrupt your analysis.
Lab: explore a dataset
Apply inspection and cleaning end-to-end on a new dataset — no step-by-step instructions, just prompts and a starter block.