Lab: explore a dataset
Apply inspection and cleaning end-to-end on a new dataset — no step-by-step instructions, just prompts and a starter block.
- Apply .head(), .shape, .dtypes, and .describe() to a new dataset
- Identify and fix missing values, duplicates, and wrong types without guidance
This is an optional lab. No new concepts — just practice applying everything from the Data Fundamentals module to a dataset you have not seen before. Work through the prompts, run the code, and check that your cleaned DataFrame makes sense.
The dataset below is a small record of library book loans. It has all four data quality problems from the previous lesson. Your job is to find them and fix them.
The dataset
Before changing anything, answer these questions by adding print() calls above:
- What is the shape of the DataFrame?
- What are the dtypes? Which columns are the wrong type?
- Does .describe() reveal anything suspicious?
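As a rough sketch of what those inspection calls look like, the block below builds a small hypothetical stand-in for the loan records. The values and the loan_id column name are assumptions for illustration; only member_id and days_on_loan are names taken from this lab.

```python
import pandas as pd

# Hypothetical stand-in for the loan records; values are illustrative only.
df = pd.DataFrame({
    "loan_id": [101, 102, 103, 103, 105],          # assumed column name
    "member_id": ["M01", None, "M03", "M03", "M05"],
    "days_on_loan": ["14", "7", "21", "21", "999"],  # numbers stored as text
})

print(df.head())      # first rows, to see what the values look like
print(df.shape)       # (rows, columns)
print(df.dtypes)      # a text dtype on days_on_loan is a warning sign
print(df.describe())  # by default summarises numeric columns only
```

Note that days_on_loan does not appear in the .describe() output here. A column you expect to be numeric going missing from that summary is itself a hint that its type is wrong.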
Step 1 — check for missing values
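Find out where data is missing. df.isnull().sum() tells you how many
nulls are in each column, and df[df.isnull().any(axis=1)] shows the rows
that contain them. Then use .dropna() to remove those rows, assigning the
result back to df because .dropna() returns a new DataFrame.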
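One way this step might look, using a tiny hypothetical frame rather than the lab dataset. The values and the loan_id column name are assumptions; member_id and returned are the columns this lab mentions.

```python
import pandas as pd

# Hypothetical frame with one null in member_id and one in returned.
df = pd.DataFrame({
    "loan_id": [101, 102, 103, 104],   # assumed column name
    "member_id": ["M01", None, "M03", "M04"],
    "returned": [True, True, None, False],
})

print(df.isnull().sum())               # count of nulls per column

# Rows where at least one column is null
rows_with_nulls = df[df.isnull().any(axis=1)]
print(rows_with_nulls)

# dropna() returns a new DataFrame and leaves df untouched
cleaned = df.dropna()
print(df.shape, "->", cleaned.shape)   # (4, 3) -> (2, 3)
```

Dropping rows is the simplest fix, and it is the one this lab asks for. On a larger dataset you would usually check how many rows you lose before committing to it.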
Step 2 — remove duplicates
Loan 103 appears twice. Use .drop_duplicates() to keep only the first occurrence.
Print the shape before and after.
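A minimal sketch of the before-and-after check, again on a hypothetical frame. The loan_id and title column names and all values are assumptions for illustration.

```python
import pandas as pd

# Hypothetical frame in which loan 103 is recorded twice.
df = pd.DataFrame({
    "loan_id": [101, 103, 103, 104],   # assumed column name
    "title": ["Dune", "Emma", "Emma", "Ulysses"],  # assumed column
})

print(df.duplicated().sum())           # 1 row repeats an earlier row
print("before:", df.shape)             # before: (4, 2)

# keep="first" is the default; it is spelled out here for clarity
deduped = df.drop_duplicates(keep="first")
print("after:", deduped.shape)         # after: (3, 2)
```

By default .drop_duplicates() only treats rows as duplicates when every column matches. If two records share a loan number but differ elsewhere, they are kept, which is usually what you want until you have looked at them.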
Step 3 — fix types and investigate the outlier
Convert days_on_loan to int. Then check the maximum — loan 105 has 999 days,
which looks like a sentinel value. Filter it out with boolean indexing:
df[df["days_on_loan"] < 100].
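The conversion and the filter could be sketched as follows. The frame is a hypothetical stand-in with assumed values and an assumed loan_id column name, and the threshold of 100 is the one suggested above.

```python
import pandas as pd

# Hypothetical frame: days_on_loan holds digits stored as text.
df = pd.DataFrame({
    "loan_id": [101, 104, 105],        # assumed column name
    "days_on_loan": ["14", "7", "999"],
})

# Convert text to integers; this raises an error if any value is missing
# or not a whole number, so it belongs after the missing-value step.
df["days_on_loan"] = df["days_on_loan"].astype(int)

print(df["days_on_loan"].max())        # 999

# Boolean indexing: keep only rows where the condition is True
cleaned = df[df["days_on_loan"] < 100]
print(cleaned)
```

Checking the maximum only works once the column is numeric. While the values are still text, comparisons are alphabetical, so "999" and "7" would not sort the way you expect.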
After cleaning you have 5 rows from 8. You dropped 1 row with a missing
member_id, 1 with a missing returned value, 1 duplicate, and 1 sentinel
outlier. Every step was deliberate — you looked at the data, diagnosed the
problem, and chose the appropriate fix.
Done?
You just ran a complete data inspection and cleaning pass — the same workflow a data scientist applies to every new dataset. The next module, Exploring Data, picks up here: with clean data in hand, you will calculate statistics and group your results to find patterns.