Filtering and grouping
Use boolean indexing to filter rows and .groupby().agg() to compute group summaries in pandas.
- Filter a DataFrame using boolean indexing
- Use .groupby() with .agg() to compute summaries per group
- Combine filtering and grouping in a single analysis
Now that you understand split-apply-combine conceptually, it is time to write it.
Two operations — boolean indexing and .groupby() — cover the vast majority of
exploratory analysis work. The DataFrame is the central object for both.
Boolean indexing
df["region"] == "North" produces a boolean Series — one True or False per
row. Passing that Series inside df[...] keeps only the rows where the value is
True. This is called boolean indexing (or boolean masking).
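The two steps described above can be sketched as follows. The DataFrame contents are invented for illustration; only the "region" column name comes from the text, and "amount" is assumed to be a numeric sales column.

```python
import pandas as pd

# Hypothetical sales data, invented for this example
df = pd.DataFrame({
    "region": ["North", "South", "North", "East"],
    "amount": [120.0, 80.0, 200.0, 50.0],
})

# The comparison yields a boolean Series: one True or False per row
mask = df["region"] == "North"

# Passing the mask inside df[...] keeps only the rows where it is True
north = df[mask]
print(north)
```

Note that the filtered rows keep their original index labels; the original DataFrame is left unchanged.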
You can combine conditions:
- & means AND: both conditions must be true
- | means OR: either condition must be true
- ~ negates a condition
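Always wrap each condition in parentheses when combining them. In Python, & and | bind more tightly than comparison operators such as == and >, so without parentheses the expression is grouped differently from what you intended, which usually ends in a confusing error.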
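A minimal sketch of the three operators, each with its conditions parenthesized. The data and the thresholds are made up for the example.

```python
import pandas as pd

# Hypothetical sales data, invented for this example
df = pd.DataFrame({
    "region": ["North", "South", "North", "East"],
    "amount": [120.0, 80.0, 200.0, 50.0],
})

# AND: North rows whose amount is above 150
big_north = df[(df["region"] == "North") & (df["amount"] > 150)]

# OR: rows that are in the East or have an amount below 100
east_or_small = df[(df["region"] == "East") | (df["amount"] < 100)]

# NOT: every row except the North ones
not_north = df[~(df["region"] == "North")]
```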
.groupby().agg()
.groupby("region") tells pandas to split the DataFrame by the "region" column.
["amount"] selects the column to aggregate. .agg(["sum", "mean", "count"])
applies three aggregations at once and returns a DataFrame with one row per group
and one column per aggregation.
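The chain described above might look like this on a small, made-up DataFrame. The values are assumptions chosen so the results are easy to check by hand.

```python
import pandas as pd

# Hypothetical sales data, invented for this example
df = pd.DataFrame({
    "region": ["North", "South", "North", "South"],
    "amount": [100.0, 80.0, 200.0, 40.0],
})

# Split by region, select "amount", apply three aggregations at once
summary = df.groupby("region")["amount"].agg(["sum", "mean", "count"])
print(summary)
```

Here summary has the regions as its index and the columns "sum", "mean", and "count", in the order they were listed.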
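You can pass a single string ("mean"), which returns a Series with one value per group, or a list, which returns a DataFrame with one column per name. Common aggregation names: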
"sum", "mean", "median", "min", "max", "count", "std".
Grouping by multiple columns, as in groupby(["region", "product"]), creates one row
per unique combination that appears in the data. The index becomes a MultiIndex, which
you can flatten with .reset_index() if you need a plain DataFrame.
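As a sketch of multi-column grouping, assuming a made-up DataFrame with "region", "product", and "amount" columns:

```python
import pandas as pd

# Hypothetical sales data, invented for this example
df = pd.DataFrame({
    "region": ["North", "North", "South", "South", "North"],
    "product": ["A", "B", "A", "A", "A"],
    "amount": [100.0, 50.0, 80.0, 20.0, 30.0],
})

# One result row per (region, product) pair present in the data
totals = df.groupby(["region", "product"])["amount"].sum()

# The result is indexed by a MultiIndex; reset_index() turns the
# index levels back into ordinary columns
flat = totals.reset_index()
print(flat)
```

Only three combinations occur in this data, so the result has three rows; there is no row for South and product B.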
A common pattern is to filter first, then group. For example: exclude returned sales, then compute totals by region. Filtering removes unwanted rows before the split step, so the groups are already clean when you aggregate.
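The filter-then-group pattern could be sketched like this. The boolean "returned" column is a hypothetical way of marking returned sales; your dataset may record returns differently.

```python
import pandas as pd

# Hypothetical sales data; "returned" is an assumed boolean flag column
df = pd.DataFrame({
    "region": ["North", "North", "South", "South"],
    "amount": [100.0, 50.0, 80.0, 20.0],
    "returned": [False, True, False, False],
})

# Step 1: filter out returned sales
kept = df[~df["returned"]]

# Step 2: group the remaining rows and total the amounts per region
totals = kept.groupby("region")["amount"].sum()
print(totals)
```

Keeping the filter as a separate, named step makes it easy to inspect the intermediate rows before aggregating.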
Where to go next
You have all the core tools. The analysis lab is next: a longer, guided practice session where you load, clean, filter, and group a dataset to answer real questions.