Filtering and grouping

Use boolean indexing to filter rows and .groupby().agg() to compute group summaries in pandas.

Now that you understand split-apply-combine conceptually, it is time to write it. Two operations — boolean indexing and .groupby() — cover the vast majority of exploratory analysis work. The DataFrame is the central object for both.


import pandas as pd

data = {
  "sale_id":  [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
  "region":   ["North", "South", "North", "East", "South", "North", "East", "South", "East", "North"],
  "product":  ["Widget", "Gadget", "Widget", "Gadget", "Widget", "Gadget", "Widget", "Gadget", "Widget", "Widget"],
  "amount":   [120, 85, 200, 45, 130, 95, 60, 110, 75, 180],
  "returned": [False, False, True, False, False, False, True, False, False, False],
}
df = pd.DataFrame(data)

# --- Boolean indexing ---
# A condition produces a Series of True/False values
is_north = df["region"] == "North"
print("=== North region only ===")
print(df[is_north])

# Combine conditions with & (and) or | (or); wrap each comparison in parentheses.
# ~ negates a boolean Series, so ~df["returned"] selects the sales that were not returned.
print("\n=== North, not returned ===")
north_kept = df[(df["region"] == "North") & ~df["returned"]]
print(north_kept)

# --- groupby + agg ---
print("\n=== Sum, mean, and count of amount by region ===")
summary = df.groupby("region")["amount"].agg(["sum", "mean", "count"])
print(summary)

print("\n=== Mean amount by region and product ===")
print(df.groupby(["region", "product"])["amount"].mean())

Boolean indexing

df["region"] == "North" produces a boolean Series — one True or False per row. Passing that Series inside df[...] keeps only the rows where the value is True. This is called boolean indexing (or boolean masking).
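It can help to look at the mask on its own before using it. The sketch below uses a small made-up table (not the lesson's dataset), and the names mask, n_matches, and north are illustrative only. Because True counts as 1, summing the mask gives the number of matching rows.

```python
import pandas as pd

# Small illustrative table; the values are made up for this sketch.
df = pd.DataFrame({
    "region": ["North", "South", "North", "East"],
    "amount": [120, 85, 200, 45],
})

# The comparison is evaluated row by row and yields a boolean Series.
mask = df["region"] == "North"
print(mask.tolist())   # [True, False, True, False]

# True counts as 1, so the sum is the number of matching rows.
n_matches = int(mask.sum())
print(n_matches)       # 2

# Indexing with the mask keeps only the rows marked True.
# The original row labels (0 and 2) are preserved.
north = df[mask]
print(north)
```

Note that the filtered result keeps the original index labels; nothing is renumbered unless you ask for it.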

You can combine conditions:

& means AND — both conditions must be true
| means OR — either condition must be true
~ negates a condition

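Always wrap each comparison in parentheses when combining them. In Python, & and | bind more tightly than comparison operators such as == and >, so df["region"] == "North" & df["amount"] > 100 is evaluated as though "North" & df["amount"] came first, and pandas raises an error (typically a TypeError) instead of filtering. Also use & and | rather than the keywords and and or, which do not work element-wise on a Series.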
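The three operators can be sketched side by side as follows. The table is a small made-up sample and the variable names are illustrative; the point is only to show where the parentheses go.

```python
import pandas as pd

# Small illustrative table; the values are made up for this sketch.
df = pd.DataFrame({
    "region": ["North", "South", "North", "East", "South"],
    "amount": [120, 85, 200, 45, 130],
    "returned": [False, False, True, False, False],
})

# OR: rows that are in the North region, or have an amount above 100, or both.
north_or_large = df[(df["region"] == "North") | (df["amount"] > 100)]

# NOT: negate a whole comparison by wrapping it in parentheses first.
not_north = df[~(df["region"] == "North")]

# AND with a negated boolean column: large sales that were not returned.
kept_large = df[(df["amount"] > 100) & ~df["returned"]]

print(north_or_large)
print(not_north)
print(kept_large)
```

A column that is already boolean, such as returned here, can be negated directly with ~ and needs no comparison against False.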

.groupby().agg()

.groupby("region") tells pandas to split the DataFrame by the "region" column. ["amount"] selects the column to aggregate. .agg(["sum", "mean", "count"]) applies three aggregations at once and returns a DataFrame with one row per group and one column per aggregation.

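You can pass a single string ("mean"), which returns a Series, or a list, which returns a DataFrame with one column per aggregation. Common aggregation names: "sum", "mean", "median", "min", "max", "count", "std". Note that "count" counts non-missing values only.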
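As a minimal sketch of the difference between passing a string and passing a list, assuming a small made-up table (the names mean_by_region and stats_by_region are illustrative):

```python
import pandas as pd

# Small illustrative table; the values are made up for this sketch.
df = pd.DataFrame({
    "region": ["North", "South", "North", "East", "South"],
    "amount": [120, 85, 200, 45, 130],
})

# A single aggregation name gives back a Series indexed by group.
mean_by_region = df.groupby("region")["amount"].agg("mean")
print(type(mean_by_region).__name__)   # Series
print(mean_by_region)

# A list of names gives back a DataFrame with one column per aggregation.
stats_by_region = df.groupby("region")["amount"].agg(["min", "max", "median"])
print(type(stats_by_region).__name__)  # DataFrame
print(stats_by_region)
```

The columns of the DataFrame appear in the same order as the names in the list.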

Grouping by multiple columns, as in groupby(["region", "product"]), creates one row per combination of values that appears in the data. The index of the result becomes a MultiIndex, which you can flatten with .reset_index() to turn the group keys back into ordinary columns of a plain DataFrame.
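The flattening step might look like the sketch below. The data is a small made-up sample, and grouped and flat are illustrative names.

```python
import pandas as pd

# Small illustrative table; the values are made up for this sketch.
df = pd.DataFrame({
    "region": ["North", "North", "South", "South", "North"],
    "product": ["Widget", "Gadget", "Widget", "Widget", "Widget"],
    "amount": [120, 95, 130, 70, 200],
})

# Grouping by two columns yields a Series whose index is a MultiIndex
# with one level per grouping column.
grouped = df.groupby(["region", "product"])["amount"].mean()
print(grouped)

# reset_index() moves both index levels back into ordinary columns,
# producing a plain DataFrame with a default integer index.
flat = grouped.reset_index()
print(flat)
```

Only combinations present in the data show up: there is no South/Gadget row here because the sample contains no such sale.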

A common pattern is to filter first, then group. For example: exclude returned sales, then compute totals by region. Filtering removes unwanted rows before the split step, so the groups are already clean when you aggregate.
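The filter-then-group pattern described above can be sketched like this, using the region, amount, and returned columns from the lesson's sample data. The names kept and totals are illustrative.

```python
import pandas as pd

# The region, amount, and returned columns from the lesson's sample data.
df = pd.DataFrame({
    "region": ["North", "South", "North", "East", "South",
               "North", "East", "South", "East", "North"],
    "amount": [120, 85, 200, 45, 130, 95, 60, 110, 75, 180],
    "returned": [False, False, True, False, False,
                 False, True, False, False, False],
})

# Step 1: filter. Keep only the sales that were not returned.
kept = df[~df["returned"]]

# Step 2: group the filtered rows and total the amount per region.
totals = kept.groupby("region")["amount"].sum()
print(totals)
```

With the two returned sales excluded, the totals come to 120 for East, 395 for North, and 325 for South.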

Where to go next

You have all the core tools. The analysis lab is next: a longer, guided practice session where you load, clean, filter, and group a dataset to answer real questions.
