Grouping concepts
Split-apply-combine is the pattern behind every group-level analysis — understand it once and groupby becomes obvious.
- Describe the split-apply-combine pattern in plain language
- Give a concrete example of when you would group data
- Define aggregation and distinguish it from filtering
The most common question in data analysis is not "what is the average value?" — it is "what is the average value within each group?" Total sales are interesting; sales by region are actionable. Average score across all students is a baseline; average score by track tells you where to focus.
The pattern that makes this possible is called split-apply-combine.
Split — apply — combine
Consider a table of monthly sales, where each row is one sale with a region
column:
| sale_id | region | amount |
|---|---|---|
| 1 | North | 120 |
| 2 | South | 85 |
| 3 | North | 200 |
| 4 | East | 45 |
| 5 | South | 130 |
| 6 | North | 95 |
Split: divide the table into sub-tables, one per region.
- North: rows 1, 3, 6
- South: rows 2, 5
- East: row 4
Apply: run the same calculation on each sub-table independently.
- North total: 120 + 200 + 95 = 415
- South total: 85 + 130 = 215
- East total: 45
Combine: collect the results into a new table.
| region | total |
|---|---|
| North | 415 |
| South | 215 |
| East | 45 |
That's it. Every groupby() operation in pandas follows this pattern, regardless
of whether the aggregation is a sum, a mean, a count, or something custom.
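The three steps above can be sketched in pandas as follows. This is a minimal illustration, assuming the sales table is loaded into a DataFrame named `sales`; the variable names (`pieces`, `totals_by_hand`, `totals`) are made up for this example. The first half spells out each step by hand, and the last line does the same work in a single groupby() call.

```python
import pandas as pd

# The example table from this lesson
sales = pd.DataFrame({
    "sale_id": [1, 2, 3, 4, 5, 6],
    "region": ["North", "South", "North", "East", "South", "North"],
    "amount": [120, 85, 200, 45, 130, 95],
})

# Split: iterating over a groupby yields one (key, sub-table) pair per region
pieces = {region: sub for region, sub in sales.groupby("region")}

# Apply: run the same calculation on each sub-table independently
totals_by_hand = {region: int(sub["amount"].sum()) for region, sub in pieces.items()}

# Combine: collect the per-group results into one object
manual = pd.Series(totals_by_hand, name="total")

# The same three steps in one call
totals = sales.groupby("region")["amount"].sum()
print(totals)
```

By default groupby() sorts the group keys, so the result lists East, North, South rather than the order in which regions first appear. Passing `sort=False` keeps the order of first appearance.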
What aggregation means
Aggregation is any calculation that takes multiple values and produces one. Sum, mean, median, count, min, and max are all aggregations. The key idea: after the apply step, each group has been compressed from many rows into a single summary value.
This is different from filtering, which keeps or removes individual rows without combining them. Filtering asks "which rows match a condition?"; aggregation asks "what is the summary of a group?"
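To make the contrast concrete, here is a small sketch using the same sales table. The threshold of 100 and the variable names are assumptions chosen only for illustration. Filtering returns some of the original rows unchanged; aggregation returns one new value per group.

```python
import pandas as pd

sales = pd.DataFrame({
    "sale_id": [1, 2, 3, 4, 5, 6],
    "region": ["North", "South", "North", "East", "South", "North"],
    "amount": [120, 85, 200, 45, 130, 95],
})

# Filtering: keep the rows that match a condition.
# Every surviving row is an original row, untouched.
large_sales = sales[sales["amount"] > 100]

# Aggregation: combine the rows of each group into one value.
# No original row survives; the result has one entry per region.
mean_by_region = sales.groupby("region")["amount"].mean()

print(large_sales)
print(mean_by_region)
```

Here `large_sales` still has the columns sale_id, region, and amount, while `mean_by_region` is indexed by region and holds only the computed means.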
You can also group by multiple columns — for example, region and product category. The split step creates one sub-table for each unique combination of values. The apply and combine steps work identically.
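A sketch of grouping by two columns is shown below. The `category` column and its values ("Tools", "Toys") are hypothetical, added to the lesson's table purely for illustration; they are not part of the original data.

```python
import pandas as pd

sales = pd.DataFrame({
    "region": ["North", "South", "North", "East", "South", "North"],
    # Hypothetical column, invented for this example
    "category": ["Tools", "Toys", "Tools", "Toys", "Toys", "Toys"],
    "amount": [120, 85, 200, 45, 130, 95],
})

# Passing a list of columns splits on each unique combination of values
totals = sales.groupby(["region", "category"])["amount"].sum()
print(totals)

# Look up one combination by its (region, category) pair
north_tools = totals.loc[("North", "Tools")]
```

With plain string columns like these, only the combinations that actually occur in the data get a group, so this result has four entries rather than one for every possible region and category pairing.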
When to reach for groupby
Reach for groupby whenever your question contains the word "by" or "per":
- Revenue by product
- Average rating per author
- Number of errors per hour
- Top score by student cohort
If your question is just "what is the total revenue?" with no grouping, a plain
.sum() is enough. The moment you need that number for each group separately,
split-apply-combine is the pattern.
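The difference between the two kinds of question can be sketched like this, again using the lesson's sales table. The variable names are illustrative, and the choice of count and max as the per-group summaries is an assumption made to echo the "number of ... per" and "top ... by" questions in the list above.

```python
import pandas as pd

sales = pd.DataFrame({
    "sale_id": [1, 2, 3, 4, 5, 6],
    "region": ["North", "South", "North", "East", "South", "North"],
    "amount": [120, 85, 200, 45, 130, 95],
})

# "What is the total revenue?" has no grouping, so a plain .sum() is enough
overall_total = sales["amount"].sum()

# "How many sales per region, and what is the top sale by region?"
# The words "per" and "by" signal a groupby
per_region = sales.groupby("region")["amount"].agg(["count", "max"])

print(overall_total)
print(per_region)
```

The first result is a single number; the second is a small table with one row per region and one column per aggregation.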
Check your understanding
- 1. In split-apply-combine, what happens in the "apply" step?
- 2. True or false: filtering and aggregation both reduce the number of rows in a DataFrame.
- 3. Which of these questions calls for groupby?
Where to go next
Next: filtering and grouping — the hands-on companion to this lesson. You will
use boolean indexing to filter rows and .groupby().agg() to compute group
summaries in pandas.