
Grouping concepts

Split-apply-combine is the pattern behind every group-level analysis — understand it once and groupby becomes obvious.

Data Science · Beginner · 6 min read
By the end of this lesson you will be able to:
  • Describe the split-apply-combine pattern in plain language
  • Give a concrete example of when you would group data
  • Define aggregation and distinguish it from filtering

The most common question in data analysis is not "what is the average value?" — it is "what is the average value within each group?" Total sales are interesting; sales by region are actionable. Average score across all students is a baseline; average score by track tells you where to focus.

The pattern that makes this possible is called split-apply-combine.

Split — apply — combine

Consider a table of monthly sales, where each row is one sale with a region column:

sale_id  region  amount
1        North   120
2        South   85
3        North   200
4        East    45
5        South   130
6        North   95

Split: divide the table into sub-tables, one per region.

  • North: rows 1, 3, 6
  • South: rows 2, 5
  • East: row 4

Apply: run the same calculation on each sub-table independently.

  • North total: 120 + 200 + 95 = 415
  • South total: 85 + 130 = 215
  • East total: 45

Combine: collect the results into a new table.

region  total
North   415
South   215
East    45
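The three steps can be sketched in plain Python, without any library, to make the mechanics visible. This is a minimal illustration using the six sales from the table; the variable names (`sales`, `groups`, `totals`) are chosen here for the example and are not part of any API.

```python
from collections import defaultdict

# The six sales from the table, as (sale_id, region, amount) tuples.
sales = [
    (1, "North", 120),
    (2, "South", 85),
    (3, "North", 200),
    (4, "East", 45),
    (5, "South", 130),
    (6, "North", 95),
]

# Split: collect the amounts for each region into its own list.
groups = defaultdict(list)
for sale_id, region, amount in sales:
    groups[region].append(amount)

# Apply: run the same calculation (a sum) on each group independently.
# Combine: gather the per-group results into one new dictionary.
totals = {region: sum(amounts) for region, amounts in groups.items()}

print(totals)  # {'North': 415, 'South': 215, 'East': 45}
```

The dictionary comprehension does the apply and combine steps together: `sum(amounts)` is the calculation, and the resulting dictionary is the combined table.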

That's it. Every groupby() operation in pandas follows this pattern, regardless of whether the aggregation is a sum, a mean, a count, or something custom.
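As a rough sketch of how this looks in pandas, the sales table can be rebuilt as a DataFrame and grouped in one line. The DataFrame name `sales` is an assumption made for this example; the column names come from the table.

```python
import pandas as pd

# The sales table from this lesson.
sales = pd.DataFrame({
    "sale_id": [1, 2, 3, 4, 5, 6],
    "region": ["North", "South", "North", "East", "South", "North"],
    "amount": [120, 85, 200, 45, 130, 95],
})

# groupby("region") is the split, .sum() is the apply, and pandas
# combines the per-region results into a Series indexed by region.
totals = sales.groupby("region")["amount"].sum()
print(totals)

# Swapping the aggregation changes only the apply step.
means = sales.groupby("region")["amount"].mean()
counts = sales.groupby("region")["amount"].count()
```

By default pandas sorts the group keys, so the result lists East, North, South rather than the order in which the regions first appear.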

What aggregation means

Aggregation is any calculation that takes multiple values and produces one. Sum, mean, median, count, min, and max are all aggregations. The key idea: after the apply step, each group, however many rows it contains, has been compressed into one summary value.

This is different from filtering, which keeps or removes individual rows without combining them. Filtering asks "which rows match a condition?"; aggregation asks "what is the summary of a group?"
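The contrast can be sketched on the sales table from this lesson. The threshold of 90 and the variable names are assumptions picked for illustration only.

```python
import pandas as pd

sales = pd.DataFrame({
    "sale_id": [1, 2, 3, 4, 5, 6],
    "region": ["North", "South", "North", "East", "South", "North"],
    "amount": [120, 85, 200, 45, 130, 95],
})

# Filtering: keep the rows that match a condition. Every surviving row
# is an original sale, with all of its columns unchanged.
big_sales = sales[sales["amount"] > 90]

# Aggregation: combine each region's rows into one summary value.
region_totals = sales.groupby("region")["amount"].sum()

print(big_sales)      # four original rows: sale_id 1, 3, 5, 6
print(region_totals)  # three summary values, one per region
```

The filtered result still has one row per sale; the aggregated result has one row per region, and none of its values is an original row.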

You can also group by multiple columns — for example, region and product category. The split step creates one sub-table for each unique combination of values. The apply and combine steps work identically.
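A minimal sketch of grouping by two columns follows. The `category` column and its values are hypothetical, invented here for illustration; the lesson's table has no such column.

```python
import pandas as pd

sales = pd.DataFrame({
    "region": ["North", "South", "North", "East", "South", "North"],
    # Hypothetical column: these category values are made up for this example.
    "category": ["Books", "Books", "Games", "Games", "Books", "Books"],
    "amount": [120, 85, 200, 45, 130, 95],
})

# Passing a list of columns splits on each unique (region, category) pair
# that occurs in the data.
totals = sales.groupby(["region", "category"])["amount"].sum()
print(totals)
```

With this made-up data the result has four groups, one for each region and category pair that actually appears, and it is indexed by both columns.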

When to reach for groupby

Reach for groupby whenever your question contains the word "by" or "per":

  • Revenue by product
  • Average rating per author
  • Number of errors per hour
  • Top score by student cohort

If your question is just "what is the total revenue?" with no grouping, a plain .sum() is enough. The moment you need that number for each group separately, split-apply-combine is the pattern.
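Here is a short sketch of the two kinds of question side by side, using the sales table from this lesson. The variable names are assumptions for the example.

```python
import pandas as pd

sales = pd.DataFrame({
    "sale_id": [1, 2, 3, 4, 5, 6],
    "region": ["North", "South", "North", "East", "South", "North"],
    "amount": [120, 85, 200, 45, 130, 95],
})

# "What is the total revenue?" has no "by" or "per": one number overall.
total_revenue = sales["amount"].sum()

# "What is the revenue by region?" needs the number for each group.
revenue_by_region = sales.groupby("region")["amount"].sum()

print(total_revenue)      # 675
print(revenue_by_region)
```

The per-region totals add up to the overall total, which is a quick sanity check that no rows were lost in the split.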

Check your understanding

Knowledge check

  1. In split-apply-combine, what happens in the "apply" step?
  2. True or false: Filtering and aggregation both reduce the number of rows in a DataFrame.
  3. Which of these questions calls for groupby?

Where to go next

Next: filtering and grouping — the hands-on companion to this lesson. You will use boolean indexing to filter rows and .groupby().agg() to compute group summaries in pandas.
