Lab: data pipeline

Build a small ETL pipeline — read CSV, validate, transform, group, and serialise to JSON.

Optional lab. You'll connect the four lessons of this module into one small pipeline: read rows, validate them, transform fields, group by city, and serialise the result to JSON. Each checkpoint builds on the previous one.

The data

Imagine you're ingesting a CSV export from a registration system. The raw rows arrive as dictionaries of strings — exactly what csv.DictReader produces:

Python — editable, runs in your browser

Notice the problems: names with whitespace, mixed-case cities, an empty name, and a score over 100. A real pipeline handles these before the data reaches any downstream system.

Checkpoint 1 — parse a row

Write parse_row to convert one raw dict (all strings) into a cleaned, typed record: strip the name, convert the score to int, and upper-case the city.

Parse and clean a CSV rowPython

Write parse_row(row) that returns a new dict with: "name" stripped of whitespace, "score" converted to int, and "city" stripped and upper-cased.

parse_row({"name": " Alice ", "score": "91", "city": "london"}) → {"name": "Alice", "score": 91, "city": "LONDON"}

Checkpoint 2 — validate a record

After parsing, filter out bad records. Write is_valid_record that returns True only when the name is non-empty (after stripping) and the score is between 0 and 100 inclusive.

Validate a parsed recordPython

Write is_valid_record(record) that returns True if record["name"] is non-empty (after stripping) and record["score"] is between 0 and 100 inclusive.

is_valid_record({"name": "Alice", "score": 85}) → Trueis_valid_record({"name": "", "score": 85}) → False

The pipeline so far

With both pieces in place, run the transformation and validation together. Verify that two bad rows are dropped:

Python — editable, runs in your browser

Four records survive — Alice, Bob, Carol, and Eve.

Checkpoint 3 — group by city

Now group the valid records. Write group_by_city that uses a collections.defaultdict(list) to bucket each record's name under its city key, and returns a plain dict.

Group records by cityPython

Write group_by_city(records) that returns a dict mapping each city string to a list of the names of records in that city. Use collections.defaultdict(list).

group_by_city([{"name":"Alice","score":91,"city":"LONDON"},{"name":"Carol","score":85,"city":"LONDON"}]) → {"LONDON": ["Alice", "Carol"]}

Checkpoint 4 — serialise the summary

The final step: take the grouped dict and produce a JSON string. Write serialize_summary that returns json.dumps(...) with sort_keys=True and indent=2, sorting the names lists first so output is deterministic.

Serialise the grouped summary to JSONPython

Write serialize_summary(groups) that takes a dict of {city: [names]} and returns a JSON string (indent=2, sort_keys=True). Sort each name list before serialising.

serialize_summary({"LONDON": ["Carol", "Alice"], "PARIS": ["Bob"]}) → {
  "LONDON": [
    "Alice",
    "Carol"
  ],
  "PARIS": [
    "Bob"
  ]
}

Putting it all together

Four green checks. Here is the complete pipeline assembled — this is the shape real ETL code takes, just at a larger scale:

Python — editable, runs in your browser

import json
from collections import defaultdict

def parse_row(row):
  return {
      "name":  row["name"].strip(),
      "score": int(row["score"]),
      "city":  row["city"].strip().upper(),
  }

def is_valid_record(record):
  return bool(record["name"].strip()) and 0 <= record["score"] <= 100

def group_by_city(records):
  groups = defaultdict(list)
  for r in records:
      groups[r["city"]].append(r["name"])
  return dict(groups)

def serialize_summary(groups):
  summary = {city: sorted(names) for city, names in groups.items()}
  return json.dumps(summary, sort_keys=True, indent=2)

# --- run the pipeline ---
raw_rows = [
  {"name": " Alice ",  "score": "91", "city": "london"},
  {"name": "Bob",      "score": "78", "city": "paris "},
  {"name": "  Carol",  "score": "85", "city": "LONDON"},
  {"name": "",         "score": "42", "city": "berlin"},
  {"name": "Dave",     "score": "110","city": "paris"},
  {"name": "Eve",      "score": "67", "city": "berlin"},
]

records = [parse_row(r) for r in raw_rows]
valid   = [r for r in records if is_valid_record(r)]
groups  = group_by_city(valid)
output  = serialize_summary(groups)
print(output)

This is ETL in miniature: Extract (parse raw strings), Transform (validate, normalise, group), Load (serialise to a target format). The same pattern scales to thousands of rows and richer schemas — only the domain logic inside each step changes.

Finished reading? Mark it complete to track your progress.

The data

Checkpoint 1 — parse a row

Checkpoint 2 — validate a record

The pipeline so far

Checkpoint 3 — group by city

Checkpoint 4 — serialise the summary

Putting it all together

On this page