Lab: Optimise a utility

Profile a provided file processor, identify the memory hotspot, convert to streaming, and verify it handles 1 million lines.

This lab provides a file processor that reads an entire log file into memory before doing any work. Your task: profile it, find the hotspot, and convert it to a streaming pipeline. The final version must handle a one-million-line generated input without running out of memory.

The starting utility

Run this first to understand what it does and to establish the memory baseline:

Python — editable, runs in your browser

import tracemalloc

# ----- Provided utility — do not change this cell -----

def parse_log(lines):
    """Parse raw log lines into dicts. Returns a list."""
    records = []
    for line in lines:
        parts = line.strip().split("|")
        if len(parts) == 3:
            records.append({
                "ts": parts[0].strip(),
                "level": parts[1].strip(),
                "msg": parts[2].strip(),
            })
    return records   # <-- entire list kept in memory

def filter_errors(records):
    """Keep only ERROR records. Returns a list."""
    return [r for r in records if r["level"] == "ERROR"]

def format_output(records):
    """Format records for display. Returns a list."""
    return ["[{ts}] {msg}".format(**r) for r in records]

def process_log(lines):
    parsed = parse_log(lines)
    errors = filter_errors(parsed)
    return format_output(errors)

# ----- Baseline profile with 10 000 lines -----
def make_lines(n):
    lines = []
    for i in range(n):
        level = "ERROR" if i % 10 == 0 else "INFO"
        lines.append(f"2025-06-01 12:{i % 60:02d}:00 | {level} | event {i}")
    return lines

tracemalloc.start()
sample = make_lines(10_000)
results = process_log(sample)
snapshot = tracemalloc.take_snapshot()
current, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()

print(f"Current: {current / 1024:.0f} KB")
print(f"Peak:    {peak / 1024:.0f} KB")
print(f"Results: {len(results)} error lines")
print()
stats = snapshot.statistics("lineno")
print("Top 5 allocations:")
for s in stats[:5]:
    print(" ", s)

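The snapshot is taken after process_log returns, so the top statistics list only what is still alive at that point: the input lines built in make_lines and the formatted strings from format_output. The intermediate lists from parse_log and filter_errors have already been freed by then, and their cost shows up as the gap between Peak and Current. That gap is the hotspot. parse_log builds a dict for every input line, and that list stays alive while filter_errors and format_output build their own lists, so all three coexist at some point during execution.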
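One way to see what each stage holds on its own is to measure it in isolation while its return value is still referenced. The sketch below is an illustration under assumptions, not part of the provided utility: measure_retained is a hypothetical helper, and the input is built before tracing starts so that it is not counted.

```python
import tracemalloc

def parse_log(lines):
    """List-returning parser, same logic as the provided utility."""
    records = []
    for line in lines:
        parts = line.strip().split("|")
        if len(parts) == 3:
            records.append({
                "ts": parts[0].strip(),
                "level": parts[1].strip(),
                "msg": parts[2].strip(),
            })
    return records

def filter_errors(records):
    """List-returning filter, same logic as the provided utility."""
    return [r for r in records if r["level"] == "ERROR"]

def measure_retained(func, arg):
    """Hypothetical helper: return func(arg) and the traced bytes it still holds."""
    tracemalloc.start()
    before, _ = tracemalloc.get_traced_memory()
    result = func(arg)
    after, _ = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return result, after - before

# Build the input before tracing starts so it is not counted.
lines = [
    f"2025-06-01 12:{i % 60:02d}:00 | {'ERROR' if i % 10 == 0 else 'INFO'} | event {i}"
    for i in range(10_000)
]

parsed, parsed_bytes = measure_retained(parse_log, lines)
errors, error_bytes = measure_retained(filter_errors, parsed)

print(f"parse_log retains:     {parsed_bytes / 1024:.0f} KB for {len(parsed)} records")
print(f"filter_errors retains: {error_bytes / 1024:.0f} KB for {len(errors)} records")
```

The exact figures depend on the Python version, but the list of dicts from parse_log should dominate, since it holds one dict per input line while the filtered list only holds references to a tenth of them.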

Step 1 — Convert parse_log to a generator

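Remove the records list and yield each dict at the point where the original appended it. The function body then has no return statement at all.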


The function now yields one dict at a time. The caller controls how many records are in memory simultaneously.
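A minimal sketch of the converted parse_log, matching the version used in the final test later in this lab. The demo lines and the names demo, records, first, and rest are illustrative assumptions.

```python
def parse_log(lines):
    """Yield one parsed record at a time instead of returning a list."""
    for line in lines:
        parts = line.strip().split("|")
        if len(parts) == 3:
            yield {
                "ts": parts[0].strip(),
                "level": parts[1].strip(),
                "msg": parts[2].strip(),
            }

# Illustrative input: two well-formed lines and one malformed line.
demo = [
    "2025-06-01 12:00:00 | ERROR | disk full",
    "malformed line",
    "2025-06-01 12:01:00 | INFO | started",
]

records = parse_log(demo)   # nothing is parsed yet
first = next(records)       # parses only the first line
rest = list(records)        # parses the remainder on demand

print(first)
print(len(rest))
```

Calling parse_log does no work by itself; each dict is built only when the caller asks for the next item, and the malformed line is skipped just as in the list version.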

Step 2 — Convert filter_errors and format_output

Both remaining stages can become generators in the same way. Each loops over the incoming records and yields its result instead of building a list.
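A minimal sketch of the two converted stages, matching the versions used in the final test later in this lab. The sample_records list is an illustrative assumption standing in for the output of parse_log.

```python
def filter_errors(records):
    """Yield only the records whose level is ERROR."""
    for r in records:
        if r["level"] == "ERROR":
            yield r

def format_output(records):
    """Yield one formatted display line per record."""
    for r in records:
        yield "[{ts}] {msg}".format(**r)

# Illustrative records, shaped like the dicts parse_log produces.
sample_records = [
    {"ts": "2025-06-01 12:00:00", "level": "ERROR", "msg": "disk full"},
    {"ts": "2025-06-01 12:01:00", "level": "INFO", "msg": "started"},
]

formatted = list(format_output(filter_errors(sample_records)))
print(formatted)
```

Because each stage accepts any iterable and returns an iterator, the stages can be nested directly, and no intermediate list is built between them.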


Step 3 — Verify with 1 million lines

The final test generates one million lines and processes them with tracemalloc active. Because nothing is buffered, peak memory should stay well under 1 MB:

Python — editable, runs in your browser

import tracemalloc

def parse_log(lines):
    for line in lines:
        parts = line.strip().split("|")
        if len(parts) == 3:
            yield {
                "ts": parts[0].strip(),
                "level": parts[1].strip(),
                "msg": parts[2].strip(),
            }

def filter_errors(records):
    for r in records:
        if r["level"] == "ERROR":
            yield r

def format_output(records):
    for r in records:
        yield "[{ts}] {msg}".format(**r)

def process_log(lines):
    return format_output(filter_errors(parse_log(lines)))

# Generator that produces 1 000 000 lines without storing them
def generate_lines(n):
    for i in range(n):
        level = "ERROR" if i % 10 == 0 else "INFO"
        yield f"2025-06-01 12:{i % 60:02d}:{i % 60:02d} | {level} | event {i}"

tracemalloc.start()

count = 0
for _ in process_log(generate_lines(1_000_000)):
    count += 1

current, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()

print(f"Processed: {count:,} error lines from 1,000,000 total")
print(f"Current:   {current / 1024:.0f} KB")
print(f"Peak:      {peak / 1024:.0f} KB")

One million lines, constant memory. The output count should be 100,000 (every tenth line is an ERROR). Peak allocation should be under 1 MB, because only the generator frames, the current dict, and the current output string are alive at any moment.
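Constant memory is only useful if the results are unchanged, so it can be worth comparing the two styles on a small input. The sketch below is a condensed illustration rather than the lab's own code: parse_fields, process_log_buffered, and process_log_streaming are hypothetical names, and the stages are written as list comprehensions and generator expressions to keep the example short.

```python
def parse_fields(line):
    """Hypothetical helper: one line to a record dict, or None if malformed."""
    parts = line.strip().split("|")
    if len(parts) != 3:
        return None
    return {
        "ts": parts[0].strip(),
        "level": parts[1].strip(),
        "msg": parts[2].strip(),
    }

def process_log_buffered(lines):
    """List-based pipeline: every stage materialises its full output."""
    parsed = [r for r in map(parse_fields, lines) if r is not None]
    errors = [r for r in parsed if r["level"] == "ERROR"]
    return ["[{ts}] {msg}".format(**r) for r in errors]

def process_log_streaming(lines):
    """Generator-based pipeline: every stage produces items on demand."""
    parsed = (r for r in map(parse_fields, lines) if r is not None)
    errors = (r for r in parsed if r["level"] == "ERROR")
    return ("[{ts}] {msg}".format(**r) for r in errors)

def make_lines(n):
    """Same line format as the baseline generator in this lab."""
    return [
        f"2025-06-01 12:{i % 60:02d}:00 | {'ERROR' if i % 10 == 0 else 'INFO'} | event {i}"
        for i in range(n)
    ]

lines = make_lines(1_000)
buffered = process_log_buffered(lines)
streamed = list(process_log_streaming(lines))

print(buffered == streamed, len(streamed))   # True 100
```

The only difference a caller sees is the return type: the streaming version hands back an iterator, so code that needs len() or indexing has to wrap it in list() first.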

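In a real CLI, generate_lines would be replaced by an open file object, which yields one line at a time when iterated. Open the file in a with block and consume the pipeline inside that block, so the file is closed once processing finishes. The pipeline is identical; only the source changes. This is why generator pipelines compose so cleanly with file input.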
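A minimal sketch of the file-backed variant, assuming a UTF-8 log file. process_file is a hypothetical wrapper, and the temporary access.log written here exists only so the example runs on its own.

```python
import os
import tempfile

def parse_log(lines):
    for line in lines:
        parts = line.strip().split("|")
        if len(parts) == 3:
            yield {
                "ts": parts[0].strip(),
                "level": parts[1].strip(),
                "msg": parts[2].strip(),
            }

def filter_errors(records):
    for r in records:
        if r["level"] == "ERROR":
            yield r

def format_output(records):
    for r in records:
        yield "[{ts}] {msg}".format(**r)

def process_log(lines):
    return format_output(filter_errors(parse_log(lines)))

def process_file(path):
    """Hypothetical wrapper: stream formatted error lines from a log file."""
    # The file stays open only while the caller is still iterating.
    with open(path, encoding="utf-8") as fh:
        yield from process_log(fh)

# Write a tiny throwaway log so the example is self-contained.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "access.log")
    with open(path, "w", encoding="utf-8") as fh:
        fh.write("2025-06-01 12:00:00 | ERROR | disk full\n")
        fh.write("2025-06-01 12:01:00 | INFO | started\n")
    file_errors = list(process_file(path))

print(file_errors)
```

Keeping the with block inside a generator ties the file's lifetime to the iteration: the file is closed when the pipeline is exhausted, or when the generator is closed early.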

What you practised

You profiled a buffering utility, identified its intermediate lists as the memory hotspot, and converted each function to a generator by yielding each item instead of collecting it in a list. The result processes arbitrarily large input in constant memory and produces the same formatted lines, although process_log now returns an iterator rather than a list. That is the core of the streaming refactor pattern in CLI tools.

Where to go next

Next module: text user interfaces — moving from batch-processing CLIs to persistent terminal applications with textual.
