Lab: Optimise a utility

Profile a provided file processor, identify the memory hotspot, convert to streaming, and verify it handles 1 million lines.

Lab · optional · Utilities · Advanced · 30 min
By the end of this lesson you will be able to:
  • Profile a buffering utility with tracemalloc
  • Identify the list accumulation responsible for peak allocation
  • Convert the function to a generator pipeline
  • Verify the streaming version handles 1 million lines without OOM

This lab provides a file processor that reads an entire log file into memory before doing any work. Your task: profile it, find the hotspot, and convert it to a streaming pipeline. The final version must handle a one-million-line generated input without running out of memory.

The starting utility

Run this first to understand what it does and to establish the memory baseline:
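
A minimal sketch of what such a buffering utility could look like. The function names parse_log, filter_errors, format_output, and generate_lines come from this lab; the log line format (timestamp, level, message separated by spaces), the record fields, and the 50,000-line input size are assumptions made for illustration.

```python
import tracemalloc


def generate_lines(n):
    """Yield n synthetic log lines; every tenth line is an ERROR."""
    for i in range(n):
        level = "ERROR" if i % 10 == 0 else "INFO"
        # Assumed line format: "<timestamp> <level> <message>"
        yield f"2024-01-01T00:00:00 {level} message {i}"


def parse_log(lines):
    """Buffering version: build the full list of records before returning."""
    records = []
    for line in lines:
        timestamp, level, message = line.split(" ", 2)
        records.append({"timestamp": timestamp, "level": level, "message": message})
    return records


def filter_errors(records):
    """Buffering version: collect every ERROR record into a second list."""
    errors = []
    for record in records:
        if record["level"] == "ERROR":
            errors.append(record)
    return errors


def format_output(records):
    """Buffering version: collect every formatted string into a third list."""
    output = []
    for record in records:
        output.append(f"{record['timestamp']} {record['message']}")
    return output


tracemalloc.start()
records = parse_log(generate_lines(50_000))
errors = filter_errors(records)
output = format_output(errors)
snapshot = tracemalloc.take_snapshot()  # must be taken while tracing is active
current, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()

print(f"{len(output)} error lines, peak {peak / 1024:.0f} KiB")
for stat in snapshot.statistics("lineno")[:5]:
    print(stat)  # largest allocation sites, grouped by source line
```

The statistics are grouped by source line, so the lines that build records, errors, and output each appear as their own allocation site.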

You should see three separate list allocations in the top statistics: one each for parse_log, filter_errors, and format_output. The peak is far higher than the work requires because all three lists are alive at the same time at some point during execution.

Step 1 — Convert parse_log to a generator

Yield each record as it is parsed instead of appending it to a list and returning records at the end:
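
A possible generator version of parse_log, shown as a sketch. The three-field line format and the dict keys are assumptions, not the lab's actual data format.

```python
def parse_log(lines):
    """Streaming version: yield each record as soon as it is parsed."""
    for line in lines:
        # Assumed line format: "<timestamp> <level> <message>"
        timestamp, level, message = line.split(" ", 2)
        yield {"timestamp": timestamp, "level": level, "message": message}


sample = [
    "2024-01-01T00:00:00 ERROR disk full",
    "2024-01-01T00:00:01 INFO started",
]
records = parse_log(sample)  # no line has been parsed yet
first = next(records)        # parses only the first line
print(first)
```

Calling parse_log does no work by itself; each next() call (or each loop iteration in the consumer) parses exactly one line.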

The function now yields one dict at a time. The caller controls how many records are in memory simultaneously.

Step 2 — Convert filter_errors and format_output

Both can become generators with the same change:
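
One way the two remaining stages might look as generators. The record fields and the output format are illustrative assumptions; sample_records is a hypothetical stand-in for the output of parse_log.

```python
def filter_errors(records):
    """Yield only ERROR records, one at a time."""
    for record in records:
        if record["level"] == "ERROR":
            yield record


def format_output(records):
    """Yield one formatted string per record."""
    for record in records:
        yield f"{record['timestamp']} {record['message']}"


# Hypothetical hand-built records standing in for parse_log output.
sample_records = [
    {"timestamp": "2024-01-01T00:00:00", "level": "ERROR", "message": "disk full"},
    {"timestamp": "2024-01-01T00:00:01", "level": "INFO", "message": "started"},
    {"timestamp": "2024-01-01T00:00:02", "level": "ERROR", "message": "timeout"},
]

# Stages are chained by passing one generator to the next.
result = list(format_output(filter_errors(iter(sample_records))))
print(result)
```

Each stage pulls one record from the stage before it, so no intermediate list is ever built; list() is used here only to display the small sample.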

Step 3 — Verify with 1 million lines

The final test: generate one million lines and process them with tracemalloc active. Peak traced memory should stay under 1 MB and should not grow with the input size:
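
A sketch of such a verification run, assuming the same illustrative line format as before. The measure helper is a hypothetical name introduced here; it counts the output lines without storing them.

```python
import tracemalloc


def generate_lines(n):
    """Yield n synthetic log lines; every tenth line is an ERROR."""
    for i in range(n):
        level = "ERROR" if i % 10 == 0 else "INFO"
        yield f"2024-01-01T00:00:00 {level} message {i}"


def parse_log(lines):
    for line in lines:
        # Assumed line format: "<timestamp> <level> <message>"
        timestamp, level, message = line.split(" ", 2)
        yield {"timestamp": timestamp, "level": level, "message": message}


def filter_errors(records):
    for record in records:
        if record["level"] == "ERROR":
            yield record


def format_output(records):
    for record in records:
        yield f"{record['timestamp']} {record['message']}"


def measure(n):
    """Run the pipeline over n lines; return (output count, peak traced bytes)."""
    tracemalloc.start()
    pipeline = format_output(filter_errors(parse_log(generate_lines(n))))
    count = sum(1 for _ in pipeline)  # consume lazily, keep nothing
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return count, peak


count, peak = measure(1_000_000)
print(f"{count:,} error lines, peak {peak / 1024:.1f} KiB")
```

Calling measure with a smaller n and comparing the two peaks is a quick way to confirm that memory use does not scale with the input.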

One million lines, constant memory. The output count should be 100,000 (every tenth line is an ERROR). Peak allocation should be under 1 MB — the generator machinery, the current dict, and the current output string.

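In a real CLI, generate_lines would be replaced by iteration over an open file, for example for line in f inside with open("access.log") as f, so that the file is closed when the pipeline finishes. The pipeline is identical; only the source changes. A file object already yields one line at a time, which is why generator pipelines compose so cleanly with file input.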
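
A small sketch of the same pipeline reading from a file. A temporary file stands in for access.log, and the line format is again an assumption; the only change to the stages is stripping the trailing newline that file iteration preserves.

```python
import os
import tempfile


def parse_log(lines):
    for line in lines:
        # Lines read from a file keep their trailing newline.
        timestamp, level, message = line.rstrip("\n").split(" ", 2)
        yield {"timestamp": timestamp, "level": level, "message": message}


def filter_errors(records):
    for record in records:
        if record["level"] == "ERROR":
            yield record


def format_output(records):
    for record in records:
        yield f"{record['timestamp']} {record['message']}"


# Write a tiny stand-in log file (hypothetical content).
with tempfile.NamedTemporaryFile(
    "w", suffix=".log", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("2024-01-01T00:00:00 ERROR disk full\n")
    tmp.write("2024-01-01T00:00:01 INFO started\n")
    path = tmp.name

try:
    # The file object is the line source; it is read lazily, line by line.
    with open(path, encoding="utf-8") as f:
        formatted = list(format_output(filter_errors(parse_log(f))))
finally:
    os.remove(path)

print(formatted)
```

The pipeline has to be consumed inside the with block, because the generators read from the file only when they are iterated. A real tool would typically print each line as it arrives instead of collecting them with list().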

What you practised

You profiled a buffering utility, identified three separate list allocations as the hotspot, and converted each function to a generator by yielding items instead of collecting them in a list. The result processes arbitrarily large input in constant memory without changing the observable output. That is the core of the streaming refactor pattern in CLI tools.

Where to go next

Next module: text user interfaces, moving from batch-processing CLIs to persistent terminal applications with Textual.
