Lab: Optimise a utility
Profile a provided file processor, identify the memory hotspot, convert it to a streaming pipeline, and verify that it handles 1 million lines.
- Profile a buffering utility with tracemalloc
- Identify the list accumulation responsible for peak allocation
- Convert the function to a generator pipeline
- Verify the streaming version handles 1 million lines without OOM
This lab provides a file processor that reads an entire log file into memory before doing any work. Your task: profile it, find the hotspot, and convert it to a streaming pipeline. The final version must handle a one-million-line generated input without running out of memory.
The starting utility
Run this first to understand what it does and to establish the memory baseline:
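A minimal sketch of such a buffering utility, for readers who want something runnable at this point. The names parse_log, filter_errors, format_output, and generate_lines come from the lab text. Everything else is an assumption made for illustration: the space-separated "timestamp level message" line format, the dict keys, and the profile_buffered helper.

```python
import tracemalloc


def generate_lines(n):
    """Hypothetical input source: every tenth line is an ERROR."""
    for i in range(n):
        level = "ERROR" if i % 10 == 0 else "INFO"
        yield f"2024-01-01T00:00:00 {level} message {i}"


def parse_log(lines):
    # Buffering version: builds the full list of dicts before returning.
    records = []
    for line in lines:
        timestamp, level, message = line.rstrip("\n").split(" ", 2)
        records.append({"timestamp": timestamp, "level": level, "message": message})
    return records


def filter_errors(records):
    # Second full list: references to every ERROR record.
    errors = []
    for record in records:
        if record["level"] == "ERROR":
            errors.append(record)
    return errors


def format_output(records):
    # Third full list: one formatted string per ERROR record.
    output = []
    for record in records:
        output.append(f"{record['timestamp']} {record['message']}")
    return output


def profile_buffered(n):
    """Run the buffered pipeline under tracemalloc (assumed helper)."""
    tracemalloc.start()
    lines = list(generate_lines(n))  # stands in for reading the whole file
    records = parse_log(lines)
    errors = filter_errors(records)
    output = format_output(errors)
    # Snapshot while all intermediate lists are still alive.
    snapshot = tracemalloc.take_snapshot()
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return len(output), peak, snapshot.statistics("lineno")[:5]


if __name__ == "__main__":
    count, peak, top = profile_buffered(100_000)
    print(f"{count} error lines, peak {peak / 1e6:.1f} MB")
    for stat in top:
        print(stat)
```

The snapshot is taken before any intermediate list goes out of scope, so the per-line statistics still show where each stage allocated.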
You should see allocations from all three functions in the top statistics: parse_log, filter_errors, and format_output each build their own list. The peak is a multiple of what the work actually requires because all three lists, plus the buffered input, are alive at the same time at some point during execution.
Step 1 — Convert parse_log to a generator
Remove the records list and yield each parsed dict from inside the loop, instead of appending it and returning the list at the end:
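One way this conversion might look, as a sketch. The line format ("timestamp level message", space-separated) and the dict keys are assumptions for illustration, not the lab's actual record layout.

```python
def parse_log(lines):
    """Yield one parsed record per input line instead of building a list."""
    for line in lines:
        # Assumed format: "<timestamp> <level> <message>"
        timestamp, level, message = line.rstrip("\n").split(" ", 2)
        yield {"timestamp": timestamp, "level": level, "message": message}


if __name__ == "__main__":
    records = parse_log(["t1 ERROR disk full", "t2 INFO ok"])
    print(next(records))  # only the first line has been parsed so far
```

Calling parse_log no longer does any work by itself. Parsing happens one line at a time as the caller pulls records with next() or a for loop.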
The function now yields one dict at a time. The caller controls how many records are in memory simultaneously.
Step 2 — Convert filter_errors and format_output
Both can become generators with the same change:
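A sketch of the two remaining stages as generators. The "level", "timestamp", and "message" keys and the output format are assumptions for illustration.

```python
def filter_errors(records):
    """Pass through only the records whose level is ERROR."""
    for record in records:
        if record["level"] == "ERROR":
            yield record


def format_output(records):
    """Yield one formatted string per record (assumed output format)."""
    for record in records:
        yield f"{record['timestamp']} {record['message']}"


if __name__ == "__main__":
    sample = [
        {"timestamp": "t1", "level": "ERROR", "message": "disk full"},
        {"timestamp": "t2", "level": "INFO", "message": "ok"},
    ]
    print(list(format_output(filter_errors(sample))))  # ['t1 disk full']
```

Each stage accepts any iterable and yields lazily, so the stages can be chained without any of them holding more than one item at a time.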
Step 3 — Verify with 1 million lines
The final test: generate one million lines and process them with tracemalloc active. Peak memory should stay under 1 MB and should not grow with the input size:
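A self-contained sketch of this check. The generate_lines body, the line format, and the run_streaming helper are assumptions for illustration. The pipeline functions are repeated so the script runs on its own.

```python
import tracemalloc


def generate_lines(n):
    """Hypothetical input source: every tenth line is an ERROR."""
    for i in range(n):
        level = "ERROR" if i % 10 == 0 else "INFO"
        yield f"2024-01-01T00:00:00 {level} message {i}"


def parse_log(lines):
    for line in lines:
        timestamp, level, message = line.rstrip("\n").split(" ", 2)
        yield {"timestamp": timestamp, "level": level, "message": message}


def filter_errors(records):
    for record in records:
        if record["level"] == "ERROR":
            yield record


def format_output(records):
    for record in records:
        yield f"{record['timestamp']} {record['message']}"


def run_streaming(n):
    """Count output lines and report peak traced memory in bytes."""
    tracemalloc.start()
    pipeline = format_output(filter_errors(parse_log(generate_lines(n))))
    # Consume the pipeline without storing its output.
    count = sum(1 for _ in pipeline)
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return count, peak


if __name__ == "__main__":
    count, peak = run_streaming(1_000_000)
    print(f"{count} error lines, peak {peak / 1024:.1f} KiB")
```

Counting with sum() over a generator expression matters here: wrapping the pipeline in list() would rebuild the buffer the refactor removed.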
One million lines, constant memory. The output count should be 100,000 (every tenth line is an ERROR). Peak allocation, as reported by tracemalloc.get_traced_memory(), should be under 1 MB: the generator machinery, the current dict, and the current output string.
In a real CLI, generate_lines would be replaced by iteration over an open file, ideally inside a with block so the file is closed when processing ends: with open("access.log") as f, then for line in f. The pipeline is identical; only the source changes. This is why generator pipelines compose so cleanly with file input.
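A sketch of the same pipeline reading from a file. The stream_file helper, the line format, and the sample contents are assumptions for illustration. The demo writes a small temporary access.log so it does not depend on a file existing on disk.

```python
import os
import tempfile


def parse_log(lines):
    for line in lines:
        timestamp, level, message = line.rstrip("\n").split(" ", 2)
        yield {"timestamp": timestamp, "level": level, "message": message}


def filter_errors(records):
    for record in records:
        if record["level"] == "ERROR":
            yield record


def format_output(records):
    for record in records:
        yield f"{record['timestamp']} {record['message']}"


def stream_file(path):
    """Run the pipeline over a file, reading one line at a time."""
    with open(path, encoding="utf-8") as handle:
        # A file object is an iterable of lines, so it plugs in directly.
        yield from format_output(filter_errors(parse_log(handle)))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "access.log")
        with open(path, "w", encoding="utf-8") as f:
            f.write("t1 ERROR disk full\nt2 INFO ok\n")
        for line in stream_file(path):
            print(line)  # prints "t1 disk full"
```

The with block sits inside the generator, so the file stays open only while the caller is consuming lines and is closed once the generator finishes.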
What you practised
You profiled a buffering utility, identified three separate list allocations as the hotspot, and converted each function to a generator by yielding each item instead of appending it to a list. The result processes arbitrarily large input in constant memory without changing the observable output. That is the core of the streaming refactor pattern in CLI tools.
Where to go next
Next module: text user interfaces, moving from batch-processing CLIs to persistent terminal applications with Textual.