Don't buffer what you can stream
Why reading an entire file into memory crashes large-input tools, and how generator-based streaming fixes it.
- Explain why buffering a 10 GB file crashes a CLI tool
- Describe how generators produce values on demand
- Identify the Unix line-at-a-time processing pattern
The most common performance mistake in CLI tools is not a slow algorithm — it is loading the entire input into memory before doing any work. A tool that processes log files works fine in development (100 KB logs) and silently fails in production (50 GB logs).
The cost of buffering
Consider a log processor that does this:

    with open("access.log") as f:
        lines = f.readlines()  # entire file in RAM
    for line in lines:
        process(line)

readlines() allocates a Python list containing every line as a string object.
A 10 GB file consumes at least 10 GB of RAM for the raw bytes, plus Python's
object overhead, in practice 20–30 GB. On a server with 16 GB RAM, the process
is killed by the OS before it finishes.
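The per-line overhead can be estimated directly. The sketch below is a rough illustration rather than a measurement of any real log: the sample line is hypothetical, and the sizes reported by sys.getsizeof are specific to CPython and vary between versions.

```python
import struct
import sys

# Hypothetical log line, used only for illustration.
line = "10.0.0.1 - - GET /index.html 200\n"

raw_bytes = len(line.encode("utf-8"))     # bytes the line occupies on disk
object_bytes = sys.getsizeof(line)        # size of the str object (CPython-specific)
pointer_bytes = struct.calcsize("P")      # one list slot: a pointer to the str

in_list_bytes = object_bytes + pointer_bytes
overhead = in_list_bytes - raw_bytes

print(f"{raw_bytes} raw bytes -> {in_list_bytes} bytes held in a list "
      f"({overhead} bytes of overhead per line)")
```

The overhead is a fixed cost per line, so the shorter the lines, the larger the multiplier over the raw file size.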
The fix requires no new dependencies and no architectural change:

    with open("access.log") as f:
        for line in f:  # one line at a time
            process(line)

The file object is an iterator. Each iteration yields one line, which is processed and then discarded. Memory stays small and roughly constant regardless of file size: it is bounded by the longest line plus a small read buffer. The runtime is essentially the same, because the work is identical, but peak allocation drops from O(n) to O(1) in the size of the file.
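One way to see the difference is to measure peak allocation with the standard library's tracemalloc module. This is a minimal sketch under stated assumptions: the helper names (peak_bytes, count_buffered, count_streamed, make_sample_log) and the sample log contents are invented for illustration, and the temporary file is far smaller than a production log.

```python
import os
import tempfile
import tracemalloc


def peak_bytes(func, path):
    """Run func(path) and return the peak traced allocation in bytes."""
    tracemalloc.start()
    try:
        func(path)
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return peak


def count_buffered(path):
    with open(path) as f:
        lines = f.readlines()        # whole file held in a list
    return sum(1 for _ in lines)


def count_streamed(path):
    with open(path) as f:
        return sum(1 for _ in f)     # only the current line is kept


def make_sample_log(n_lines=20_000):
    """Write a hypothetical sample log to a temp file and return its path."""
    fd, path = tempfile.mkstemp(suffix=".log")
    with os.fdopen(fd, "w") as f:
        for i in range(n_lines):
            f.write(f"10.0.0.{i % 256} - - GET /page/{i} 200\n")
    return path


sample = make_sample_log()
try:
    buffered_peak = peak_bytes(count_buffered, sample)
    streamed_peak = peak_bytes(count_streamed, sample)
finally:
    os.remove(sample)

print(f"buffered: {buffered_peak:,} bytes, streamed: {streamed_peak:,} bytes")
```

The exact numbers depend on the Python version and platform, but the buffered peak grows with the number of lines while the streamed peak should stay roughly flat.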
Generators extend the pattern
A function that yields values instead of building a list is a generator. The
caller pulls one item at a time; the generator runs until it hits yield, then
pauses. No items accumulate:

    def parse_log_lines(path):
        with open(path) as f:
            for line in f:
                if line.strip():
                    yield line.rstrip()

    def extract_ips(lines):
        for line in lines:
            yield line.split()[0]  # first field is the IP

    for ip in extract_ips(parse_log_lines("access.log")):
        record(ip)

Each stage in the pipeline processes one line, passes it to the next stage, and immediately forgets it. You can chain ten stages this way and peak memory is still O(1 line).
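The pull-driven behaviour can be observed directly. In the sketch below, numbered_lines is a hypothetical generator that records which lines it has actually read, and io.StringIO stands in for a real file. The consumer asks for two items, so the generator reads only two lines.

```python
import io
from itertools import islice


def numbered_lines(stream, read_log):
    """Yield stripped lines, recording each line number as it is read."""
    for n, line in enumerate(stream, start=1):
        read_log.append(n)           # runs only when the consumer asks for line n
        yield line.rstrip("\n")


stream = io.StringIO("a\nb\nc\nd\ne\n")  # stand-in for a large file
read_log = []

# Pull just the first two items; the generator pauses after the second yield.
first_two = list(islice(numbered_lines(stream, read_log), 2))

print(first_two)   # ['a', 'b']
print(read_log)    # [1, 2]
```

Lines three to five are never touched, because nothing downstream asked for them.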
The Unix philosophy connection
Unix tools such as grep, awk, and sed are line-at-a-time processors
connected by pipes. grep does not read the entire input before printing matches;
it prints each match as it finds it. (sort is a deliberate exception: it must
read all of its input before it can emit the first line.) Your tool, when used
in a pipeline, should behave like grep.
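Iterating with for line in sys.stdin achieves this automatically:

    import sys

    for line in sys.stdin:
        result = process(line.rstrip())
        print(result)

Each result is produced as soon as its input line arrives. When stdout is a terminal, the output appears immediately; when stdout is a pipe, Python buffers output in blocks, so pass flush=True to print if downstream tools need each result right away. The tool composes naturally with other Unix tools, handles infinite streams, and never holds more than one input line in memory.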
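The same loop is easier to test when the streams are passed in as parameters. This is a sketch, not the article's implementation: run_filter is a hypothetical name, and process here is a stand-in that uppercases its input.

```python
import io


def process(line):
    """Hypothetical stand-in for the article's process(): uppercase the line."""
    return line.upper()


def run_filter(source, sink):
    """Read lines from source, write one processed line to sink per input line."""
    for line in source:
        sink.write(process(line.rstrip("\n")) + "\n")
        sink.flush()                 # push each result downstream immediately


# Demonstration with in-memory streams instead of real stdin/stdout.
demo_in = io.StringIO("get /a\nget /b\n")
demo_out = io.StringIO()
run_filter(demo_in, demo_out)

print(demo_out.getvalue(), end="")
```

In a real tool the call would be run_filter(sys.stdin, sys.stdout). Flushing after every line trades some throughput for latency, so it is worth doing only when a downstream consumer needs results promptly.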
Some operations genuinely require the entire input: sorting, deduplication, computing a median. When you need to buffer, be explicit about it and document the memory requirement. The problem is when buffering happens accidentally — not when it happens intentionally.
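Intentional buffering might look like the sketch below. The function name and the log format (response time as the last field) are assumptions made for illustration. The docstring states the memory cost, and only the parsed numbers are kept, not the raw lines.

```python
import statistics


def median_response_time(lines):
    """Return the median of the last field of each non-blank line.

    Memory: O(n) in the number of lines. A median needs every value,
    so this function buffers one float per line on purpose.
    """
    # Intentional buffering: keep the parsed floats, discard the raw lines.
    times = [float(line.split()[-1]) for line in lines if line.strip()]
    return statistics.median(times)


sample = ["GET /a 0.12", "GET /b 0.30", "GET /c 0.18"]
print(median_response_time(sample))   # 0.18
```

The input can still be a generator or an open file, so the stages before the median stay streaming and only this one step holds state.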
Where to go next
Next: streaming in practice — a side-by-side Runnable comparing a list-buffering function to its generator equivalent, with memory measurements.