Computer Architecture
How a CPU actually executes your code — the fetch-decode-execute cycle, registers, the memory hierarchy, and why cache locality is the single biggest performance lever you control.
- Describe the fetch-decode-execute cycle and what happens at each stage
- Distinguish registers, L1/L2/L3 cache, RAM, and disk by speed and capacity
- Explain what pipelining is and why branch mispredictions are costly
- Define cache locality and give a concrete example of code that exploits it
When you write a loop in Python or C, you probably think of it as "doing work some number of times." The CPU sees something very different: a stream of individual instructions, each manipulating a handful of bytes inside a tiny set of storage cells called registers. Everything about performance — and a surprising amount about correctness — flows from understanding how that stream moves through the hardware.
The fetch-decode-execute cycle
Every CPU core repeats the same three-step cycle billions of times per second:
- Fetch — The program counter (PC) register holds the address of the next instruction. The CPU loads that instruction from memory into an instruction register.
- Decode — Specialised circuits parse the binary instruction: which operation is this? Which registers are the operands? Is there an immediate value embedded in the instruction?
- Execute — The arithmetic-logic unit (ALU) or another functional unit performs the operation and writes the result back to a register (or to memory). The PC advances — or, for a branch, jumps — to the next instruction's address.
This cycle, in its pure form, takes one clock cycle per instruction on an ideal machine. Real CPUs are far from ideal, and the gap between the theoretical maximum and what code actually achieves is where almost all performance work happens.
Registers vs cache vs RAM vs disk
The memory hierarchy exists because fast storage is expensive and physically small:
Level Size Latency Notes
----------- ---------- ----------- --------------------------------
Registers ~1 KB total 0 cycles On-chip; ALU operands live here
L1 cache 32–64 KB ~4 cycles Per core; instruction + data split
L2 cache 256 KB–1 MB ~12 cycles Per core; unified
L3 cache 4–32 MB ~40 cycles Shared across cores on same chip
DRAM (RAM) 4–64 GB ~200 cycles Main memory; off-chip
SSD 256 GB–4 TB ~50 000 cycles NVMe, block-based
HDD 1–20 TB ~10 000 000 cycles Mechanical seekThe numbers are approximate and vary by CPU generation, but the ratios are the lesson: an L1 cache hit is fifty times faster than a DRAM access. A disk seek is fifty thousand times slower. When you hear someone say "this algorithm is fast," the real question is often "fast on what level of the hierarchy?"
Registers are the only operands the ALU can directly use. All computation reads from and writes to registers; loads and stores move data between registers and the cache/memory system. A modern desktop CPU has sixteen to thirty-two general-purpose integer registers and a similar number of floating-point/SIMD registers.
Cache is a set of small, fast memories that sit between the registers and DRAM. Cache operates on cache lines — fixed-size blocks, typically 64 bytes. When a load instruction misses the L1 cache, the hardware fetches an entire 64- byte line from the next level of the hierarchy, even if you only wanted 4 bytes. This prefetches nearby data, which is great if your access pattern is sequential (you're likely to use those nearby bytes) and wasteful if it isn't.
An L1 cache miss costs roughly 40–200 cycles depending on whether the data is
in L2, L3, or DRAM. A function that does 5 arithmetic operations but suffers
one DRAM miss per iteration is bottlenecked on memory, not the CPU. Profilers
that count cache misses (Linux perf, Intel VTune) reveal this directly.
Pipelining and branch prediction
To get more work done per clock, every modern CPU uses a pipeline: it overlaps the fetch, decode, and execute stages of multiple instructions. While instruction N is executing, instruction N+1 is being decoded and instruction N+2 is being fetched. A typical desktop pipeline has 14–20 stages, meaning up to 20 instructions are in flight simultaneously.
The pipeline breaks down at branch instructions (if, while, loop
conditions). When the CPU fetches the instruction after a branch, it doesn't yet
know which way the branch will go — the condition hasn't been evaluated yet. The
solution is branch prediction: the CPU guesses which direction the branch
will take based on historical patterns, and starts executing speculatively along
the predicted path.
If the prediction is correct, no time is lost. If it is wrong, the speculatively executed instructions must be discarded (pipeline flush), and execution restarts from the correct path. A misprediction on a 20-stage pipeline throws away roughly 15 cycles of work. At a billion iterations per second, a 1% misprediction rate costs tens of millions of wasted cycles per second.
// Branch-heavy — predictor struggles with random data
for (int i = 0; i < N; i++) {
if (data[i] > 128) sum += data[i]; // hard to predict if data is random
}
// Branch-free equivalent — always faster on random data
for (int i = 0; i < N; i++) {
sum += (data[i] > 128) ? data[i] : 0; // modern CPUs can often make this branch-free
}Compilers know about this and generate branch-free sequences using conditional
move instructions (cmov on x86) when they can prove it's safe.
Why cache locality matters in your code
Cache locality is the single most actionable architectural concept for application developers. It comes in two flavours:
Temporal locality — if you access a value, you are likely to access it again soon. Keep hot data in variables (registers/L1) rather than re-loading it from memory. Loop invariants pulled outside a loop exploit temporal locality.
Spatial locality — if you access one address, you are likely to access nearby addresses soon. Processing data in memory order exploits the fact that the entire cache line was already fetched.
The classic demonstration is matrix traversal. A matrix stored in row-major order
has rows laid out contiguously in memory. Iterating row-by-row (inner loop over
columns) is cache-friendly: you march through memory sequentially. Iterating
column-by-column (inner loop over rows) skips num_columns × element_size bytes
between accesses — every access is likely a cache miss.
# Row-major traversal — cache friendly (accesses elements in memory order)
total = 0
for row in matrix:
for val in row:
total += val
# Column-major traversal — cache hostile (jumps across rows in memory)
total = 0
for col in range(len(matrix[0])):
for row in range(len(matrix)):
total += matrix[row][col]On a large matrix this difference can be 5–10x in wall-clock time — not because the algorithm changes, but because the access pattern changes how many cache misses occur.
Data structure layout also matters. A struct-of-arrays (SoA) layout groups all values of one field together; an array-of-structs (AoS) layout groups all fields of one record together. If a hot loop reads only one field, SoA gives better spatial locality. If a loop reads all fields of each record in turn, AoS is better. Game engines and high-performance numerical code choose SoA or AoS deliberately for exactly this reason.
The CPU you are writing for today has out-of-order execution, hardware prefetchers, and speculative execution — mechanisms that mask some of the penalties described here. Do not guess: measure with a profiler before restructuring data layouts. Premature optimisation around cache misses in code that is not actually bottlenecked on memory wastes time and reduces readability.
Where to go next
Understanding how the CPU sees your code makes OS-level concepts click into place — processes, threads, context switching, and virtual memory all interact directly with the register and cache state described here. The Operating Systems track continues the story from where the hardware leaves off.
Knowledge check
- 1.In the fetch-decode-execute cycle, what does the program counter (PC) register hold?
- 2.When the CPU reads a single byte from DRAM, only that one byte is transferred into the cache.
- 3.Which of the following programming practices improve cache locality?