Computer Architecture

How a CPU actually executes your code — the fetch-decode-execute cycle, registers, the memory hierarchy, and why cache locality is the single biggest performance lever you control.

Computer Architecture11 min read

By the end of this lesson you will be able to:

Describe the fetch-decode-execute cycle and what happens at each stage
Distinguish registers, L1/L2/L3 cache, RAM, and disk by speed and capacity
Explain what pipelining is and why branch mispredictions are costly
Define cache locality and give a concrete example of code that exploits it

When you write a loop in Python or C, you probably think of it as "doing work some number of times." The CPU sees something very different: a stream of individual instructions, each manipulating a handful of bytes inside a tiny set of storage cells called registers. Everything about performance — and a surprising amount about correctness — flows from understanding how that stream moves through the hardware.

The fetch-decode-execute cycle

Every CPU core repeats the same three-step cycle billions of times per second:

Fetch — The program counter (PC) register holds the address of the next instruction. The CPU loads that instruction from memory into an instruction register.
Decode — Specialised circuits parse the binary instruction: which operation is this? Which registers are the operands? Is there an immediate value embedded in the instruction?
Execute — The arithmetic-logic unit (ALU) or another functional unit performs the operation and writes the result back to a register (or to memory). The PC advances — or, for a branch, jumps — to the next instruction's address.

This cycle, in its pure form, takes one clock cycle per instruction on an ideal machine. Real CPUs are far from ideal, and the gap between the theoretical maximum and what code actually achieves is where almost all performance work happens.

Registers vs cache vs RAM vs disk

The memory hierarchy exists because fast storage is expensive and physically small:

Level         Size            Latency         Notes
-----------   ----------      -----------     --------------------------------
Registers     ~1 KB total     0 cycles        On-chip; ALU operands live here
L1 cache      32–64 KB        ~4 cycles       Per core; instruction + data split
L2 cache      256 KB–1 MB     ~12 cycles      Per core; unified
L3 cache      4–32 MB         ~40 cycles      Shared across cores on same chip
DRAM (RAM)    4–64 GB         ~200 cycles     Main memory; off-chip
SSD           256 GB–4 TB     ~50 000 cycles  NVMe, block-based
HDD           1–20 TB         ~10 000 000 cycles  Mechanical seek

The numbers are approximate and vary by CPU generation, but the ratios are the lesson: an L1 cache hit is fifty times faster than a DRAM access. A disk seek is fifty thousand times slower. When you hear someone say "this algorithm is fast," the real question is often "fast on what level of the hierarchy?"

Registers are the only operands the ALU can directly use. All computation reads from and writes to registers; loads and stores move data between registers and the cache/memory system. A modern desktop CPU has sixteen to thirty-two general-purpose integer registers and a similar number of floating-point/SIMD registers.

Cache is a set of small, fast memories that sit between the registers and DRAM. Cache operates on cache lines — fixed-size blocks, typically 64 bytes. When a load instruction misses the L1 cache, the hardware fetches an entire 64- byte line from the next level of the hierarchy, even if you only wanted 4 bytes. This prefetches nearby data, which is great if your access pattern is sequential (you're likely to use those nearby bytes) and wasteful if it isn't.

An L1 cache miss costs roughly 40–200 cycles depending on whether the data is in L2, L3, or DRAM. A function that does 5 arithmetic operations but suffers one DRAM miss per iteration is bottlenecked on memory, not the CPU. Profilers that count cache misses (Linux perf, Intel VTune) reveal this directly.

Pipelining and branch prediction

To get more work done per clock, every modern CPU uses a pipeline: it overlaps the fetch, decode, and execute stages of multiple instructions. While instruction N is executing, instruction N+1 is being decoded and instruction N+2 is being fetched. A typical desktop pipeline has 14–20 stages, meaning up to 20 instructions are in flight simultaneously.

The pipeline breaks down at branch instructions (if, while, loop conditions). When the CPU fetches the instruction after a branch, it doesn't yet know which way the branch will go — the condition hasn't been evaluated yet. The solution is branch prediction: the CPU guesses which direction the branch will take based on historical patterns, and starts executing speculatively along the predicted path.

If the prediction is correct, no time is lost. If it is wrong, the speculatively executed instructions must be discarded (pipeline flush), and execution restarts from the correct path. A misprediction on a 20-stage pipeline throws away roughly 15 cycles of work. At a billion iterations per second, a 1% misprediction rate costs tens of millions of wasted cycles per second.

// Branch-heavy — predictor struggles with random data
for (int i = 0; i < N; i++) {
    if (data[i] > 128) sum += data[i];  // hard to predict if data is random
}

// Branch-free equivalent — always faster on random data
for (int i = 0; i < N; i++) {
    sum += (data[i] > 128) ? data[i] : 0;  // modern CPUs can often make this branch-free
}

Compilers know about this and generate branch-free sequences using conditional move instructions (cmov on x86) when they can prove it's safe.

Why cache locality matters in your code

Cache locality is the single most actionable architectural concept for application developers. It comes in two flavours:

Temporal locality — if you access a value, you are likely to access it again soon. Keep hot data in variables (registers/L1) rather than re-loading it from memory. Loop invariants pulled outside a loop exploit temporal locality.

Spatial locality — if you access one address, you are likely to access nearby addresses soon. Processing data in memory order exploits the fact that the entire cache line was already fetched.

The classic demonstration is matrix traversal. A matrix stored in row-major order has rows laid out contiguously in memory. Iterating row-by-row (inner loop over columns) is cache-friendly: you march through memory sequentially. Iterating column-by-column (inner loop over rows) skips num_columns × element_size bytes between accesses — every access is likely a cache miss.

# Row-major traversal — cache friendly (accesses elements in memory order)
total = 0
for row in matrix:
    for val in row:
        total += val

# Column-major traversal — cache hostile (jumps across rows in memory)
total = 0
for col in range(len(matrix[0])):
    for row in range(len(matrix)):
        total += matrix[row][col]

On a large matrix this difference can be 5–10x in wall-clock time — not because the algorithm changes, but because the access pattern changes how many cache misses occur.

Data structure layout also matters. A struct-of-arrays (SoA) layout groups all values of one field together; an array-of-structs (AoS) layout groups all fields of one record together. If a hot loop reads only one field, SoA gives better spatial locality. If a loop reads all fields of each record in turn, AoS is better. Game engines and high-performance numerical code choose SoA or AoS deliberately for exactly this reason.

The CPU you are writing for today has out-of-order execution, hardware prefetchers, and speculative execution — mechanisms that mask some of the penalties described here. Do not guess: measure with a profiler before restructuring data layouts. Premature optimisation around cache misses in code that is not actually bottlenecked on memory wastes time and reduces readability.

Where to go next

Understanding how the CPU sees your code makes OS-level concepts click into place — processes, threads, context switching, and virtual memory all interact directly with the register and cache state described here. The Operating Systems track continues the story from where the hardware leaves off.

Knowledge check

Finished reading? Mark it complete to track your progress.