Database Internals

SQL looks declarative: you say what you want, not how to get it. Behind the scenes the database engine is doing a remarkable amount of work — choosing access paths, maintaining indexes, writing to disk in a way that survives crashes, and serving dozens of concurrent queries without corrupting each other's data. This lesson opens the hood on the mechanisms that make all of that possible.

How B-tree indexes work

The default index type in virtually every relational database (PostgreSQL, MySQL, SQLite, SQL Server) is the B-tree (balanced tree). A B-tree stores key-value pairs in sorted order across a tree of fixed-size pages (typically 8 KB). Each internal node holds multiple keys and pointers to child pages; the leaves hold the actual keys paired with pointers to the heap rows.

                [30 | 60]
               /    |    \
        [10|20]  [40|50]  [70|80]
           |        |        |
         heap     heap     heap
         rows     rows     rows

Because the tree is balanced, every search path from root to leaf has the same length. For a table with n rows, the tree height is O(log_b n) where b is the branching factor (number of keys per node). A B-tree with 8 KB pages, 8-byte keys, and 8-byte pointers can fit roughly 500 key-pointer pairs per node. A billion-row table needs a tree of height log_500(1,000,000,000) ≈ 3.2 — four pages to access any row. Those four pages can fit in memory; the lookup is extremely fast even without caching.

A range scan (e.g., WHERE created_at BETWEEN '2024-01-01' AND '2024-03-31') also uses the B-tree efficiently. The engine descends to the first matching leaf, then follows the linked list of leaf pages (leaves are doubly-linked in a B+ tree) until the range is exhausted — without re-traversing the tree.

The cost of an index is the maintenance overhead: every INSERT, UPDATE, or DELETE must also update every index on that table. A table with ten indexes pays roughly ten times the write cost of a table with one index. Indexes are not free; add them where queries need them and remove those that are never used.

Heap scans vs index scans

The heap is where actual row data is stored — an unordered collection of pages containing the full rows. An index is a separate data structure that provides sorted access to a subset of columns plus pointers back to the heap.

A heap scan (also called a sequential scan or full table scan) reads every page of the heap in order. It is simple and predictable. When it is faster: if the query needs most of the rows (a low-selectivity condition like WHERE status != 'deleted' on a table where 90% of rows match), reading the heap sequentially is cheaper than bouncing around the heap following index pointers.

An index scan descends the B-tree to find matching key values, then fetches the heap rows by their pointers. Each pointer lookup is a random I/O into the heap. When most of the heap is cached this is fine; when the heap is larger than RAM, each random access may cause a disk seek.

A bitmap index scan (PostgreSQL, others) is a hybrid: traverse the index to build a bitmap of matching heap page numbers, sort the page numbers, then read the heap in order. This converts random I/O to sequential I/O at the cost of some memory for the bitmap.

An index-only scan reads data directly from the index leaf pages, never touching the heap — but only if every column in the query is stored in the index. This is where SELECT * hurts: adding a wildcard column that is not in the index forces the engine to fetch the heap, turning a fast index-only scan into a slower index-then-heap scan.

-- Index on (user_id) covers this query — index-only scan possible
SELECT user_id FROM orders WHERE user_id = 42;

-- SELECT * forces a heap fetch even though user_id is indexed
SELECT * FROM orders WHERE user_id = 42;

-- Covering index makes SELECT * equivalent-ish for read-heavy queries
CREATE INDEX idx_orders_user_covering ON orders (user_id) INCLUDE (status, total);

The query planner makes the decision between heap scan and index scan based on statistics it maintains about the data distribution. ANALYZE (PostgreSQL) or ANALYZE TABLE (MySQL) updates these statistics. If a planner is choosing a bad plan, stale statistics are the first thing to check.

ACID properties

Databases make four durability and consistency guarantees, collectively known as ACID:

Atomicity — a transaction is all-or-nothing. If you transfer money between two bank accounts (UPDATE accounts SET balance = balance - 100 WHERE id = 1; UPDATE accounts SET balance = balance + 100 WHERE id = 2;), either both updates commit together or neither does. A crash or error mid-transaction leaves the database as if the transaction never started.

Consistency — a committed transaction always moves the database from one valid state to another valid state, as defined by the schema's constraints (foreign keys, check constraints, not-null constraints). The database never commits a row that violates a foreign key reference.

Isolation — concurrent transactions do not see each other's uncommitted work. If transaction A is updating a row and transaction B reads that row simultaneously, B sees the row as it was before A started — not A's partial update. The degree of isolation (and the corresponding performance cost) is configurable via isolation levels (READ COMMITTED, REPEATABLE READ, SERIALIZABLE).

Durability — once a transaction is committed, its effects survive any subsequent crash. The database guarantees it can replay the committed data even if the server loses power the millisecond after COMMIT returns.

Write-ahead logging (WAL)

Durability is hard to implement efficiently. Writing data directly to its final location on disk — modifying a B-tree page in place — is dangerous: a crash mid-write leaves a corrupted page. And random writes to scattered disk locations are slow.

The solution is write-ahead logging (WAL). Before any page is modified, a description of the change (a log record) is written to an append-only WAL file on disk. Append operations are sequential and fast. Only after the log record is durably on disk does the engine modify the actual data page (which it may keep in a memory buffer for a while before flushing).

COMMIT transaction T
  → write log record: "T committed; row 42 changed from X to Y"
  → fsync (ensure log is on disk)
  → return success to client
  (data page update happens later, asynchronously)

On crash recovery, the engine replays the WAL from the last checkpoint forward. Any transaction marked committed in the log but not yet reflected in the data files is replayed (redo). Any transaction that was in-progress at crash time is rolled back (undo). The database returns to a consistent state without having lost any committed work.

WAL also powers streaming replication: the primary sends its WAL stream to replica servers, which replay it to stay in sync. The replica's data is guaranteed to be consistent precisely because WAL records are self-contained descriptions of state changes.

MVCC for concurrent readers and writers

A naive way to implement isolation is to lock every row being read. This is simple but slow: a long-running read holds a lock that blocks all writers.

Multi-Version Concurrency Control (MVCC) takes a different approach: instead of locking, the database keeps multiple versions of each row. When a transaction modifies a row, it writes a new version tagged with that transaction's timestamp (or transaction ID), leaving the old version in place. Other transactions that started before the modification see the old version; transactions that start after the commit see the new version.

Row 42 at t=100: { status: 'pending',  xmin: 50,  xmax: 110 }
Row 42 at t=110: { status: 'shipped',  xmin: 110, xmax: null }

Transaction started at t=105 reads Row 42 → sees 'pending' (old version still valid)
Transaction started at t=115 reads Row 42 → sees 'shipped' (new version)

The consequence: readers never block writers and writers never block readers. Concurrent reads and writes proceed in parallel without contention. The cost is storage: old row versions must be retained until no active transaction can still reference them. PostgreSQL's VACUUM process (and MySQL's InnoDB purge thread) periodically remove stale versions that no running transaction can see.

MVCC solves most read-write contention but does not eliminate all anomalies. Write skew (two transactions each read disjoint data and make decisions that together violate a constraint) can still occur at the REPEATABLE READ level. Full SERIALIZABLE isolation prevents write skew but uses predicate locks or optimistic concurrency checks that add overhead. Choose the isolation level deliberately — defaulting to SERIALIZABLE everywhere is rarely necessary and always costly.

Where to go next

Database engines build on nearly every concept in these tracks: they exploit the memory hierarchy (Computer Architecture) for buffer pool management, cross the user/kernel boundary (Operating Systems) on every disk write, and receive queries over network connections (Networking Fundamentals). The SQL track applies these internals directly: after reading this lesson, the SQL intermediate modules on indexes and query planning will make considerably more sense.

Knowledge check

Finished reading? Mark it complete to track your progress.

How B-tree indexes work

Heap scans vs index scans

ACID properties

Write-ahead logging (WAL)

MVCC for concurrent readers and writers

Where to go next

How B-tree indexes work

Heap scans vs index scans

ACID properties

Write-ahead logging (WAL)

MVCC for concurrent readers and writers

Where to go next

Knowledge check

On this page