Why distributed systems are hard
The moment a system spans more than one machine, new and unavoidable problems appear.
- Explain why a network changes everything
- List the fallacies of distributed computing
- Understand the CAP trade-off at a high level
Inside one process, function calls are effectively instant and reliable. Once your system spans two machines talking over a network, that reliability disappears and a whole category of hard problems appears. Much of the complexity in production systems lives here, so it pays to understand why it's hard before reaching for distributed designs.
The network changes everything
A local call either returns or your whole program crashes; either way, you know what happened. A network call has a third outcome that ruins everything: it may never answer. Did the request not arrive? Did it succeed but the reply get lost? To the caller, both look the same (silence until a timeout), and that ambiguity is the root of distributed-systems difficulty. It is also why a retried request must be safe to repeat (the idempotency point from the APIs lesson).
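To make the ambiguity concrete, here is a minimal Python simulation, not a real networking layer. `Server`, `call`, `Timeout`, and `charge_with_retry` are hypothetical names invented for this sketch, and the idempotency key is assumed to be a client-generated UUID that the server remembers. A lost request and a lost reply raise the same `Timeout`, so the client retries in both cases; only the idempotency key keeps the retry from charging twice.

```python
import uuid


class Timeout(Exception):
    """What the caller sees when no reply arrives in time, whatever the cause."""


class Server:
    """Hypothetical payment server used only for this simulation."""

    def __init__(self):
        self.balance = 100
        self.processed = {}  # idempotency key -> result already returned

    def charge(self, amount, key=None):
        # A repeated key means this is a retry: return the old result, don't re-charge.
        if key is not None and key in self.processed:
            return self.processed[key]
        self.balance -= amount
        result = {"ok": True, "balance": self.balance}
        if key is not None:
            self.processed[key] = result
        return result


def call(server, amount, key, drop_request=False, drop_reply=False):
    """Simulate one round trip; both failure modes look identical to the caller."""
    if drop_request:
        raise Timeout()  # request lost: the server never ran charge()
    result = server.charge(amount, key)
    if drop_reply:
        raise Timeout()  # reply lost: the server DID charge
    return result


def charge_with_retry(server, amount, key):
    try:
        # First attempt: simulate the worst case, a lost reply.
        return call(server, amount, key, drop_reply=True)
    except Timeout:
        # Lost request or lost reply? The caller can't tell, so it retries.
        return call(server, amount, key)


naive = Server()
charge_with_retry(naive, 30, key=None)
print(naive.balance)  # 40: the retry charged a second time

safe = Server()
charge_with_retry(safe, 30, key=str(uuid.uuid4()))
print(safe.balance)  # 70: the server recognized the retry
```

The deduplication happens on the server, not the client: the client can never learn which failure occurred, so the fix has to be making the operation safe to repeat.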
The fallacies of distributed computing
A famous list of assumptions that feel true and are all false:
- The network is reliable.
- Latency is zero.
- Bandwidth is infinite.
- The network is secure.
- Topology doesn't change. (...and more.)
Every one of these seems safe when you test locally and bites in production. Designing distributed systems is largely the discipline of not believing them.
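In code, not believing "the network is reliable" and "latency is zero" usually means every remote call gets a deadline and a bounded retry policy. The sketch below shows one common pattern, exponential backoff with "full jitter", in plain Python. `retry_with_backoff` and `FlakyService` are hypothetical names for this example; the operation is assumed to enforce its own deadline (for instance via a socket timeout), to raise `TimeoutError` or `ConnectionError` on failure, and to be idempotent, since a timeout doesn't tell you whether it ran.

```python
import random
import time


def retry_with_backoff(op, attempts=5, base_delay=0.1, max_delay=2.0,
                       sleep=time.sleep, rng=random.random):
    """Call op() until it succeeds, waiting longer after each failure.

    Waits use "full jitter": a random delay between 0 and the capped
    exponential delay, so many clients retrying at once don't hit the
    server in lockstep. sleep and rng are injectable for testing.
    """
    for attempt in range(attempts):
        try:
            return op()
        except (TimeoutError, ConnectionError):
            if attempt == attempts - 1:
                raise  # out of retries: surface the failure to the caller
            cap = min(max_delay, base_delay * (2 ** attempt))
            sleep(rng() * cap)


class FlakyService:
    """Hypothetical service that times out on its first few calls."""

    def __init__(self, failures):
        self.failures = failures
        self.calls = 0

    def fetch(self):
        self.calls += 1
        if self.calls <= self.failures:
            raise TimeoutError("no reply within deadline")
        return "data"


svc = FlakyService(failures=2)
waits = []
# Record waits instead of sleeping, and pin the jitter to its maximum.
result = retry_with_backoff(svc.fetch, sleep=waits.append, rng=lambda: 1.0)
print(result, svc.calls, waits)  # data 3 [0.1, 0.2]
```

Capping both the number of attempts and the delay keeps a dead dependency from stalling the caller indefinitely; the jitter spreads retries out so a brief outage doesn't turn into a synchronized retry storm.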
The CAP trade-off
When a network partition happens (machines can't reach each other — a when, not an if), a distributed data store must choose between:
- Consistency — every read sees the latest write (or an error), or
- Availability — every request gets an answer, possibly stale.
You can't have both during a partition — that's the heart of the CAP theorem. Real systems pick where to land: a bank leans consistent (refuse rather than show wrong balances); a social feed leans available (a slightly stale feed is fine). The point isn't the theorem's formalism — it's that these trade-offs are forced, so choose them deliberately.
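A toy model makes the forced choice visible. The sketch below is a hypothetical two-replica store in Python, not how any real database is built: `TwoReplicaStore`, `Unavailable`, and the `partitioned` flag are invented for illustration. With only two replicas there is no majority, so the CP mode simply refuses every request during a partition (real CP systems typically keep serving on the majority side). Reconciling diverged AP replicas after the partition heals is a separate problem not shown here.

```python
class Unavailable(Exception):
    """Raised by the CP store when it can't reach its peer during a partition."""


class TwoReplicaStore:
    """Toy store with replicas "a" and "b", to contrast CP and AP behavior.

    mode="CP": refuse reads and writes while partitioned, so no client
               ever sees a stale value (but some requests get errors).
    mode="AP": each replica answers from its own copy while partitioned,
               so every request gets an answer (but it may be stale).
    """

    def __init__(self, mode):
        assert mode in ("CP", "AP")
        self.mode = mode
        self.values = {"a": None, "b": None}
        self.partitioned = False

    def write(self, via, value):
        if self.partitioned:
            if self.mode == "CP":
                raise Unavailable(f"{via} can't reach its peer; refusing write")
            self.values[via] = value  # AP: accept locally; peer stays behind
            return
        for name in self.values:
            self.values[name] = value  # healthy network: replicate to both

    def read(self, via):
        if self.partitioned and self.mode == "CP":
            raise Unavailable(f"{via} can't confirm it has the latest write")
        return self.values[via]


for mode in ("CP", "AP"):
    store = TwoReplicaStore(mode)
    store.write("a", "balance=100")
    store.partitioned = True  # a and b can no longer talk
    try:
        store.write("a", "balance=50")
        print(mode, "read via b:", store.read("b"))
    except Unavailable as e:
        print(mode, "error:", e)
# Prints:
# CP error: a can't reach its peer; refusing write
# AP read via b: balance=100
```

The CP store returns an error instead of risking a wrong balance; the AP store always answers, but replica b hands back the old value.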
The strongest practical advice: avoid distributing until you must. A single, well-built service is dramatically simpler than a distributed one. Reach for distribution to solve a real scaling or reliability need — not by default.
Where to go next
The first hard problem in detail: keeping state and consistency straight when data lives in more than one place.