Designing for failure

In a single program, a failure usually crashes everything at once. In a distributed system, parts fail independently and constantly — a service is down, a network call hangs, a disk fills. The shift in mindset is decisive: stop assuming things work, and design for fault tolerance so that when a piece fails, the whole keeps serving.

Assume everything fails

At scale, something is always broken somewhere. A robust system treats every external call, whether to another service, a database, or a queue, as something that might be slow, fail, or never answer. Assuming otherwise (that the network is reliable, that latency is zero) is the mistake catalogued as the fallacies of distributed computing. Each such call needs a plan for the bad case, not just the happy path.

The core resilience patterns

A small toolkit handles most of it:

Timeouts: never wait forever. A call without a timeout can hang and exhaust your resources, turning one slow dependency into a full outage. Always bound the wait.
Retries with backoff: transient failures often succeed on a second try, but retry with increasing delays (and jitter) and a cap on attempts, or you'll hammer a struggling service and make it worse. Retry only idempotent operations (the consistency lesson), because a request that appeared to fail may already have taken effect.
Circuit breakers: if a dependency is clearly down, stop calling it for a while and fail fast, giving it room to recover instead of piling on.
Bulkheads: isolate resources so one overloaded part can't sink the rest — like watertight compartments in a ship.
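
As a rough illustration of two of these patterns, here is a minimal Python sketch of a retry helper with jittered exponential backoff and a simple circuit breaker. The names (`retry_with_backoff`, `CircuitBreaker`, `CircuitOpenError`) and the default thresholds are assumptions made for this example, not part of any particular library. For brevity it retries on any exception; real code would retry only errors known to be transient.

```python
import random
import time


class CircuitOpenError(Exception):
    """Raised when the breaker is open and the call is skipped."""


class CircuitBreaker:
    """Fail fast after `max_failures` consecutive failures (hypothetical helper)."""

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after  # seconds to wait before a trial call
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpenError("dependency marked down; failing fast")
            # Cool-down elapsed: let one trial call through (half-open).
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            raise
        # A success closes the circuit and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result


def retry_with_backoff(func, attempts=4, base_delay=0.1, max_delay=2.0,
                       sleep=time.sleep, rng=random.random):
    """Call `func` up to `attempts` times with jittered exponential backoff."""
    for attempt in range(attempts):
        try:
            return func()
        except CircuitOpenError:
            raise  # the breaker says stop; do not keep hammering
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            cap = min(max_delay, base_delay * 2 ** attempt)
            sleep(rng() * cap)  # random delay in [0, cap) spreads out retries
```

The two compose naturally: with a hypothetical `fetch_profile` function, `retry_with_backoff(lambda: breaker.call(fetch_profile))` retries transient errors but stops immediately once the breaker has opened. The `sleep`, `rng`, and `clock` parameters exist only so the behaviour can be tested without real waiting.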

Graceful degradation

Aim to degrade, not collapse. If the recommendations service is down, show a generic list rather than an error page. If search is slow, return cached results. A system that loses a feature under stress is far better than one that goes entirely dark — partial service beats no service.
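
The recommendations example could be sketched roughly as follows: bound the wait on the dependency, and substitute a generic result if it is slow or fails. This is a minimal sketch under stated assumptions; `with_fallback`, `GENERIC_RECOMMENDATIONS`, the pool size, and the timeout are all hypothetical choices for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

# A small dedicated pool also acts as a crude bulkhead: at most four
# threads can ever be stuck waiting on this one dependency.
_pool = ThreadPoolExecutor(max_workers=4)

# Hypothetical generic list shown when personalised results are unavailable.
GENERIC_RECOMMENDATIONS = ["bestsellers", "new arrivals"]


def with_fallback(func, fallback, timeout=0.5):
    """Return func()'s result, or `fallback` if it fails or exceeds `timeout`."""
    future = _pool.submit(func)
    try:
        return future.result(timeout=timeout)
    except FutureTimeout:
        # Best effort only: cancel() cannot interrupt a task already running.
        future.cancel()
        return fallback
    except Exception:
        # The dependency failed outright: degrade instead of erroring.
        return fallback
```

A caller would wrap the risky call, for example `with_fallback(fetch_recommendations, GENERIC_RECOMMENDATIONS)` with a hypothetical `fetch_recommendations`. In practice the fallback path should also be logged or counted, so that degraded service stays visible to operators.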

Some teams deliberately inject failures in production (chaos engineering) to prove their systems degrade gracefully. The mindset underneath is the lesson: failure isn't an exception to handle someday — it's the normal condition you design for from the start.

Where to go next

That completes Distributed Systems. The final advanced module tackles the reason many teams distribute in the first place: Scaling & Performance.
