Monitoring and alerting
Turn signals into timely warnings — without drowning in noise.
- Define SLIs and SLOs and explain why they anchor alerts
- Alert on symptoms users feel, not every internal metric
- Avoid alert fatigue
Monitoring watches your metrics and logs; alerting notifies a human when something needs attention. The hard part isn't collecting data; it's deciding what's worth waking someone up for. Good alerting catches real problems early; bad alerting trains everyone to ignore alerts.
Measure what users feel: SLIs and SLOs
Anchor monitoring in the user's experience, not internal trivia:
- An SLI (Service Level Indicator) is a metric that reflects user experience, such as request latency, error rate, or availability.
- An SLO (Service Level Objective) is the target for an SLI, e.g. "99.9% of requests succeed" or "95% of requests complete in under 300 ms."
SLOs give alerts an objective threshold and give the team a shared definition of "healthy" — and explicit permission to not chase perfection beyond the target.
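To make the distinction concrete, here is a minimal sketch that computes two SLIs from a window of requests and checks each against its SLO. The `Request` type, the function names, and the sample traffic are all hypothetical, made up for illustration. The targets are the two example SLOs quoted above.

```python
from dataclasses import dataclass


@dataclass
class Request:
    """One observed request (hypothetical shape, for illustration only)."""
    succeeded: bool
    latency_ms: float


def availability_sli(requests):
    """SLI: fraction of requests that succeeded, from 0.0 to 1.0."""
    if not requests:
        return 1.0  # assumption: an empty window counts as healthy
    return sum(r.succeeded for r in requests) / len(requests)


def latency_sli(requests, threshold_ms=300.0):
    """SLI: fraction of requests that completed in under threshold_ms."""
    if not requests:
        return 1.0  # assumption: an empty window counts as healthy
    return sum(r.latency_ms < threshold_ms for r in requests) / len(requests)


def meets_slo(sli, target):
    """An SLO is met when the measured SLI is at or above its target."""
    return sli >= target


# Made-up window of 1000 requests: 958 fast successes,
# 40 slow successes, and 2 fast failures.
window = (
    [Request(True, 120.0)] * 958
    + [Request(True, 450.0)] * 40
    + [Request(False, 90.0)] * 2
)

availability = availability_sli(window)  # 998 / 1000 = 0.998
fast_enough = latency_sli(window)        # 960 / 1000 = 0.96

print(meets_slo(availability, 0.999))  # False: below the 99.9% success target
print(meets_slo(fast_enough, 0.95))    # True: above the 95% under-300 ms target
```

In this made-up window the latency objective is met while the availability objective is missed, so only availability would be a candidate for an alert.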
Alert on symptoms, not causes
Alert on what the user feels — error rate up, latency up, the site down — not on every internal fluctuation ("CPU at 80%"). High CPU might be fine; a spike in failed checkouts is never fine. Symptom-based alerts catch real problems regardless of cause and don't fire on harmless internal noise.
A useful frame: page a human for things that are urgent and actionable; everything else goes to a dashboard or a ticket, not someone's phone at night.
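The frame above can be sketched as a small routing function. This is an illustration under stated assumptions, not a prescribed policy. The `Route` enum and `route_signal` are hypothetical names, and the rule that splits non-paging signals between a ticket (something to do, but not now) and a dashboard (nothing to do) is one reasonable reading of "a dashboard or a ticket".

```python
from enum import Enum


class Route(Enum):
    PAGE = "page"            # interrupts a human right now
    TICKET = "ticket"        # handled during working hours
    DASHBOARD = "dashboard"  # visible on demand, notifies nobody


def route_signal(urgent, actionable):
    """Decide where a signal goes; only urgent AND actionable pages a human."""
    if urgent and actionable:
        return Route.PAGE
    if actionable:
        # Assumption: work that can wait becomes a ticket.
        return Route.TICKET
    # Assumption: nothing to do about it, so it is context, not an alert.
    return Route.DASHBOARD


# A spike in failed checkouts: users are hurting and someone can act.
print(route_signal(urgent=True, actionable=True))    # Route.PAGE

# "CPU at 80%" with no user impact: nothing to do, keep it on a dashboard.
print(route_signal(urgent=False, actionable=False))  # Route.DASHBOARD

# Hypothetical example: a disk that will fill in two weeks.
print(route_signal(urgent=False, actionable=True))   # Route.TICKET
```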
Beware alert fatigue
The fastest way to make monitoring useless is too many alerts. When alerts fire constantly — especially false ones — people stop trusting them and miss the real one. Every alert should be actionable: if there's nothing to do about it, it's not an alert, it's noise. Prune relentlessly.
An alert that fires and is ignored is worse than no alert: it adds noise and erodes trust in the whole system. If an alert isn't both actionable and urgent, demote it to a dashboard panel or a ticket instead.
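One way to make pruning routine is to review alert history and flag alerts that rarely lead to action. The sketch below assumes you can export each firing as an (alert name, was it acted on) pair. The alert names, the sample history, and the 50% cutoff are hypothetical and would need tuning for a real team.

```python
from collections import defaultdict


def actionable_rate(firings):
    """Map each alert name to the fraction of its firings that led to action."""
    counts = defaultdict(lambda: [0, 0])  # name -> [acted_on, total]
    for name, acted_on in firings:
        counts[name][0] += int(acted_on)
        counts[name][1] += 1
    return {name: acted / total for name, (acted, total) in counts.items()}


def prune_candidates(firings, min_rate=0.5):
    """Names of alerts acted on less often than min_rate (an assumed cutoff)."""
    rates = actionable_rate(firings)
    return sorted(name for name, rate in rates.items() if rate < min_rate)


# Made-up history: the checkout alert always led to action,
# the CPU alert led to action once in ten firings.
history = (
    [("checkout_error_rate_high", True)] * 4
    + [("cpu_above_80_percent", False)] * 9
    + [("cpu_above_80_percent", True)] * 1
)

candidates = prune_candidates(history)
print(candidates)  # ['cpu_above_80_percent']
```

Alerts on the candidate list are not deleted automatically; they are the ones to review and, if they stay unactionable, demote.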
Where to go next
When an alert does fire on something real, you need a calm process. Next: incident response.