Designing for Failure Is Not Optional in Distributed Systems
by Arif Ikhsanudin, Backend Developer
The Assumption That Breaks Everything
Most systems are designed assuming components are available. The application calls the database, the database responds. The service calls the upstream API, the API returns data. This assumption is correct most of the time. When it is wrong — the database is temporarily unreachable, the upstream API times out — systems designed without explicit failure handling have undefined behavior.
Undefined behavior in production means: unknown recovery time, potential data corruption depending on what was in-flight, and engineers debugging under pressure with no runbook.
In a distributed system, meaning anything that makes network calls, component unavailability is not exceptional. Networks partition. Processes crash. Cloud provider zones have incidents. If a request depends on N components in series, each with 99.9% uptime and failing independently, the combined uptime is 0.999^N. At N=10 components, that is about 99.0%. That means roughly 87 hours of combined failure time per year across the system.
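The arithmetic is easy to check with a few lines of Python. This is a rough sketch that assumes the components sit in series and fail independently; the function names are illustrative and not from any library.

```python
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years


def combined_availability(per_component: float, n: int) -> float:
    """Availability of n components in series that fail independently."""
    return per_component ** n


def downtime_hours_per_year(availability: float) -> float:
    """Expected hours per year during which the system is unavailable."""
    return (1.0 - availability) * HOURS_PER_YEAR


if __name__ == "__main__":
    for n in (1, 5, 10, 20):
        a = combined_availability(0.999, n)
        hours = downtime_hours_per_year(a)
        print(f"N={n:>2}: availability={a:.3%}, downtime={hours:.1f} h/year")
```

Real systems rarely match the independence assumption exactly: correlated failures make things worse, redundancy makes them better. The point of the calculation is the trend, not the precise figure.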
Design for it.
The Failure Modes to Explicitly Handle
Timeouts. Every network call must have a timeout. An external service that stops responding does not return an error — it hangs. Without a timeout, the calling thread hangs indefinitely, holding resources (connections, memory, thread pool slots) until the operating system eventually closes the connection. Under load, this exhausts thread pools quickly.
Set timeouts per operation instead of relying on the library default. Defaults vary widely: some clients wait 30 or 60 seconds, and some never time out unless told to. A timeout appropriate for your SLA is typically much shorter.
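import httpx

# Avoid: relying on the library default. httpx defaults to 5 seconds,
# but many other clients wait far longer, or forever.
response = httpx.get("https://external-api.com/data")

# Prefer: an explicit timeout chosen for this operation.
# httpx.Timeout needs either a default value or all four of connect, read, write, pool.
response = httpx.get(
    "https://external-api.com/data",
    timeout=httpx.Timeout(connect=2.0, read=5.0, write=2.0, pool=2.0),
)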
Circuit breakers. A service that repeatedly fails should not keep receiving traffic that drains resources and fails users. A circuit breaker wraps calls to a downstream service and tracks its failure rate. Once the rate crosses a threshold (e.g., 50% of calls failing over a 10-second window), the circuit opens: subsequent calls fail fast without attempting the network call. After a cool-down period, the circuit half-opens and lets a trial call through to test whether the downstream has recovered.
Resilience4j (JVM), Polly (.NET), and PyBreaker (Python) implement this pattern. The principle: fail fast when the downstream is known-bad, rather than queuing requests that will fail anyway.
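To make the state machine concrete, here is a minimal sketch of a circuit breaker. It is not how Resilience4j, Polly, or PyBreaker are implemented. It is simplified in three ways: it counts consecutive failures instead of a failure rate over a window, it lets every call through while half-open instead of a single trial, and it is not thread-safe. All names (CircuitBreaker, CircuitOpenError, failure_threshold, reset_timeout) are made up for illustration.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(
        self,
        failure_threshold: int = 5,
        reset_timeout: float = 10.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock  # injectable so tests do not have to sleep
        self._failures = 0
        self._opened_at: Optional[float] = None  # None means closed

    @property
    def state(self) -> str:
        if self._opened_at is None:
            return "closed"
        if self._clock() - self._opened_at >= self.reset_timeout:
            return "half-open"
        return "open"

    def call(self, func: Callable[[], T]) -> T:
        if self.state == "open":
            # Fail fast: do not touch the network at all.
            raise CircuitOpenError("downstream marked unhealthy")
        try:
            result = func()
        except Exception:
            self._record_failure()
            raise
        # Any success closes the circuit and resets the counter.
        self._failures = 0
        self._opened_at = None
        return result

    def _record_failure(self) -> None:
        self._failures += 1
        # A failure while half-open reopens the circuit immediately.
        if self._opened_at is not None or self._failures >= self.failure_threshold:
            self._opened_at = self._clock()
```

A caller would wrap each downstream call, for example breaker.call(lambda: client.get(url)), and treat CircuitOpenError as the signal to serve a fallback. In production, one of the libraries named above is likely the better choice.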
Fallbacks. When a non-critical downstream dependency is unavailable, return a degraded response rather than an error. A product recommendation service that is down should not cause the product page to fail — it should cause the recommendations section to be empty, or served from a cached last-known result. Identifying which downstream dependencies are critical path (page fails without them) versus non-critical (page degrades gracefully) is a design decision that must happen before the incident.
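The recommendations case might look roughly like the sketch below. It assumes a hypothetical fetch callable that raises ConnectionError or TimeoutError when the recommendation service is unavailable, and it keeps the last known result in process memory purely for illustration. A real service would more likely use a shared cache with an expiry.

```python
from typing import Callable, Dict, List

# Last successful result per product (in-process only, for this sketch).
_last_known_good: Dict[str, List[str]] = {}


def recommendations_or_fallback(
    product_id: str,
    fetch: Callable[[str], List[str]],
) -> List[str]:
    """Return fresh recommendations, else the last known result, else an empty list."""
    try:
        items = fetch(product_id)
    except (ConnectionError, TimeoutError):
        # Non-critical dependency: degrade the section, do not fail the page.
        return _last_known_good.get(product_id, [])
    _last_known_good[product_id] = items
    return items
```

The page handler calls recommendations_or_fallback and renders whatever comes back. An empty list simply means the recommendations section is omitted.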
Retry with exponential backoff and jitter. Transient failures, such as a momentary network hiccup or a brief timeout, are often recoverable with a retry. But retrying immediately can overwhelm a struggling service (a retry storm). Exponential backoff with jitter (retry after roughly 1s, then 2s, then 4s, each with a random offset so that clients do not retry in lockstep) reduces retry pressure while still allowing recovery. Cap the number of attempts, and retry only operations that are safe to repeat.
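A retry helper along these lines can be sketched in a few lines. This version uses the "full jitter" variant, where each delay is drawn uniformly between zero and the exponential cap, not a fixed 1s/2s/4s plus a small offset. The function name, the parameters, and the choice of which exceptions count as transient are all assumptions made for illustration.

```python
import random
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    func: Callable[[], T],
    max_attempts: int = 4,
    base_delay: float = 1.0,
    max_delay: float = 30.0,
    retry_on: Tuple[Type[BaseException], ...] = (ConnectionError, TimeoutError),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call func, retrying transient errors with exponential backoff and full jitter."""
    for attempt in range(max_attempts):
        try:
            return func()
        except retry_on:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            # The cap grows as base_delay * 1, 2, 4, ... up to max_delay.
            cap = min(max_delay, base_delay * 2 ** attempt)
            # Full jitter: sleep a random amount between 0 and the cap.
            sleep(random.uniform(0, cap))
    raise ValueError("max_attempts must be at least 1")
```

Only exceptions listed in retry_on are retried; anything else propagates immediately. The sleep parameter exists so that tests can record delays without actually waiting.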
The Practices That Prove the Design Works
Chaos testing. Netflix's Chaos Monkey terminates random production instances. The principle is that if you do not regularly test failure handling, you do not know if it works. Start smaller: chaos engineering tools like Gremlin or AWS Fault Injection Service (formerly Fault Injection Simulator) let you simulate network latency, dependency unavailability, and instance termination in a controlled way. Test your fallbacks before your on-call engineer is testing them at 2am.
Runbooks for known failure modes. Every failure mode that has been identified in the design should have a documented recovery procedure. "What do we do if the payment service is down?" should have an answer that is findable in 60 seconds. A runbook is not a sign that the system is fragile — it is a sign that the team has thought about failure.
Design for failure explicitly. Every call that can fail will fail eventually. The question is whether the system has a plan.