Designing for Failure Is Not Optional in Distributed Systems

by Arif Ikhsanudin, Backend Developer

The Assumption That Breaks Everything

Most systems are designed assuming components are available. The application calls the database, the database responds. The service calls the upstream API, the API returns data. This assumption is correct most of the time. When it is wrong — the database is temporarily unreachable, the upstream API times out — systems designed without explicit failure handling have undefined behavior.

Undefined behavior in production means: unknown recovery time, potential data corruption depending on what was in-flight, and engineers debugging under pressure with no runbook.

In a distributed system, meaning anything that makes network calls, component unavailability is not exceptional. Networks partition. Processes crash. Cloud provider zones have incidents. If a request depends on N components, each with 99.9% uptime and failing independently, the combined uptime is 0.999^N. At N=10 components, that is about 99.0%, or roughly 87 hours of failure time per year across the system, ten times the 8.8 hours expected from any single component.
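The availability arithmetic can be checked with a few lines of Python. This is a rough sketch that assumes every component is required and that failures are independent; the function names are illustrative, not taken from any library.

```python
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years


def combined_availability(per_component: float, n: int) -> float:
    """Availability of n components that must all be up at once.

    Assumes failures are independent, which real systems only approximate.
    """
    return per_component ** n


def expected_downtime_hours(availability: float) -> float:
    """Expected hours per year during which the system is unavailable."""
    return (1.0 - availability) * HOURS_PER_YEAR


# Show how the downtime budget shrinks as dependencies are added.
for n in (1, 5, 10, 20):
    a = combined_availability(0.999, n)
    print(f"N={n:>2}: availability={a:.3%}, "
          f"downtime={expected_downtime_hours(a):.1f} h/year")
```

Correlated failures (a shared zone, a shared network) break the independence assumption, so treat the result as an order-of-magnitude estimate rather than a prediction.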

Design for it.

The Failure Modes to Explicitly Handle

Timeouts. Every network call must have a timeout. An external service that stops responding often does not return an error; it simply hangs. Without a timeout, the calling thread waits with it, holding resources (connections, memory, thread pool slots) until the connection is eventually torn down, which can take minutes or may never happen. Under load, this exhausts thread pools quickly.

Set timeouts per operation rather than relying on the library default. Defaults are often dangerously long (30 or 60 seconds), and some HTTP clients apply no timeout at all unless you set one. A timeout appropriate for your SLA is typically much shorter.

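import httpx

# Avoid: relying on the client default. httpx applies 5 seconds to each phase;
# some other clients wait indefinitely unless a timeout is set.
response = httpx.get("https://external-api.com/data")

# Prefer: explicit per-phase timeouts sized to your SLA.
# httpx.Timeout needs either a default value or all four phases
# (connect, read, write, pool) set explicitly, otherwise it raises ValueError.
response = httpx.get(
    "https://external-api.com/data",
    timeout=httpx.Timeout(connect=2.0, read=5.0, write=2.0, pool=2.0),
)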

Circuit breakers. A service that repeatedly fails should not receive continued traffic that drains resources and fails users. A circuit breaker wraps calls to a downstream service and tracks the failure rate. Once a threshold is crossed (e.g., 50% failure rate over 10 seconds), the circuit opens: subsequent calls fail fast without attempting the network call. After a cool-down period, the circuit half-opens and lets a trial request through to test whether the downstream has recovered. If the trial succeeds the circuit closes again; if it fails the circuit reopens.

Resilience4j (JVM), Polly (.NET), and PyBreaker (Python) implement this pattern. The principle: fail fast when the downstream is known-bad, rather than queuing requests that will fail anyway.
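The state machine is small enough to sketch by hand. The version below is a simplified illustration, not the API of any of the libraries above: it opens after a number of consecutive failures instead of tracking a failure rate over a time window, it is not thread-safe, and the names CircuitBreaker, CircuitOpenError, failure_threshold, and reset_timeout are all hypothetical.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    """Minimal consecutive-failure circuit breaker (illustrative only)."""

    def __init__(
        self,
        failure_threshold: int = 5,
        reset_timeout: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock  # injectable so tests can control time
        self._failures = 0
        self._opened_at: Optional[float] = None  # None means closed

    def call(self, func: Callable[[], T]) -> T:
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_timeout:
                # Open: fail fast without touching the network.
                raise CircuitOpenError("circuit open; failing fast")
            # Cool-down elapsed: half-open, let this call through as a probe.
        try:
            result = func()
        except Exception:
            self._failures += 1
            probe_failed = self._opened_at is not None
            if probe_failed or self._failures >= self.failure_threshold:
                # Open (or reopen) and restart the cool-down.
                self._opened_at = self._clock()
            raise
        # Any success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

A caller would wrap each downstream request, for example `breaker.call(lambda: client.get(url))`, and treat CircuitOpenError as a signal to serve a fallback. In production, a maintained library is usually the safer choice, since it handles concurrency and windowed failure rates.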

Fallbacks. When a non-critical downstream dependency is unavailable, return a degraded response rather than an error. A product recommendation service that is down should not cause the product page to fail — it should cause the recommendations section to be empty, or served from a cached last-known result. Identifying which downstream dependencies are critical path (page fails without them) versus non-critical (page degrades gracefully) is a design decision that must happen before the incident.
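As a minimal sketch of the recommendations example, the function below returns the last known result when the dependency fails, and an empty list if nothing has been cached yet. The function name, the in-memory dictionary, and the fetch callable are assumptions made for illustration; a real service would more likely use a shared cache with an expiry.

```python
from typing import Callable, Dict, List

# Last successful result per product. In-memory only, for illustration.
_last_known: Dict[str, List[str]] = {}


def recommendations_with_fallback(
    product_id: str,
    fetch: Callable[[str], List[str]],
) -> List[str]:
    """Return fresh recommendations, or a degraded result if fetch fails."""
    try:
        items = fetch(product_id)
    except Exception:
        # Non-critical dependency is down: degrade instead of failing the page.
        # Real code should catch the specific errors its client raises.
        return _last_known.get(product_id, [])
    _last_known[product_id] = items  # remember the latest good result
    return items
```

The page handler calls this function instead of the recommendation client directly, so an outage in that service shows up as stale or empty recommendations rather than an error page.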

Retry with exponential backoff and jitter. Transient failures — a momentary network hiccup, a brief timeout — are often recoverable with a retry. Immediate retry on failure can overwhelm a struggling service (retry storm). Exponential backoff with jitter — retry after 1s, then 2s, then 4s, with random jitter to spread retries across time — reduces retry pressure while allowing recovery.
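One way to sketch this is shown below. It uses the "full jitter" variant, where each wait is drawn uniformly between zero and the exponential ceiling (1s, 2s, 4s, ...), which is one of several reasonable jitter strategies. The function name, parameters, and the choice of retryable exceptions are assumptions for illustration.

```python
import random
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    func: Callable[[], T],
    max_attempts: int = 4,
    base_delay: float = 1.0,
    max_delay: float = 30.0,
    retry_on: Tuple[Type[BaseException], ...] = (ConnectionError, TimeoutError),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call func, retrying transient errors with exponential backoff and jitter."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return func()
        except retry_on:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            # Ceiling doubles each attempt: base, 2*base, 4*base, ... capped.
            ceiling = min(max_delay, base_delay * 2 ** attempt)
            # Full jitter: spread retries from many clients across the interval.
            sleep(random.uniform(0, ceiling))
    raise AssertionError("unreachable")  # loop always returns or raises
```

Only errors listed in retry_on are retried; anything else propagates immediately. Retries are safest for operations that can be repeated without side effects, and they combine naturally with a per-call timeout so that each attempt is bounded.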

The Practices That Prove the Design Works

Chaos testing. Netflix's Chaos Monkey terminates random production instances. The principle is that if you do not regularly test failure handling, you do not know if it works. Start smaller: chaos engineering tools like Gremlin or AWS Fault Injection Simulator let you simulate network latency, dependency unavailability, and instance termination in a controlled way. Test your fallbacks before your on-call engineer is testing them at 2am.

Runbooks for known failure modes. Every failure mode that has been identified in the design should have a documented recovery procedure. "What do we do if the payment service is down?" should have an answer that is findable in 60 seconds. A runbook is not a sign that the system is fragile — it is a sign that the team has thought about failure.

Design for failure explicitly. Every call that can fail will fail eventually. The question is whether the system has a plan.
