Designing for Failure Is Not Pessimism. It Is Professionalism.
by Arif Ikhsanudin, Backend Developer
The Optimistic Architecture Problem
Most systems are designed for the happy path. Services call each other, databases respond, messages are processed, and everything works. Architecture diagrams show arrows between boxes, implying reliable connections. The failure cases — timeouts, partial responses, network partitions, dependency outages — are addressed in sprint retros after the incident, not in the design session before it.
This is not negligence. It's a natural consequence of building under time pressure while requirements are focused on functionality. But it means that the failure behavior of a system — how it behaves when things go wrong — is typically undesigned. And undesigned failure behavior is almost always worse than any deliberate alternative.
Failures Are Not Edge Cases at Scale
At small scale, the probability of any given dependency failing on any given request is negligible. As you scale (more traffic, more services, longer dependency chains), the aggregate failure probability grows. Google's original SRE book noted that in a system with 100 components each at 99.9% availability, a request that must touch all of them succeeds, assuming independent failures, only 0.999^100 ≈ 90.5% of the time. Infrastructure that feels reliable at small scale becomes regularly unreliable at production scale.
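The arithmetic is easy to reproduce. The sketch below is illustrative only: the class and method names are made up for this example, and it assumes component failures are independent, which real systems often violate.

```java
// Sketch: availability of a request that must succeed at every one of n
// components, each with the same per-request availability.
final class SerialAvailability {

    // Assumes independent failures; correlated failures change the result.
    static double of(double perComponent, int components) {
        return Math.pow(perComponent, components);
    }

    public static void main(String[] args) {
        for (int n : new int[] {1, 10, 100}) {
            System.out.printf("%3d components -> %.2f%%%n", n, 100 * of(0.999, n));
        }
    }
}
```

With 99.9% per component, ten components already drop a request's success rate to about 99.0%, and a hundred take it to roughly 90.5%.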
Michael T. Nygard's Release It! (Pragmatic Bookshelf) systematizes this observation: cascading failures, where one service's degradation takes down everything that depends on it, are not exotic. They are the default behavior of systems that weren't designed to contain failure.
The Patterns That Matter
Circuit Breakers
A circuit breaker monitors requests to a downstream dependency. When failures exceed a threshold, it "opens": subsequent requests fail immediately without attempting the downstream call. After a configurable period, it becomes "half-open" and allows a test request through. If that request succeeds, the circuit closes. If not, it opens again and the wait restarts.
This prevents a failing dependency from consuming your thread pool with requests that will all fail anyway. Resilience4j is a widely used implementation for JVM applications; it provides circuit breaker, retry, rate limiter, and bulkhead functionality with metrics integration.
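import io.github.resilience4j.circuitbreaker.annotation.CircuitBreaker;

// Calls go through the circuit breaker instance named "paymentService".
@CircuitBreaker(name = "paymentService", fallbackMethod = "paymentFallback")
public PaymentResult charge(PaymentRequest request) {
    return paymentClient.charge(request);
}

// Fallback: same parameters as charge(), plus the exception as the last one.
public PaymentResult paymentFallback(PaymentRequest request, Exception e) {
    // Queue for async retry rather than failing the user's request
    paymentRetryQueue.enqueue(request);
    return PaymentResult.pendingRetry(request.getOrderId());
}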
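To make the mechanism behind the annotation less opaque, here is a minimal sketch of the open/closed cycle as a hand-rolled state machine, with the intermediate trial state commonly called half-open. It is not how Resilience4j is implemented: it counts consecutive failures rather than using a sliding window, it is not thread-safe, and the class name, threshold, and injected clock are assumptions made for illustration.

```java
import java.util.function.LongSupplier;
import java.util.function.Supplier;

// Minimal single-threaded sketch of the closed -> open -> half-open cycle.
final class SimpleCircuitBreaker {
    enum State { CLOSED, OPEN, HALF_OPEN }

    private final int failureThreshold;
    private final long openMillis;
    private final LongSupplier clock;   // injected so timing can be tested
    private State state = State.CLOSED;
    private int consecutiveFailures = 0;
    private long openedAt = 0;

    SimpleCircuitBreaker(int failureThreshold, long openMillis, LongSupplier clock) {
        this.failureThreshold = failureThreshold;
        this.openMillis = openMillis;
        this.clock = clock;
    }

    State state() { return state; }

    <T> T call(Supplier<T> downstream, Supplier<T> fallback) {
        if (state == State.OPEN) {
            if (clock.getAsLong() - openedAt < openMillis) {
                return fallback.get();      // fail fast, no downstream call
            }
            state = State.HALF_OPEN;        // wait elapsed: allow one trial call
        }
        try {
            T result = downstream.get();
            state = State.CLOSED;           // success closes the circuit
            consecutiveFailures = 0;
            return result;
        } catch (RuntimeException e) {
            consecutiveFailures++;
            if (state == State.HALF_OPEN || consecutiveFailures >= failureThreshold) {
                state = State.OPEN;         // trip, or re-open after a failed trial
                openedAt = clock.getAsLong();
            }
            return fallback.get();
        }
    }
}
```

A caller would wrap each downstream call, for example `breaker.call(() -> client.fetch(id), () -> cachedValue)`, where `client` and `cachedValue` are placeholders. The important property is the first branch: while the circuit is open, the downstream call is never attempted.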
Timeouts Everywhere
Every network call must have a timeout. No exceptions. A call without a timeout will block your thread until the downstream service responds or the connection is reset by the OS, which can take minutes. In a thread-per-request model, every thread waiting on a slow service is a request your service cannot serve; once the pool is exhausted, unrelated requests start failing too.
Timeouts need to be set at multiple levels: connection establishment, initial response, and total request duration. Many HTTP clients default to no timeout or an absurdly long one. Set them explicitly and document why the specific value was chosen.
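As one concrete illustration, the JDK's built-in `java.net.http` client (Java 11+) exposes a connect timeout on the client and a response timeout on each request. The class name and the two durations below are placeholders, not recommendations; appropriate values depend on the dependency's observed latency.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.time.Duration;

final class TimeoutExample {
    // Hypothetical budgets for illustration only.
    static final Duration CONNECT_TIMEOUT = Duration.ofMillis(500);
    static final Duration REQUEST_TIMEOUT = Duration.ofSeconds(2);

    static HttpClient newClient() {
        return HttpClient.newBuilder()
                .connectTimeout(CONNECT_TIMEOUT)   // limit on establishing the connection
                .build();
    }

    static HttpRequest newRequest(URI uri) {
        return HttpRequest.newBuilder(uri)
                .timeout(REQUEST_TIMEOUT)          // limit on waiting for the response
                .GET()
                .build();
    }
}
```

When the request timeout elapses, `HttpClient.send` throws `HttpTimeoutException`. An overall deadline that also covers reading a slow response body likely needs something extra, for example `sendAsync` combined with `CompletableFuture.orTimeout`.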
Bulkheads
The pattern is named after the compartmentalized sections of a ship's hull that prevent flooding in one compartment from sinking the vessel. A bulkhead in software isolates resource pools so that a failing dependency can only consume the resources allocated to it.
A service that calls three different downstream APIs from one shared thread pool can have that entire pool exhausted by a single failing API, which takes calls to the other two down with it. With bulkheads, each downstream dependency has a dedicated thread or connection pool. One API going slow degrades that API's feature set, not everything.
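The simplest form of this isolation is a concurrency cap per dependency. The sketch below uses a `java.util.concurrent.Semaphore`; the class name and the choice to reject immediately instead of queueing are assumptions made for illustration, and production libraries offer more options.

```java
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

// Sketch of a semaphore bulkhead: at most maxConcurrent calls to one
// dependency may be in flight; extra callers get the fallback immediately.
final class SemaphoreBulkhead {
    private final Semaphore permits;

    SemaphoreBulkhead(int maxConcurrent) {
        this.permits = new Semaphore(maxConcurrent);
    }

    <T> T call(Supplier<T> downstream, Supplier<T> fallback) {
        if (!permits.tryAcquire()) {
            return fallback.get();      // compartment full: reject, do not queue
        }
        try {
            return downstream.get();
        } finally {
            permits.release();          // always hand the permit back
        }
    }
}
```

You would create one instance per downstream dependency, so a slow recommendations API can occupy at most its own permits and never the ones reserved for, say, payments.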
Retries With Exponential Backoff and Jitter
Retries are necessary for transient failures. Retries without backoff can amplify load on an already struggling service. Retries without jitter cause all clients to retry simultaneously, creating thundering herds.
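import io.github.resilience4j.core.IntervalFunction;
import io.github.resilience4j.retry.RetryConfig;

RetryConfig config = RetryConfig.custom()
    .maxAttempts(3)  // the initial call plus up to two retries
    // The interval function supplies the delay, so no separate waitDuration is set:
    // 100 ms initial interval, multiplier 2, randomization factor 0.5
    .intervalFunction(IntervalFunction.ofExponentialRandomBackoff(100, 2, 0.5))
    .retryOnException(e -> e instanceof TransientException)  // retry transient failures only
    .build();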
The jitter (0.5 here, meaning each delay is randomized by up to 50% in either direction) spreads retries across time, reducing the chance that all clients retry at the same instant.
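To see what those three numbers do, here is a library-free sketch of randomized exponential backoff. It shows the general technique under the assumption that jitter is applied as a symmetric random factor; it is not Resilience4j's internal code, and the class and method names are invented for this example.

```java
import java.util.concurrent.ThreadLocalRandom;

final class Backoff {
    // Delay before retry number `attempt` (1-based):
    // base * multiplier^(attempt - 1), scaled by a random factor
    // drawn from [1 - jitter, 1 + jitter).
    static long delayMillis(long baseMillis, double multiplier, double jitter, int attempt) {
        double exponential = baseMillis * Math.pow(multiplier, attempt - 1);
        double factor = 1.0 + jitter * (2 * ThreadLocalRandom.current().nextDouble() - 1.0);
        return Math.round(exponential * factor);
    }
}
```

With a base of 100 ms, a multiplier of 2, and jitter of 0.5, the first three delays fall in the ranges 50 to 150 ms, 100 to 300 ms, and 200 to 600 ms, so clients that failed at the same moment are unlikely to come back at the same moment.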
Designing for Partial Availability
Not every feature needs the same availability. When a downstream service is degraded, you have options beyond "return an error to the user":
- Serve stale cached data with a staleness indicator
- Disable the feature that depends on the service (hide the recommendation widget, disable the social sharing button)
- Return a degraded but functional response (show the product page without personalized pricing)
These decisions should be made before the outage, not during it. For each service dependency, decide: when this is unavailable, what is the fallback behavior? Document it. Implement it. Test it with a feature flag before you need it.
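The first option, serving stale data with a staleness indicator, might look roughly like the sketch below. Everything here is hypothetical (the class, the `Result` type, the in-memory map standing in for a real cache), and it deliberately ignores eviction and maximum age, which a real implementation would need to decide on.

```java
import java.time.Instant;
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Sketch: try the live dependency first; on failure, return the last good
// value flagged as stale so the caller can show a staleness indicator.
final class StaleCacheFallback<K, V> {

    static final class Result<T> {
        final T value;
        final boolean stale;       // true when served from cache after a failure
        final Instant fetchedAt;   // when the value was last fetched live

        Result(T value, boolean stale, Instant fetchedAt) {
            this.value = value;
            this.stale = stale;
            this.fetchedAt = fetchedAt;
        }
    }

    private final Map<K, Result<V>> lastGood = new ConcurrentHashMap<>();
    private final Function<K, V> liveLookup;

    StaleCacheFallback(Function<K, V> liveLookup) {
        this.liveLookup = liveLookup;
    }

    Optional<Result<V>> get(K key) {
        try {
            Result<V> fresh = new Result<>(liveLookup.apply(key), false, Instant.now());
            lastGood.put(key, fresh);
            return Optional.of(fresh);
        } catch (RuntimeException e) {
            Result<V> cached = lastGood.get(key);
            return cached == null
                    ? Optional.empty()   // nothing cached: caller hides the feature
                    : Optional.of(new Result<>(cached.value, true, cached.fetchedAt));
        }
    }
}
```

The empty case maps onto the second option in the list: when there is nothing to fall back to, the caller disables or hides the feature instead of failing the whole response.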
The Practical Takeaway
Pick one downstream dependency your service calls synchronously and answer: what happens to your service when that dependency takes 10 seconds to respond? What happens when it fails completely? If the answer is "our service also fails," implement a timeout and circuit breaker before the next release. That single change will prevent the most common category of cascading failure in your system.