Designing for Failure Is Not Pessimism. It Is Professionalism.

by Arif Ikhsanudin, Backend Developer

The Optimistic Architecture Problem

Most systems are designed for the happy path. Services call each other, databases respond, messages are processed, and everything works. Architecture diagrams show arrows between boxes, implying reliable connections. The failure cases — timeouts, partial responses, network partitions, dependency outages — are addressed in sprint retros after the incident, not in the design session before it.

This is not negligence. It's a natural consequence of building under time pressure while requirements are focused on functionality. But it means that the failure behavior of a system — how it behaves when things go wrong — is typically undesigned. And undesigned failure behavior is almost always worse than any deliberate alternative.

Failures Are Not Edge Cases at Scale

At small scale, the probability of any given dependency failing on any given request is negligible. As you scale (more traffic, more services, longer dependency chains), the aggregate failure probability grows. Google's original SRE book noted that in a system with 100 components each at 99.9% availability, the expected availability of a request that touches all of them, assuming the components fail independently, is 0.999^100 ≈ 90.5%. Infrastructure that feels reliable at small scale becomes regularly unreliable at production scale.
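
The arithmetic is easy to reproduce. The following is a small sketch in plain Java; the class and method names are made up for illustration, and it assumes component failures are independent, which real systems only approximate.

```java
// Sketch: availability of a request that must touch every component in series,
// assuming each component fails independently of the others.
final class SerialAvailability {

    /** Probability that all `components` are up when each is up with probability `perComponent`. */
    static double of(double perComponent, int components) {
        return Math.pow(perComponent, components);
    }

    public static void main(String[] args) {
        for (int n : new int[] {1, 10, 100}) {
            System.out.printf("%d components at 99.9%%: %.2f%%%n", n, 100 * of(0.999, n));
        }
    }
}
```

For 1, 10, and 100 components the result is roughly 99.9%, 99.0%, and 90.5%: each added hop removes a little more of the availability budget.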

Michael T. Nygard's Release It! (Pragmatic Bookshelf) systematizes this observation: cascading failures, where one service's degradation takes down everything that depends on it, are not exotic. They are the default behavior of systems that weren't designed to contain failure.

The Patterns That Matter

Circuit Breakers

A circuit breaker monitors requests to a downstream dependency. When failures exceed a threshold, it "opens": subsequent requests fail immediately without attempting the downstream call. After a configurable period, it becomes "half-open" and allows a test request through. If that request succeeds, the circuit closes. If not, it opens again and the wait restarts.

This prevents a failing dependency from consuming your thread pool with requests that will all fail anyway. Resilience4j is the standard implementation for JVM applications; it provides circuit breaker, retry, rate limiter, and bulkhead functionality with metrics integration.

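import io.github.resilience4j.circuitbreaker.annotation.CircuitBreaker;

// Calls go through the "paymentService" breaker. If the call throws, or the
// breaker is open, Resilience4j invokes paymentFallback instead.
@CircuitBreaker(name = "paymentService", fallbackMethod = "paymentFallback")
public PaymentResult charge(PaymentRequest request) {
    return paymentClient.charge(request);
}

// The fallback takes the same parameters as charge(), plus the exception as the last one.
public PaymentResult paymentFallback(PaymentRequest request, Exception e) {
    // Queue for async retry rather than failing the user's request
    paymentRetryQueue.enqueue(request);
    return PaymentResult.pendingRetry(request.getOrderId());
}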
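
To make the state transitions concrete, here is a minimal single-threaded sketch in plain Java. It is not how Resilience4j implements its breaker: this version counts consecutive failures instead of using a sliding window, is not thread-safe, and takes the current time as a parameter so it can be exercised without sleeping. The class name and parameters are hypothetical.

```java
import java.util.function.Supplier;

// Minimal sketch of the closed -> open -> half-open cycle (illustrative only).
final class SimpleCircuitBreaker {
    enum State { CLOSED, OPEN, HALF_OPEN }

    private final int failureThreshold;   // consecutive failures that trip the breaker
    private final long openMillis;        // how long to stay open before a trial call
    private State state = State.CLOSED;
    private int consecutiveFailures = 0;
    private long openedAt = 0;

    SimpleCircuitBreaker(int failureThreshold, long openMillis) {
        this.failureThreshold = failureThreshold;
        this.openMillis = openMillis;
    }

    State state() { return state; }

    <T> T call(Supplier<T> action, Supplier<T> fallback, long nowMillis) {
        if (state == State.OPEN) {
            if (nowMillis - openedAt < openMillis) {
                return fallback.get();        // fail fast: no downstream call is made
            }
            state = State.HALF_OPEN;          // wait elapsed: let one trial call through
        }
        try {
            T result = action.get();
            state = State.CLOSED;             // any success closes the circuit
            consecutiveFailures = 0;
            return result;
        } catch (RuntimeException e) {
            consecutiveFailures++;
            if (state == State.HALF_OPEN || consecutiveFailures >= failureThreshold) {
                state = State.OPEN;           // trip, or re-trip after a failed trial
                openedAt = nowMillis;
            }
            return fallback.get();
        }
    }
}
```

The important property is the first branch: while the breaker is open, the downstream call is never attempted, so no thread is spent waiting on it.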

Timeouts Everywhere

Every network call must have a timeout. No exceptions. A call without a timeout will block your thread until the downstream service responds or the OS resets the connection, which can take minutes. In a thread-per-request model, every thread waiting on a slow service is a thread that cannot serve another request; once the pool is full of waiting threads, the service stops answering anything, including requests that never touch the slow dependency.

Timeouts need to be set at multiple levels: connection establishment, initial response, and total request duration. Most HTTP clients default to no timeout or an absurdly long one. Set them explicitly and document why the specific value was chosen.
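
As an illustration of the three levels, here is a sketch using the JDK's built-in java.net.http.HttpClient (Java 11+). The class name and the timeout values are placeholders, not recommendations; real values should come from the latency you actually measure for the dependency.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;

final class TimeoutExample {
    // Placeholder values for illustration only.
    static final Duration CONNECT_TIMEOUT = Duration.ofSeconds(1);
    static final Duration RESPONSE_TIMEOUT = Duration.ofSeconds(2);
    static final Duration TOTAL_DEADLINE = Duration.ofSeconds(3);

    static HttpClient newClient() {
        return HttpClient.newBuilder()
                .connectTimeout(CONNECT_TIMEOUT)   // limit on establishing the connection
                .build();
    }

    static HttpRequest newRequest(URI uri) {
        return HttpRequest.newBuilder(uri)
                .timeout(RESPONSE_TIMEOUT)         // limit on waiting for the response
                .GET()
                .build();
    }

    static CompletableFuture<HttpResponse<String>> send(HttpClient client, HttpRequest request) {
        return client.sendAsync(request, HttpResponse.BodyHandlers.ofString())
                // Overall cap on how long the caller waits, whatever the cause.
                .orTimeout(TOTAL_DEADLINE.toMillis(), TimeUnit.MILLISECONDS);
    }
}
```

The orTimeout call bounds how long the caller waits; it should not be relied on to abort the underlying exchange, which is why the per-request timeout is still set.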

Bulkheads

Bulkheads are named after the compartmentalized sections of a ship's hull that prevent flooding in one compartment from sinking the vessel. A bulkhead in software isolates resource pools so that a failing dependency can only consume the resources allocated to it.

A service that calls three different downstream APIs through one shared thread pool can have that entire pool exhausted by a single failing API. With bulkheads, each downstream dependency has a dedicated pool of threads or connections. One API going slow degrades the features that depend on that API, not everything.
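
One lightweight way to get this isolation is a semaphore per dependency instead of a separate thread pool. The sketch below uses only the JDK; the Bulkhead class here is a hypothetical illustration, not the Resilience4j class of the same name.

```java
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

// Sketch of a semaphore bulkhead: create one instance per downstream dependency,
// so a slow dependency can hold at most `maxConcurrent` caller threads.
final class Bulkhead {
    private final Semaphore permits;

    Bulkhead(int maxConcurrent) {
        this.permits = new Semaphore(maxConcurrent);
    }

    /** Runs `action` if a permit is free; otherwise returns `whenFull` immediately. */
    <T> T call(Supplier<T> action, Supplier<T> whenFull) {
        if (!permits.tryAcquire()) {
            return whenFull.get();     // compartment is full: reject instead of queueing
        }
        try {
            return action.get();
        } finally {
            permits.release();         // always hand the permit back
        }
    }
}
```

With one Bulkhead per API, calls to a hung dependency are rejected once its permits are taken, while calls to the other dependencies still find free permits in their own instances.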

Retries With Exponential Backoff and Jitter

Retries are necessary for transient failures. Retries without backoff can amplify load on an already struggling service. Retries without jitter cause all clients to retry simultaneously, creating thundering herds.

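RetryConfig config = RetryConfig.custom()
    .maxAttempts(3)  // total attempts, including the initial call
    // Exponential backoff: 100 ms initial interval, multiplier 2, randomization factor 0.5.
    // waitDuration() is left out on purpose: the interval function already defines the
    // wait, and recent Resilience4j versions reject a config that sets both.
    .intervalFunction(IntervalFunction.ofExponentialRandomBackoff(100, 2, 0.5))
    .retryOnException(e -> e instanceof TransientException)
    .build();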

The jitter (0.5 here, meaning each delay varies randomly by up to 50% in either direction) spreads retries across time, reducing the chance that all clients retry at the same instant.
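
To see what those numbers do to the actual delays, here is a plain-Java sketch of exponential backoff with symmetric jitter. It illustrates the idea and is not claimed to match Resilience4j's exact formula; the class and method names are hypothetical.

```java
import java.util.concurrent.ThreadLocalRandom;

final class Backoff {
    /**
     * Delay before retry number `retry` (1-based): baseMillis * multiplier^(retry - 1),
     * scaled by a random factor in [1 - jitter, 1 + jitter).
     */
    static long delayMillis(long baseMillis, double multiplier, double jitter, int retry) {
        double exponential = baseMillis * Math.pow(multiplier, retry - 1);
        double factor = 1.0 + jitter * (2 * ThreadLocalRandom.current().nextDouble() - 1.0);
        return Math.round(exponential * factor);
    }

    public static void main(String[] args) {
        for (int retry = 1; retry <= 3; retry++) {
            System.out.println("retry " + retry + ": " + delayMillis(100, 2, 0.5, retry) + " ms");
        }
    }
}
```

With a 100 ms base, a multiplier of 2, and jitter of 0.5, the first retry waits somewhere between 50 and 150 ms, the second between 100 and 300 ms, and the third between 200 and 600 ms, so clients that failed together are unlikely to retry together.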

Designing for Partial Availability

Not every feature needs the same availability. When a downstream service is degraded, you have options beyond "return an error to the user":

  • Serve stale cached data with a staleness indicator
  • Disable the feature that depends on the service (hide the recommendation widget, disable the social sharing button)
  • Return a degraded but functional response (show the product page without personalized pricing)

These decisions should be made before the outage, not during it. For each service dependency, decide: when this is unavailable, what is the fallback behavior? Document it. Implement it. Test it with a feature flag before you need it.
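
As one example of a pre-decided fallback, the first option (stale data with a staleness indicator) can be sketched in plain Java as follows. The StaleCacheFallback class is hypothetical, and the sketch assumes the live lookup never returns null and signals failure with an unchecked exception.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Sketch: try the live dependency first; on failure, return the last good value
// and flag it as stale so the caller can show a staleness indicator.
final class StaleCacheFallback<K, V> {
    static final class Result<V> {
        final V value;
        final boolean stale;
        Result(V value, boolean stale) { this.value = value; this.stale = stale; }
    }

    private final Map<K, V> lastGood = new ConcurrentHashMap<>();
    private final Function<K, V> live;   // the call to the downstream dependency

    StaleCacheFallback(Function<K, V> live) {
        this.live = live;
    }

    Result<V> get(K key) {
        try {
            V fresh = live.apply(key);
            lastGood.put(key, fresh);            // remember the latest good answer
            return new Result<>(fresh, false);
        } catch (RuntimeException e) {
            V cached = lastGood.get(key);
            if (cached == null) {
                throw e;                         // nothing to degrade to
            }
            return new Result<>(cached, true);   // degraded but functional
        }
    }
}
```

The case where nothing has been cached yet still fails, which is a reminder that each fallback needs its own decision about what "unavailable" looks like to the user.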

The Practical Takeaway

Pick one downstream dependency your service calls synchronously and answer: what happens to your service when that dependency takes 10 seconds to respond? What happens when it fails completely? If the answer is "our service also fails," implement a timeout and circuit breaker before the next release. That single change will prevent the most common category of cascading failure in your system.
