Retry Logic Sounds Simple Until It Makes Things Worse

by Arif Ikhsanudin, Backend Developer

How retries can make an outage worse

Payment Service goes down for 90 seconds. Order Service, which calls Payment Service synchronously, retries every failed request three times with a 1-second delay. Order Service is handling 50 requests per second, so about 4,500 requests fail during the outage, and each one generates up to three retries: roughly 13,500 retry attempts, or about 150 extra requests per second, on top of the 50 new requests per second arriving normally. Payment Service, which is trying to recover from an overload condition, now faces about 200 requests per second, 4x its normal traffic. It goes back down. The cycle continues.

This is the thundering herd problem, and it is caused by retry logic that doesn't account for the systemic effect of many callers retrying simultaneously.
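The arithmetic behind the scenario is easy to check. The sketch below is a hypothetical helper (the class and method names are illustrative, not from any library) and assumes every request fails during the outage and uses all of its retries. Counting the original attempt plus three retries, each request becomes four attempts.

```java
/** Back-of-the-envelope retry amplification, using the figures from the scenario above. */
final class RetryAmplification {

    /** Requests that fail during the outage: arrival rate times outage length. */
    static long failedRequests(long requestsPerSecond, long outageSeconds) {
        return requestsPerSecond * outageSeconds;
    }

    /** Extra attempts generated if every failed request uses all of its retries. */
    static long retryAttempts(long failedRequests, int retriesPerRequest) {
        return failedRequests * retriesPerRequest;
    }

    /** Load multiplier relative to normal traffic: one original attempt plus its retries. */
    static int amplificationFactor(int retriesPerRequest) {
        return 1 + retriesPerRequest;
    }

    public static void main(String[] args) {
        long failed = failedRequests(50, 90);      // 4,500 failed requests
        long retries = retryAttempts(failed, 3);   // 13,500 retry attempts
        System.out.println("failed requests:  " + failed);
        System.out.println("retry attempts:   " + retries);
        System.out.println("retries per sec:  " + retries / 90);              // 150
        System.out.println("amplification:    " + amplificationFactor(3) + "x"); // 4x
    }
}
```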

The four components of correct retry behavior

Exponential backoff: each retry waits longer than the previous one. If the first retry waits 100ms, the second waits 200ms, the third waits 400ms. This reduces the retry rate over time and gives the downstream service space to recover.

Jitter: add randomness to the backoff interval. Without jitter, all instances of a caller retry at the same intervals — 100ms, 200ms, 400ms — producing synchronized retry bursts. With jitter (full jitter: random value between 0 and the backoff interval), retries spread across the recovery period:

// Exponential backoff with full jitter
long computeBackoff(int attempt, long baseMs, long maxMs) {
    long exponential = (long) (baseMs * Math.pow(2, attempt));
    long capped = Math.min(exponential, maxMs);
    return (long) (Math.random() * capped); // full jitter
}

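The AWS SDKs apply exponential backoff with jitter by default. Most HTTP client libraries (OkHttp, Apache HttpClient) do not; you have to configure or implement it yourself, for example with a custom interceptor or a resilience library.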
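To see why jitter matters, it can help to print the delay several callers would choose for the same attempt. This is a small illustrative sketch, not part of any library: the class name and the choice of five callers are assumptions, and the backoff formula mirrors the one above.

```java
import java.util.Random;

/** Compares plain exponential backoff with full jitter for several callers. */
final class JitterDemo {

    /** Exponential backoff without jitter: every caller gets the same delay. */
    static long backoff(int attempt, long baseMs, long maxMs) {
        return Math.min((long) (baseMs * Math.pow(2, attempt)), maxMs);
    }

    /** Full jitter: a uniformly random delay in [0, backoff). */
    static long backoffWithFullJitter(int attempt, long baseMs, long maxMs, Random rng) {
        return (long) (rng.nextDouble() * backoff(attempt, baseMs, maxMs));
    }

    public static void main(String[] args) {
        Random rng = new Random();
        // Third attempt (index 2) with a 100 ms base and a 5 s cap
        for (int caller = 1; caller <= 5; caller++) {
            System.out.printf("caller %d: fixed=%dms jittered=%dms%n",
                caller, backoff(2, 100, 5000), backoffWithFullJitter(2, 100, 5000, rng));
        }
    }
}
```

Every caller prints the same fixed delay, while the jittered delays land at different points below it, which is what breaks up the synchronized bursts.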

Retry budget: limit total retry attempts. Three retries is a common maximum. Beyond that, the request is likely failing for a reason that waiting won't fix, and you're just consuming resources. Some teams implement retry budgets at the service level rather than per-request — if more than 10% of requests in a window are retries, stop retrying entirely and let the circuit breaker take over.
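A service-level budget can be sketched as a pair of counters over a time window. The version below is a minimal illustration under stated assumptions: the class and method names are hypothetical, the window is a simple tumbling window that resets when it expires, and the 10% figure is interpreted as "retries may not exceed 10% of first attempts in the window". The clock is injected so the behavior can be tested.

```java
import java.util.function.LongSupplier;

/** Service-level retry budget: grants retries only while they stay under a share of recent traffic. */
final class RetryBudget {
    private final double maxRetryRatio;  // e.g. 0.10 for a 10% budget
    private final long windowMillis;     // length of the counting window
    private final LongSupplier clock;    // returns the current time in milliseconds
    private long windowStart;
    private long requests;               // first attempts seen in the current window
    private long retries;                // retries granted in the current window

    RetryBudget(double maxRetryRatio, long windowMillis, LongSupplier clock) {
        this.maxRetryRatio = maxRetryRatio;
        this.windowMillis = windowMillis;
        this.clock = clock;
        this.windowStart = clock.getAsLong();
    }

    /** Call once for every first attempt (not for retries). */
    synchronized void recordRequest() {
        rollWindowIfExpired();
        requests++;
    }

    /** Returns true and counts the retry if it still fits within the budget. */
    synchronized boolean tryAcquireRetry() {
        rollWindowIfExpired();
        if (retries + 1 > maxRetryRatio * requests) {
            return false; // budget exhausted: fail fast instead of retrying
        }
        retries++;
        return true;
    }

    /** Starts a fresh window, discarding old counts, once the current one has elapsed. */
    private void rollWindowIfExpired() {
        long now = clock.getAsLong();
        if (now - windowStart >= windowMillis) {
            windowStart = now;
            requests = 0;
            retries = 0;
        }
    }
}
```

A caller would invoke recordRequest() before each first attempt and check tryAcquireRetry() before each retry, for example with new RetryBudget(0.10, 10_000, System::currentTimeMillis).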

Idempotency: retries are only safe if the operation being retried is idempotent — repeating it has the same effect as doing it once. GET requests are inherently idempotent. POST requests that create resources are not — without idempotency protection, a network timeout after a successful payment charge, followed by a retry, charges the user twice.

The correct pattern for idempotent mutations uses an idempotency key sent with the request:

POST /payments
Idempotency-Key: 550e8400-e29b-41d4-a716-446655440000
Content-Type: application/json

{
  "orderId": "order-123",
  "amount": 99.99,
  "currency": "USD"
}

The server stores processed requests by idempotency key. If the same key arrives again (from a retry), it returns the stored result without re-executing the payment. The client generates the key before the first attempt and reuses it on every retry:

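String idempotencyKey = UUID.randomUUID().toString(); // generated once per logical operation
for (int attempt = 0; attempt < MAX_RETRIES; attempt++) { // MAX_RETRIES counts total attempts here
    try {
        return paymentClient.charge(request, idempotencyKey); // same key on every attempt
    } catch (TransientException e) {
        if (attempt == MAX_RETRIES - 1) throw e; // attempts exhausted: surface the failure
        Thread.sleep(computeBackoff(attempt, 100, 5000)); // may throw InterruptedException; let it propagate
    }
}
throw new IllegalStateException("MAX_RETRIES must be at least 1"); // only reached if the loop never runs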
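On the server side, the store of processed requests can be sketched with a concurrent map. This is an in-memory illustration only: the class and method names are hypothetical, and a production service would more likely use a durable store (for example a database table with a unique constraint on the key) and expire old keys.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

/** In-memory sketch of server-side idempotency handling. */
final class IdempotencyStore<R> {
    private final Map<String, R> resultsByKey = new ConcurrentHashMap<>();

    /**
     * Runs the operation the first time a key is seen and stores its result.
     * Later calls with the same key return the stored result without re-running it.
     * If the operation throws, nothing is stored and a retry will run it again.
     */
    R executeOnce(String idempotencyKey, Supplier<R> operation) {
        // ConcurrentHashMap.computeIfAbsent applies the function at most once per key, atomically
        return resultsByKey.computeIfAbsent(idempotencyKey, key -> operation.get());
    }
}
```

With this in place, a retried POST carrying the same Idempotency-Key gets the original result back, and the charge itself runs only once.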

What not to retry

Not every failure should be retried. Retrying a 400 Bad Request (client error) is pointless: the request is malformed, and retrying with the same payload produces the same error. Retrying a 409 Conflict (business rule violation: item out of stock) wastes resources. Only transient errors benefit from retry: 502 Bad Gateway, 503 Service Unavailable, 504 Gateway Timeout, 429 Too Many Requests, network timeouts, and connection resets.

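public boolean isRetryable(Exception ex) {
    // Pattern matching for instanceof requires Java 16+
    if (ex instanceof FeignException feignEx) {
        int status = feignEx.status();
        // Gateway and availability errors, plus rate limiting
        return status == 503 || status == 429 || status == 502 || status == 504;
    }
    // SocketException covers connection resets and its subclass ConnectException (connection refused)
    if (ex instanceof SocketTimeoutException || ex instanceof SocketException) {
        return true;
    }
    return false; // everything else, including 4xx client errors, is not retried
}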

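For 429 Too Many Requests specifically, the server may include a Retry-After header indicating how long to wait, either as a number of seconds or as an HTTP-date. Honor it:

if (ex instanceof FeignException feignEx && feignEx.status() == 429) {
    // responseHeaders() returns Map<String, Collection<String>>, so read the first value via an iterator
    String retryAfter = feignEx.responseHeaders()
        .getOrDefault("Retry-After", List.of("1")).iterator().next();
    try {
        Thread.sleep(Long.parseLong(retryAfter.trim()) * 1000); // delay-seconds form
    } catch (NumberFormatException notSeconds) {
        Thread.sleep(1000); // HTTP-date or malformed value: fall back to 1 second
    }
}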
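Retry-After is defined in two forms: a non-negative number of seconds, or an HTTP-date. A parser that handles both might look like the sketch below. The class name, the fallback parameter, and the decision to treat past dates as "retry now" are assumptions made for illustration.

```java
import java.time.Duration;
import java.time.Instant;
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;

/** Parses a Retry-After header value into a wait duration. */
final class RetryAfter {

    /** Accepts delay-seconds or an HTTP-date; returns the fallback for missing or malformed values. */
    static Duration parse(String headerValue, Instant now, Duration fallback) {
        if (headerValue == null || headerValue.isBlank()) {
            return fallback;
        }
        String value = headerValue.trim();
        try {
            // Form 1: delay in seconds, e.g. "120"
            return Duration.ofSeconds(Math.max(0, Long.parseLong(value)));
        } catch (NumberFormatException notSeconds) {
            // Not a number: try the date form below
        }
        try {
            // Form 2: HTTP-date, e.g. "Wed, 21 Oct 2015 07:28:00 GMT"
            Instant retryAt = ZonedDateTime.parse(value, DateTimeFormatter.RFC_1123_DATE_TIME).toInstant();
            Duration wait = Duration.between(now, retryAt);
            return wait.isNegative() ? Duration.ZERO : wait; // a date in the past means retry now
        } catch (DateTimeParseException notADate) {
            return fallback;
        }
    }
}
```

In practice it is also worth capping the returned duration, so a misbehaving server cannot park a caller for hours.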

Retry and circuit breakers together

Retries and circuit breakers work best together. The circuit breaker prevents retrying against a known-down service (stops calls immediately rather than retrying into a black hole). The retry logic handles transient blips before the circuit breaker threshold is hit. Configure them with the right relationship:

  • Retry: 3 attempts with exponential backoff + jitter
  • Circuit breaker: opens after 50% failure rate in a 20-call sliding window

With these settings, a genuine service outage triggers the circuit breaker (preventing retries from amplifying load) while brief transient failures (connection resets, single slow responses) are handled by retries before the circuit threshold is reached.
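The circuit breaker rule above (open at a 50% failure rate over a 20-call sliding window) can be illustrated with a minimal count-based breaker. This is a simplified sketch with hypothetical names: it omits the half-open state and the timed reset that production libraries such as Resilience4j provide, so once open it stays open.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Minimal count-based circuit breaker: opens when the failure rate over the last N calls hits a threshold. */
final class SlidingWindowBreaker {
    private final int windowSize;           // e.g. 20 calls
    private final double failureThreshold;  // e.g. 0.50 for 50%
    private final Deque<Boolean> outcomes = new ArrayDeque<>(); // true = failed call
    private int failures;
    private boolean open;

    SlidingWindowBreaker(int windowSize, double failureThreshold) {
        this.windowSize = windowSize;
        this.failureThreshold = failureThreshold;
    }

    /** Callers check this before every attempt, including retries. */
    synchronized boolean allowsCall() {
        return !open;
    }

    /** Records the outcome of one call and opens the circuit if the threshold is reached. */
    synchronized void record(boolean failed) {
        outcomes.addLast(failed);
        if (failed) {
            failures++;
        }
        // Drop the oldest outcome once the window is full
        if (outcomes.size() > windowSize && outcomes.removeFirst()) {
            failures--;
        }
        // Evaluate only on a full window, so a few early failures cannot open the circuit
        if (outcomes.size() == windowSize && (double) failures / windowSize >= failureThreshold) {
            open = true;
        }
    }
}
```

The retry loop would call allowsCall() before each attempt and stop immediately when it returns false. Every attempt, including each retry, is recorded, which is why a sustained outage reaches the threshold quickly while an isolated blip does not.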

Test this combination explicitly in staging by injecting failures at different rates and durations. Verify that transient failures (< 30 seconds) are handled by retries without triggering the circuit breaker, and genuine outages (> 60 seconds) open the circuit breaker and stop retry amplification.
