Retry Logic Sounds Simple Until It Makes Things Worse
by Arif Ikhsanudin, Backend Developer
How retries can make an outage worse
Payment Service goes down for 90 seconds. Order Service, which calls Payment Service synchronously, retries every failed request three times with a 1-second delay. Order Service is handling 50 requests per second, so during the 90-second outage about 4,500 requests fail and generate roughly 13,500 retry attempts. That is about 150 retries per second on top of the 50 new requests per second arriving normally. Payment Service, which is trying to recover from an overload condition, is now receiving 4x its normal traffic. It goes back down. The cycle continues.
This is the thundering herd problem, and it is caused by retry logic that doesn't account for the systemic effect of many callers retrying simultaneously.
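The arithmetic behind this scenario can be checked with a few lines of Java. This is a back-of-the-envelope sketch, not a traffic model: the class and method names are illustrative, and it assumes every request during the outage fails and uses all of its retries.

```java
// Back-of-the-envelope model of retry amplification (illustrative names).
final class RetryAmplification {

    /** Total retry attempts generated while the downstream service is down. */
    static long retryAttempts(int requestsPerSecond, int outageSeconds, int retriesPerRequest) {
        long failedRequests = (long) requestsPerSecond * outageSeconds;
        return failedRequests * retriesPerRequest;
    }

    /** Load on the downstream service relative to normal, counting first attempts plus retries. */
    static double loadMultiplier(int requestsPerSecond, int retriesPerRequest) {
        int retryRate = requestsPerSecond * retriesPerRequest;
        return (double) (requestsPerSecond + retryRate) / requestsPerSecond;
    }

    public static void main(String[] args) {
        System.out.println(retryAttempts(50, 90, 3)); // prints 13500
        System.out.println(loadMultiplier(50, 3));    // prints 4.0
    }
}
```

The multiplier depends only on the retry count, not on the request rate: every additional retry per request adds another full copy of normal traffic.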
The four components of correct retry behavior
Exponential backoff: each retry waits longer than the previous one. If the first retry waits 100ms, the second waits 200ms, the third waits 400ms. This reduces the retry rate over time and gives the downstream service space to recover.
Jitter: add randomness to the backoff interval. Without jitter, all instances of a caller retry at the same intervals — 100ms, 200ms, 400ms — producing synchronized retry bursts. With jitter (full jitter: random value between 0 and the backoff interval), retries spread across the recovery period:
// Exponential backoff with full jitter
long computeBackoff(int attempt, long baseMs, long maxMs) {
    long exponential = (long) (baseMs * Math.pow(2, attempt));
    long capped = Math.min(exponential, maxMs);
    return (long) (Math.random() * capped); // full jitter
}
AWS's SDKs apply exponential backoff with jitter by default. Many HTTP client libraries (OkHttp, Apache HttpClient) do not; you have to configure or implement it yourself.
Retry budget: limit total retry attempts. Three retries is a common maximum. Beyond that, the request is likely failing for a reason that waiting won't fix, and you're just consuming resources. Some teams implement retry budgets at the service level rather than per-request — if more than 10% of requests in a window are retries, stop retrying entirely and let the circuit breaker take over.
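A service-level retry budget might look like the sketch below. Everything here is an assumption for illustration: the RetryBudget class, its method names, the fixed (non-sliding) window, and the injected clock are not from any particular library. As a simplification it measures retries against first attempts in the current window rather than against all requests.

```java
import java.util.function.LongSupplier;

// Hypothetical service-level retry budget with a fixed time window.
final class RetryBudget {
    private final double maxRetryRatio;  // e.g. 0.10 allows 1 retry per 10 first attempts
    private final long windowMillis;
    private final LongSupplier clock;    // injected so the window can be tested
    private long windowStart;
    private long requests;
    private long retries;

    RetryBudget(double maxRetryRatio, long windowMillis, LongSupplier clock) {
        this.maxRetryRatio = maxRetryRatio;
        this.windowMillis = windowMillis;
        this.clock = clock;
        this.windowStart = clock.getAsLong();
    }

    /** Record a first attempt (not a retry). */
    synchronized void recordRequest() {
        rollWindowIfExpired();
        requests++;
    }

    /** Returns true and consumes budget if one more retry is allowed right now. */
    synchronized boolean tryAcquireRetry() {
        rollWindowIfExpired();
        if (retries + 1 > maxRetryRatio * requests) {
            return false; // budget exhausted: fail fast instead of retrying
        }
        retries++;
        return true;
    }

    /** Start a fresh window, discarding old counts, once the current one has expired. */
    private void rollWindowIfExpired() {
        long now = clock.getAsLong();
        if (now - windowStart >= windowMillis) {
            windowStart = now;
            requests = 0;
            retries = 0;
        }
    }
}
```

A caller would invoke recordRequest() once per logical request and tryAcquireRetry() before each retry, for example with new RetryBudget(0.10, 10_000, System::currentTimeMillis). When the budget is exhausted the failure is returned to the caller immediately.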
Idempotency: retries are only safe if the operation being retried is idempotent — repeating it has the same effect as doing it once. GET requests are inherently idempotent. POST requests that create resources are not — without idempotency protection, a network timeout after a successful payment charge, followed by a retry, charges the user twice.
The correct pattern for idempotent mutations uses an idempotency key sent with the request:
POST /payments
Idempotency-Key: 550e8400-e29b-41d4-a716-446655440000
Content-Type: application/json
{
"orderId": "order-123",
"amount": 99.99,
"currency": "USD"
}
The server stores processed requests by idempotency key. If the same key arrives again (from a retry), it returns the stored result without re-executing the payment. The client generates the key before the first attempt and reuses it on every retry:
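String idempotencyKey = UUID.randomUUID().toString(); // generated once per logical operation
// MAX_RETRIES counts total attempts, including the first one
for (int attempt = 0; attempt < MAX_RETRIES; attempt++) {
    try {
        return paymentClient.charge(request, idempotencyKey); // same key on every attempt
    } catch (TransientException e) {
        if (attempt == MAX_RETRIES - 1) throw e; // out of attempts: surface the failure
        Thread.sleep(computeBackoff(attempt, 100, 5000)); // caller must handle InterruptedException
    }
}
throw new IllegalStateException("MAX_RETRIES must be at least 1"); // reached only if the loop never runs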
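The server side of this exchange, storing results by idempotency key and replaying them on a repeat, can be sketched in memory as follows. The IdempotencyStore class and executeOnce method are hypothetical names. A production version would need a durable store shared across instances, key expiry, and a check that a repeated key carries the same payload, none of which is shown here.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

// In-memory sketch of server-side idempotency handling (hypothetical names).
final class IdempotencyStore<R> {
    private final Map<String, R> results = new ConcurrentHashMap<>();

    /**
     * Runs the operation at most once per key and returns the stored result on
     * any repeat. ConcurrentHashMap.computeIfAbsent is atomic per key, so two
     * concurrent retries with the same key do not both execute the operation.
     * If the operation throws, nothing is stored and a later retry runs it again.
     */
    R executeOnce(String idempotencyKey, Supplier<R> operation) {
        return results.computeIfAbsent(idempotencyKey, key -> operation.get());
    }
}
```

A payment handler would wrap the charge in executeOnce, passing the Idempotency-Key header value as the key. A retry that arrives after a successful charge then receives the original result instead of charging again.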
What not to retry
Not every failure should be retried. Retrying a 400 Bad Request (client error) is pointless: the request is malformed, and retrying with the same payload produces the same error. Retrying a 409 Conflict (business rule violation: item out of stock) wastes resources. Only transient errors benefit from retry: 502 Bad Gateway, 503 Service Unavailable, 504 Gateway Timeout, 429 Too Many Requests, network timeouts, and connection resets.
public boolean isRetryable(Exception ex) {
    if (ex instanceof FeignException feignEx) {
        int status = feignEx.status();
        return status == 503 || status == 429 || status == 502 || status == 504;
    }
    if (ex instanceof SocketTimeoutException || ex instanceof ConnectException) {
        return true;
    }
    return false;
}
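For 429 Too Many Requests specifically, the server may include a Retry-After header indicating how long to wait, either as a number of seconds or as an HTTP date. Honor it:
if (ex instanceof FeignException feignEx && feignEx.status() == 429) {
    // responseHeaders() returns Map<String, Collection<String>>, so read the first value via an iterator
    String retryAfter = feignEx.responseHeaders()
        .getOrDefault("Retry-After", List.of("1")).iterator().next();
    // Assumes the seconds form; an HTTP-date value would throw NumberFormatException here
    Thread.sleep(Long.parseLong(retryAfter) * 1000);
}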
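Because Retry-After can carry either a number of seconds or an HTTP date, a more defensive parse is worth considering. The helper below is a sketch using only java.time. The RetryAfterParser name, the fallback, and the upper bound are assumptions added for illustration. The cap in particular is a design choice, so that a misbehaving server cannot park a caller for hours.

```java
import java.time.Duration;
import java.time.Instant;
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;

// Hypothetical helper: turns a Retry-After header value into a bounded delay.
final class RetryAfterParser {

    static Duration parse(String headerValue, Instant now, Duration fallback, Duration max) {
        if (headerValue == null || headerValue.isBlank()) {
            return fallback;
        }
        String value = headerValue.trim();
        Duration delay;
        try {
            // Form 1: delay in seconds, e.g. "120"
            delay = Duration.ofSeconds(Long.parseLong(value));
        } catch (NumberFormatException notSeconds) {
            try {
                // Form 2: HTTP date, e.g. "Wed, 21 Oct 2015 07:28:00 GMT"
                Instant retryAt = ZonedDateTime
                        .parse(value, DateTimeFormatter.RFC_1123_DATE_TIME)
                        .toInstant();
                delay = Duration.between(now, retryAt);
            } catch (DateTimeParseException notDate) {
                return fallback; // unparseable header: use the caller's default
            }
        }
        if (delay.isNegative()) {
            return Duration.ZERO; // date already passed, or a negative number
        }
        return delay.compareTo(max) > 0 ? max : delay;
    }
}
```

Passing the current time in as a parameter keeps the helper deterministic. A caller would typically supply Instant.now() and sleep for the returned duration.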
Retry and circuit breakers together
Retries and circuit breakers work best together. The circuit breaker prevents retrying against a known-down service (stops calls immediately rather than retrying into a black hole). The retry logic handles transient blips before the circuit breaker threshold is hit. Configure them with the right relationship:
- Retry: 3 attempts with exponential backoff + jitter
- Circuit breaker: opens after 50% failure rate in a 20-call sliding window
With these settings, a genuine service outage triggers the circuit breaker (preventing retries from amplifying load) while brief transient failures (connection resets, single slow responses) are handled by retries before the circuit threshold is reached.
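To make the relationship concrete, here is a minimal count-based breaker using the 20-call window and 50% threshold from the settings above. It is a teaching sketch with illustrative names, not a substitute for a library such as Resilience4j: it has no half-open state and never closes again once open. It also assumes that every attempt, retries included, is checked against and reported to the breaker.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Minimal count-based circuit breaker (illustrative; no half-open state or timers).
final class SlidingWindowBreaker {
    private final int windowSize;
    private final double failureRateThreshold; // e.g. 0.5 for 50%
    private final Deque<Boolean> outcomes = new ArrayDeque<>(); // true = failed call
    private int failures;
    private boolean open;

    SlidingWindowBreaker(int windowSize, double failureRateThreshold) {
        this.windowSize = windowSize;
        this.failureRateThreshold = failureRateThreshold;
    }

    /** The retry loop checks this before every attempt, including retries. */
    synchronized boolean allowsCall() {
        return !open;
    }

    /** Every attempt reports its outcome, so retries count toward the failure rate. */
    synchronized void record(boolean failed) {
        outcomes.addLast(failed);
        if (failed) {
            failures++;
        }
        if (outcomes.size() > windowSize) {
            boolean evictedFailed = outcomes.removeFirst();
            if (evictedFailed) {
                failures--;
            }
        }
        // Only judge the failure rate once the window holds a full sample.
        if (outcomes.size() == windowSize
                && (double) failures / windowSize >= failureRateThreshold) {
            open = true;
        }
    }
}
```

Under these assumptions a brief blip (say 2 failures among 20 calls) leaves the breaker closed and is absorbed by retries. In a sustained outage, with three attempts per request, the window fills with failures after roughly seven requests, and further attempts are rejected without reaching the downstream service.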
Test this combination explicitly in staging by injecting failures at different rates and durations. Verify that transient failures (< 30 seconds) are handled by retries without triggering the circuit breaker, and genuine outages (> 60 seconds) open the circuit breaker and stop retry amplification.