Graceful Degradation: How to Keep Your App Running When Things Break
by Arif Ikhsanudin, Backend Developer
The difference between degraded and down
Your recommendation engine is unavailable. What happens to your product page? Option A: the page returns a 500 error and users see a broken experience. Option B: the page loads without recommendations, showing a static "popular items" fallback or nothing in that section. The difference between these outcomes is not infrastructure — it's whether anyone on your team made an explicit design decision about what happens when the recommendation service is unavailable.
Graceful degradation means your system provides reduced but functional service when a dependency fails, rather than failing completely. It is a design discipline, not a resilience pattern you can bolt on with a library. The library (circuit breaker, fallback handler) is the mechanism. The decision about what reduced service looks like is yours to make before the failure happens.
Mapping your degradation modes
For every external dependency your service has, you need an explicit answer to: "what does my service do when this is unavailable?"
Start by categorizing dependencies by criticality:
Critical dependencies: the service cannot fulfill its core function without them. If Order Service's payment database is unavailable, it cannot accept orders — there is no graceful degradation, only an honest failure. These dependencies justify a hard fail with a clear error.
Non-critical dependencies: the service can provide reduced but useful functionality without them. Recommendations, personalization, enhanced metadata, activity logging, analytics — all of these can be absent without preventing the core user flow.
Asynchronous dependencies: message brokers, notification services, audit log services. These can fail without affecting synchronous responses — the main flow completes, and the async work either retries or is recorded in a dead letter queue.
Document this categorization explicitly. A dependency map that shows criticality levels makes graceful degradation decisions concrete and reviewable.
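One way to keep that map reviewable is to put it in code next to the service. The following is a minimal sketch, not a prescribed format: the enum names, the specific dependencies, and the behavior descriptions are all hypothetical examples drawn from the cases in this article.

```java
import java.util.ArrayList;
import java.util.EnumMap;
import java.util.List;
import java.util.Map;

// Hypothetical categories matching the three groups described above.
enum Criticality { CRITICAL, NON_CRITICAL, ASYNC }

// Hypothetical dependency map: each entry records what the service does
// when that dependency is unavailable.
enum Dependency {
    PAYMENT_DB(Criticality.CRITICAL, "Fail the order request with a clear error"),
    RECOMMENDATIONS(Criticality.NON_CRITICAL, "Show static popular items or hide the section"),
    CATALOG(Criticality.NON_CRITICAL, "Serve the last known good response from the local cache"),
    EMAIL(Criticality.ASYNC, "Persist to pending_emails and send after recovery");

    final Criticality criticality;
    final String whenUnavailable;

    Dependency(Criticality criticality, String whenUnavailable) {
        this.criticality = criticality;
        this.whenUnavailable = whenUnavailable;
    }

    /** Groups dependencies by criticality so the map can be printed or reviewed. */
    static Map<Criticality, List<Dependency>> byCriticality() {
        Map<Criticality, List<Dependency>> grouped = new EnumMap<>(Criticality.class);
        for (Dependency d : values()) {
            grouped.computeIfAbsent(d.criticality, k -> new ArrayList<>()).add(d);
        }
        return grouped;
    }
}
```

Because the map lives in the codebase, adding a new dependency without stating its degraded behavior becomes something a reviewer can catch.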
Fallback strategies by dependency type
Cached responses: for dependencies that provide data that changes slowly, serve the last known good response from a local cache when the dependency is unavailable. Product details, user preferences, feature flags, configuration data — all reasonable candidates for cache-based fallback.
The key parameter is acceptable staleness. If your product catalog is cached for 5 minutes, a brief catalog outage means serving data that is up to 5 minutes old. If the catalog service is down for 30 minutes, you can only keep serving by holding entries past their normal expiry, which means serving 30-minute-old data. Whether that's acceptable is a product decision, not a technical one.
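// Assumes Resilience4j's annotation support (for example via its Spring Boot starter)
import io.github.resilience4j.circuitbreaker.annotation.CircuitBreaker;

@CircuitBreaker(name = "catalogService", fallbackMethod = "catalogFromCache")
public ProductDetails getProductDetails(String productId) {
    return catalogClient.getDetails(productId);
}

// Fallback: same parameters as the guarded method, plus the triggering exception
private ProductDetails catalogFromCache(String productId, Exception ex) {
    // Caffeine's getIfPresent returns the cached value or null, not an Optional
    ProductDetails cached = productCache.getIfPresent(productId);
    return cached != null ? cached : ProductDetails.minimal(productId); // last resort: minimal data
}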
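If the product decision is "stale data is fine, but only up to a limit", the fallback can enforce that limit itself. Here is a rough sketch using only the standard library. The class name LastKnownGoodCache and its methods are hypothetical and are not part of Caffeine or any circuit breaker library. The clock is passed in as a Supplier so the staleness check can be tested.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Map;
import java.util.Objects;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

/** Hypothetical helper: keeps the last good value per key and refuses to serve it past a limit. */
final class LastKnownGoodCache<K, V> {
    private static final class Entry<V> {
        final V value;
        final Instant storedAt;
        Entry(V value, Instant storedAt) {
            this.value = value;
            this.storedAt = storedAt;
        }
    }

    private final Map<K, Entry<V>> entries = new ConcurrentHashMap<>();
    private final Duration maxStaleness;
    private final Supplier<Instant> clock;

    LastKnownGoodCache(Duration maxStaleness, Supplier<Instant> clock) {
        this.maxStaleness = maxStaleness;
        this.clock = clock;
    }

    /** Call after every successful response from the dependency. */
    void recordSuccess(K key, V value) {
        entries.put(key, new Entry<>(Objects.requireNonNull(value), clock.get()));
    }

    /** Returns the stored value only if its age does not exceed maxStaleness. */
    Optional<V> getIfFreshEnough(K key) {
        Entry<V> entry = entries.get(key);
        if (entry == null) {
            return Optional.empty();
        }
        Duration age = Duration.between(entry.storedAt, clock.get());
        return age.compareTo(maxStaleness) <= 0 ? Optional.of(entry.value) : Optional.empty();
    }
}
```

In production you would likely construct it with Instant::now as the clock and call getIfFreshEnough from the fallback method, dropping to the minimal response when it returns empty.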
Default responses: when no cached data exists and the dependency is unavailable, return a sensible default. Empty recommendations list rather than error. "Price unavailable" rather than page crash. "Shipping estimate unavailable" rather than form failure.
Feature disabling: for features entirely powered by an unavailable service, disable the feature cleanly. The recommendations section doesn't render. The personalized banner shows a generic one. The "customers also viewed" widget is hidden. This requires that features be designed with the "absent" state in mind — not as an afterthought.
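Designing for the "absent" state can be as simple as making the section optional in the page model. The sketch below is illustrative only: ProductPage, assemble, and the HTML it emits are hypothetical names, and the recommendation client is modeled as a plain Supplier that throws when the service is down.

```java
import java.util.List;
import java.util.Optional;
import java.util.function.Supplier;

/** Hypothetical page model in which the recommendations section may be absent. */
final class ProductPage {
    private final String title;
    private final List<String> recommendations; // null means the section is disabled

    ProductPage(String title, List<String> recommendations) {
        this.title = title;
        this.recommendations = recommendations;
    }

    Optional<List<String>> recommendations() {
        return Optional.ofNullable(recommendations);
    }

    /** Renders the page; the recommendations section is omitted entirely when absent. */
    String render() {
        StringBuilder html = new StringBuilder("<h1>").append(title).append("</h1>");
        recommendations().ifPresent(items ->
            html.append("<section id=\"recs\">")
                .append(String.join(", ", items))
                .append("</section>"));
        return html.toString();
    }

    /** Builds the page, disabling the section if the recommendation call fails. */
    static ProductPage assemble(String title, Supplier<List<String>> recommendationClient) {
        try {
            return new ProductPage(title, recommendationClient.get());
        } catch (RuntimeException e) {
            return new ProductPage(title, null);
        }
    }
}
```

The point of the design is that the renderer never has to guess: an absent section is a first-class state, not an exception path.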
Async queue with response: for operations that don't need to be processed synchronously, accept the request, queue it locally, and return a response confirming receipt (for an HTTP API, 202 Accepted fits). If Email Service is down, persist the email to a local pending_emails table and process it when the service recovers. The user is told their action was received. The email sends later.
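The queue-and-acknowledge flow might look roughly like this. It is a sketch under loud assumptions: EmailOutbox is a hypothetical class, the email service is modeled as a Consumer that throws when it is down, and an in-memory queue stands in for the pending_emails table (a real implementation would persist entries so they survive a restart).

```java
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.function.Consumer;

/** Hypothetical outbox: tries to send immediately, queues on failure, drains later. */
final class EmailOutbox {
    enum Result { SENT, QUEUED }

    private final Consumer<String> emailService;              // throws RuntimeException when down
    private final Queue<String> pending = new ArrayDeque<>(); // stand-in for pending_emails

    EmailOutbox(Consumer<String> emailService) {
        this.emailService = emailService;
    }

    /** Never fails the caller: a send error turns into a queued email. */
    synchronized Result submit(String email) {
        try {
            emailService.accept(email);
            return Result.SENT;
        } catch (RuntimeException e) {
            pending.add(email);
            return Result.QUEUED;
        }
    }

    /** Retries queued emails in order, stopping at the first failure. Returns the number delivered. */
    synchronized int drain() {
        int delivered = 0;
        while (!pending.isEmpty()) {
            try {
                emailService.accept(pending.peek());
            } catch (RuntimeException e) {
                break; // still down: keep the remaining emails queued
            }
            pending.remove();
            delivered++;
        }
        return delivered;
    }

    synchronized int pendingCount() {
        return pending.size();
    }
}
```

A scheduled job or a recovery hook would call drain(). The Result value lets the caller choose the right message for the user.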
Communicating degradation to users
The worst user experience is silent degradation — the page loads but shows wrong data, or a feature appears to work but silently fails. This is worse than an honest error because users don't know to be skeptical of what they're seeing.
Design your degradation UX explicitly:
- If recommendations are unavailable, show "Recommendations unavailable right now" or nothing — not stale data from 3 days ago presented as current
- If pricing is from cache, show it with a "Prices may not reflect the latest updates" notice if staleness is a material concern
- If an action was queued rather than immediately processed, tell the user: "Your request was received and will be processed shortly"
Honesty in degraded states builds more trust than pretending everything is fine.
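One way to make that honesty hard to forget is to carry the origin of each value alongside the value, so the UI layer has to decide what to show. A small sketch follows, with hypothetical names (Origin, Degradable); the notice texts are the examples from the list above.

```java
import java.util.Optional;

// Hypothetical marker for where a response came from.
enum Origin { LIVE, CACHE, QUEUED }

/** Hypothetical wrapper that pairs a value with its origin and a user-facing notice. */
final class Degradable<T> {
    final T value;
    final Origin origin;

    Degradable(T value, Origin origin) {
        this.value = value;
        this.origin = origin;
    }

    boolean isDegraded() {
        return origin != Origin.LIVE;
    }

    /** Notice to display next to the value, or empty when the value is live. */
    Optional<String> userNotice() {
        switch (origin) {
            case CACHE:
                return Optional.of("Prices may not reflect the latest updates");
            case QUEUED:
                return Optional.of("Your request was received and will be processed shortly");
            default:
                return Optional.empty();
        }
    }
}
```

Whether a cached value actually needs a notice still depends on how much staleness matters for that field; the wrapper only makes sure the question gets asked.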
Testing degradation paths
Graceful degradation paths that aren't tested will fail in unexpected ways when they're actually needed. The fallback code is often the least-exercised code in your service.
Inject dependency failures in your integration test suite:
@Test
void productPageDegrades_whenCatalogServiceUnavailable() {
    // Stub catalog service to return 503
    wireMockServer.stubFor(get(urlPathMatching("/products/.*"))
        .willReturn(serviceUnavailable()));

    // Verify the call still returns fallback data instead of throwing.
    // Assumes the local cache holds no entry for sku-123, so the minimal fallback is used.
    ProductDetails result = productService.getProductDetails("sku-123");
    assertThat(result).isNotNull();
    assertThat(result.getTitle()).isEqualTo("Product sku-123"); // minimal fallback
    assertThat(result.isFromCache()).isFalse();
    assertThat(result.isDegraded()).isTrue();
}
Run a quarterly degradation drill in staging: take down each non-critical dependency one at a time and verify the system behaves as designed. Your runbooks should describe what degraded state looks like for each dependency so on-call engineers can recognize expected degradation versus unexpected failure.
The work is not in the circuit breaker configuration. It's in deciding, for every dependency, what your system should do when it's gone — and then building, testing, and documenting that behavior before you need it.