Why Your CI Tests Pass but Production Still Breaks
by Arif Ikhsanudin, Backend Developer
Green in CI. Broken in Production. What Went Wrong?
The postmortem is underway. Tests passed. The build was green. Deployment looked clean. And yet: the payment service is returning 500s, the cause is a NullPointerException on a field that's null in production data but never null in test fixtures, and the incident has been running for 22 minutes.
This is not a test coverage failure in the simple sense. You could have 90% line coverage and still miss this. The failure is a gap between what your tests model and what production actually looks like. That gap has predictable shapes.
Shape 1: Your Test Data Doesn't Reflect Production Data
Test fixtures are written by engineers who understand the domain. They're tidy, complete, and internally consistent. Production data is written by users, imported from legacy systems, corrupted by bugs in previous versions, and filtered through validation rules that changed three times in the last two years.
The most common version of this failure: a field that's logically required but nullable in the database because it was added after the initial schema design. Tests always set it. Production has millions of rows where it's null. The code path that reads it without a null check passes all tests and fails for a meaningful percentage of real requests.
// Passes tests (fixtures always set this field):
public String buildReference(Order order) {
    return order.getExternalReference().toUpperCase(); // NPE in prod
}

// Handles real data:
public String buildReference(Order order) {
    String ref = order.getExternalReference();
    return ref != null ? ref.toUpperCase() : "REF-" + order.getId();
}
The fix requires two things: adding test cases with missing/null/malformed data for every non-trivial field, and periodically sampling anonymized production data to create test fixtures that reflect real-world distributions. The second is more work but catches things the first doesn't.
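A "bad data" fixture set for the method above could look roughly like the sketch below. The Order class here is a hypothetical stand-in with only the two fields the method reads, the BadDataFixtures name is invented for illustration, and treating blank strings the same as null is an assumption that goes one step beyond the null check shown earlier.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for the real Order type: only the fields used here.
class Order {
    private final long id;
    private final String externalReference;

    Order(long id, String externalReference) {
        this.id = id;
        this.externalReference = externalReference;
    }

    long getId() { return id; }
    String getExternalReference() { return externalReference; }
}

class ReferenceBuilder {
    // Null-safe version. Treating blank values as missing is an assumption.
    static String buildReference(Order order) {
        String ref = order.getExternalReference();
        if (ref == null || ref.trim().isEmpty()) {
            return "REF-" + order.getId();
        }
        return ref.trim().toUpperCase();
    }
}

class BadDataFixtures {
    // Shapes that tidy fixtures rarely contain but production rows often do.
    static List<Order> all() {
        return Arrays.asList(
            new Order(1L, null),         // row created before the column existed
            new Order(2L, ""),           // empty string from an old import
            new Order(3L, "   "),        // whitespace only
            new Order(4L, " ab-123 "));  // padded value
    }

    public static void main(String[] args) {
        for (Order order : all()) {
            // None of these inputs should throw.
            System.out.println(order.getId() + " -> " + ReferenceBuilder.buildReference(order));
        }
    }
}
```

Keeping the fixtures in one shared list means every new code path that reads an Order can be run against the same awkward inputs, instead of each test inventing its own.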
Shape 2: Your Staging Environment Is Not Production
Staging has different configuration, different backing service versions, different data volume, different traffic patterns, and usually different infrastructure sizing. Tests that pass in staging encode all of these differences as implicit assumptions.
Common divergences that cause prod-only failures:
- Different database version: Staging on MySQL 8.0.28, production on MySQL 8.0.36. A query that passes on 8.0.28 may behave differently on 8.0.36 in edge cases involving date handling, JSON functions, or index selection.
- Different memory limits: A query that loads comfortably into memory in staging (100K rows) causes OOM in production (10M rows).
- Different connection pool sizing: Staging configured for 10 concurrent connections, production for 500. Thread contention patterns differ.
- Different JVM flags: Staging using G1GC, production using ZGC (or vice versa). GC pauses affect timeout behavior.
The discipline required: treat configuration drift between staging and production as a bug. Maintain both environments from the same infrastructure-as-code base (Terraform modules, Helm charts). Alert on divergence. Staging should be production at smaller scale, not a different thing.
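One way to alert on divergence is a scheduled check that compares the effective settings of both environments and reports anything that differs outside an explicit allowlist. The sketch below assumes the settings have already been exported as flat key/value maps; the ConfigDrift class, the key names, and the idea of allowlisting sizing values are all hypothetical.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

class ConfigDrift {
    // Returns every key whose value differs between environments,
    // skipping keys that are expected to differ (for example, sizing).
    static Map<String, String> diff(Map<String, String> staging,
                                    Map<String, String> production,
                                    Set<String> allowedToDiffer) {
        Set<String> keys = new TreeSet<>(staging.keySet());
        keys.addAll(production.keySet());

        Map<String, String> drift = new TreeMap<>();
        for (String key : keys) {
            if (allowedToDiffer.contains(key)) {
                continue;
            }
            String stagingValue = staging.get(key);       // null if the key is missing
            String productionValue = production.get(key);
            if (!Objects.equals(stagingValue, productionValue)) {
                drift.put(key, "staging=" + stagingValue + ", production=" + productionValue);
            }
        }
        return drift;
    }

    public static void main(String[] args) {
        Map<String, String> staging = new HashMap<>();
        staging.put("mysql.version", "8.0.28");
        staging.put("jvm.gc", "G1GC");
        staging.put("db.pool.maxConnections", "10");

        Map<String, String> production = new HashMap<>();
        production.put("mysql.version", "8.0.36");
        production.put("jvm.gc", "ZGC");
        production.put("db.pool.maxConnections", "500");

        // Pool size is treated as intentional scaling, not drift (an assumption).
        Set<String> allowed = Collections.singleton("db.pool.maxConnections");

        Map<String, String> drift = diff(staging, production, allowed);
        drift.forEach((key, detail) -> System.out.println("DRIFT " + key + ": " + detail));
    }
}
```

The allowlist is the important design choice: it forces every intentional difference to be written down, so anything else that differs is a bug by definition.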
Shape 3: Third-Party Services Behave Differently Under Real Load
Your test suite mocks external services. The mocks return what you tell them to return — correct responses, on time, every time. Real external services have rate limits, timeouts, eventual consistency, and version-specific quirks that your mocks don't model.
The payment gateway call that your mock answers in 50ms takes 2.3 seconds when the real gateway is under load on Black Friday. Your 2-second timeout was chosen against mock behavior, so production requests start timing out.
This requires testing with realistic response characteristics, not just realistic response data:
// Using WireMock with realistic timing
import static com.github.tomakehurst.wiremock.client.WireMock.*;

// Slow but successful: the response still arrives inside a 2-second timeout.
stubFor(post(urlEqualTo("/payments/charge"))
    .willReturn(aResponse()
        .withStatus(200)
        .withFixedDelay(1500) // milliseconds
        .withBody("{\"status\":\"authorized\"}")));

// Test that your timeout handling is correct:
stubFor(post(urlEqualTo("/payments/charge"))
    .willReturn(aResponse()
        .withStatus(200)
        .withFixedDelay(5000))); // longer than the client timeout, so the call times out
// Assert that the client handles the timeout gracefully
Load testing against a real (non-production) instance of third-party services is more valuable than mocking for integration tests that touch payment, notification, or identity providers.
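What "handles the timeout gracefully" means on the client side can be sketched without any network at all. The example below replaces the real gateway call with a CompletableFuture that sleeps, and scales the delays down so it runs quickly; ChargeClient, slowGateway, and the "pending_review" fallback status are hypothetical names, not part of any real payment API.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

class ChargeClient {
    // Simulated gateway: answers "authorized" after the given latency.
    static CompletableFuture<String> slowGateway(long latencyMillis) {
        return CompletableFuture.supplyAsync(() -> {
            try {
                Thread.sleep(latencyMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            return "authorized";
        });
    }

    // Waits up to timeoutMillis; on timeout, returns a fallback status
    // instead of letting the exception surface as a 500.
    static String chargeWithTimeout(Future<String> gatewayCall, long timeoutMillis) {
        try {
            return gatewayCall.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            gatewayCall.cancel(true);
            return "pending_review"; // hypothetical fallback: reconcile later
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return "pending_review";
        } catch (ExecutionException e) {
            return "failed";
        }
    }

    public static void main(String[] args) {
        // Fast gateway, generous timeout.
        System.out.println(chargeWithTimeout(slowGateway(20), 2000));
        // Gateway slower than the timeout budget.
        System.out.println(chargeWithTimeout(slowGateway(1500), 100));
    }
}
```

The same two cases, one response inside the budget and one outside it, are what the WireMock stubs exercise against the real HTTP client.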
Shape 4: Concurrency Behavior Changes at Scale
A race condition that takes 10,000 concurrent requests to trigger in production will not be triggered by the 5 concurrent requests a CI run generates, if CI attempts concurrent testing at all. Most suites don't. Unit tests are single-threaded by default. Integration tests often run against a database that has a single connection in the test environment.
Bugs that manifest only under production concurrency levels don't show up in CI. They show up in production at peak traffic, when the stakes are highest.
The mitigation is stress tests and chaos tests that run on a schedule (not on every commit — they're too slow) against a staging environment sized to handle real load. These aren't CI tests; they're production readiness tests that catch what CI can't.
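The core of such a stress test is small: release many threads at the same instant against shared state, then check an invariant that only holds if the code is thread-safe. The sketch below is a minimal in-process version of that idea, not a replacement for load against a real staging environment; ConcurrencyHarness and Inventory are hypothetical names.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical shared state: stock that must never be oversold.
class Inventory {
    private final AtomicInteger remaining;

    Inventory(int initial) { remaining = new AtomicInteger(initial); }

    // Compare-and-set loop: reserves one unit, or returns false when empty.
    boolean reserve() {
        while (true) {
            int current = remaining.get();
            if (current <= 0) return false;
            if (remaining.compareAndSet(current, current - 1)) return true;
        }
    }

    int remaining() { return remaining.get(); }
}

class ConcurrencyHarness {
    // Runs the task on `threads` threads released together.
    // Returns true if every thread finished within the timeout.
    static boolean runConcurrently(int threads, Runnable task, long timeoutSeconds) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        CountDownLatch ready = new CountDownLatch(threads);
        CountDownLatch start = new CountDownLatch(1);
        CountDownLatch done = new CountDownLatch(threads);
        try {
            for (int i = 0; i < threads; i++) {
                pool.execute(() -> {
                    ready.countDown();
                    try {
                        start.await();   // wait until all threads are lined up
                        task.run();
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    } finally {
                        done.countDown();
                    }
                });
            }
            ready.await();
            start.countDown();           // release every thread at once
            return done.await(timeoutSeconds, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) {
        Inventory inventory = new Inventory(100);
        AtomicInteger successes = new AtomicInteger();
        // 32 threads x 10 attempts = 320 attempts against 100 units of stock.
        boolean finished = runConcurrently(32, () -> {
            for (int i = 0; i < 10; i++) {
                if (inventory.reserve()) successes.incrementAndGet();
            }
        }, 10);
        System.out.println("finished=" + finished
            + " reserved=" + successes.get()
            + " remaining=" + inventory.remaining());
    }
}
```

The invariant to assert is that successful reservations never exceed the initial stock. Swapping the compare-and-set loop for a plain read-then-write would make that invariant fail intermittently, which is exactly the kind of bug a single-threaded unit test cannot see.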
The Practical Takeaway
Map your last five production incidents to the shape of the failure: was it data assumptions, environment drift, third-party behavior, or concurrency? The pattern tells you which gap in your test strategy to address first. Most teams find that data assumptions cause 60% of their CI-passes-prod-breaks incidents — and that a dedicated "bad data" test fixture library, maintained alongside the regular fixtures, eliminates most of them.