Why Your CI Tests Pass but Production Still Breaks

by Arif Ikhsanudin, Backend Developer

Green in CI. Broken in Production. What Went Wrong?

The postmortem is underway. Tests passed. The build was green. Deployment looked clean. And yet: the payment service is returning 500s, the cause is a NullPointerException on a field that's null in production data but never null in test fixtures, and the incident has been running for 22 minutes.

This is not a test coverage failure in the simple sense. You could have 90% line coverage and still miss this. The failure is a gap between what your tests model and what production actually looks like. That gap has predictable shapes.

Shape 1: Your Test Data Doesn't Reflect Production Data

Test fixtures are written by engineers who understand the domain. They're tidy, complete, and internally consistent. Production data is written by users, imported from legacy systems, corrupted by bugs in previous versions, and filtered through validation rules that changed three times in the last two years.

The most common version of this failure: a field that's logically required but nullable in the database because it was added after the initial schema design. Tests always set it. Production has millions of rows where it's null. The code path that reads it without a null check passes all tests and fails for a meaningful percentage of real requests.

// Passes tests (fixtures always set this field):
public String buildReference(Order order) {
    return order.getExternalReference().toUpperCase(); // NPE in prod
}

// Handles real data:
public String buildReference(Order order) {
    String ref = order.getExternalReference();
    return ref != null ? ref.toUpperCase() : "REF-" + order.getId();
}

The fix requires two things: adding test cases with missing/null/malformed data for every non-trivial field, and periodically sampling anonymized production data to create test fixtures that reflect real-world distributions. The second is more work but catches things the first doesn't.
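A "bad data" fixture set for the method above might look like the following sketch. The `Order` class here is a hypothetical stand-in with only the two fields the example needs, and treating blank or whitespace-only references the same as null is an assumption that goes beyond the original fix; adjust it to whatever your production data actually contains.

```java
import java.util.List;
import java.util.Locale;

// Hypothetical stand-in for the real Order class; only the fields used here.
final class Order {
    private final long id;
    private final String externalReference;

    Order(long id, String externalReference) {
        this.id = id;
        this.externalReference = externalReference;
    }

    long getId() { return id; }
    String getExternalReference() { return externalReference; }
}

final class ReferenceBuilder {
    // Falls back to an ID-based reference when the field is null or blank.
    // Handling blank values is an assumption added for this sketch.
    static String buildReference(Order order) {
        String ref = order.getExternalReference();
        if (ref == null || ref.isBlank()) {
            return "REF-" + order.getId();
        }
        return ref.trim().toUpperCase(Locale.ROOT);
    }
}

final class BadDataFixtures {
    // Rows shaped like production data rather than hand-written fixtures.
    static List<Order> badOrders() {
        return List.of(
            new Order(1L, null),        // column added after the initial schema
            new Order(2L, ""),          // empty string from a legacy import
            new Order(3L, "   "),       // whitespace-only value
            new Order(4L, " ab-123 ")   // untrimmed user input
        );
    }
}

final class BadDataCheck {
    public static void main(String[] args) {
        for (Order order : BadDataFixtures.badOrders()) {
            String ref = ReferenceBuilder.buildReference(order);
            if (ref == null || ref.isBlank()) {
                throw new AssertionError("no usable reference for order " + order.getId());
            }
            System.out.println(order.getId() + " -> " + ref);
        }
    }
}
```

Keeping fixtures like these in one shared place means every new code path that reads the field can be run against the same awkward rows.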

Shape 2: Your Staging Environment Is Not Production

Staging has different configuration, different backing service versions, different data volume, different traffic patterns, and usually different infrastructure sizing. Tests that pass in staging encode all of these differences as implicit assumptions.

Common divergences that cause prod-only failures:

  • Different database version: Staging on MySQL 8.0.28, production on MySQL 8.0.36. SQL that works on 8.0.28 may behave differently on 8.0.36 in edge cases involving date handling, JSON functions, or index selection.
  • Different data volume and memory limits: A query whose result set loads comfortably into memory in staging (100K rows) causes an OOM in production (10M rows).
  • Different connection pool sizing: Staging configured for 10 concurrent connections, production for 500. Thread contention patterns differ.
  • Different JVM flags: Staging using G1GC, production using ZGC (or vice versa). GC pauses affect timeout behavior.

The discipline required: treat configuration drift between staging and production as a bug. Maintain both environments from the same infrastructure-as-code base (Terraform modules, Helm charts). Alert on divergence. Staging should be production at smaller scale, not a different thing.
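Alerting on divergence can start as something very small. The sketch below compares two flat key/value views of each environment's effective configuration and reports every key that differs, apart from an explicit allowlist of keys that are expected to differ (such as sizing). The key names and the idea of exporting configuration as a flat map are assumptions made for illustration; in practice the maps would come from your infrastructure-as-code outputs or a runtime config endpoint.

```java
import java.util.Map;
import java.util.Objects;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

final class ConfigDrift {
    // Returns one entry per drifting key, sorted by key name.
    // Keys listed in allowedToDiffer are skipped.
    static Map<String, String> diff(Map<String, String> staging,
                                    Map<String, String> production,
                                    Set<String> allowedToDiffer) {
        Set<String> keys = new TreeSet<>(staging.keySet());
        keys.addAll(production.keySet());

        Map<String, String> drift = new TreeMap<>();
        for (String key : keys) {
            if (allowedToDiffer.contains(key)) {
                continue;
            }
            String s = staging.get(key);       // null if the key is missing
            String p = production.get(key);
            if (!Objects.equals(s, p)) {
                drift.put(key, "staging=" + s + ", production=" + p);
            }
        }
        return drift;
    }

    public static void main(String[] args) {
        // Hypothetical key names; values taken from the examples in this article.
        Map<String, String> staging = Map.of(
            "mysql.version", "8.0.28",
            "jvm.gc", "G1GC",
            "db.pool.size", "10");
        Map<String, String> production = Map.of(
            "mysql.version", "8.0.36",
            "jvm.gc", "ZGC",
            "db.pool.size", "500");

        // Pool size is expected to differ because staging is smaller.
        Map<String, String> drift = diff(staging, production, Set.of("db.pool.size"));
        drift.forEach((key, detail) -> System.out.println("DRIFT " + key + ": " + detail));
    }
}
```

The allowlist is the important design choice: it forces every intentional difference to be written down, so anything else that shows up is treated as a bug.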

Shape 3: Third-Party Services Behave Differently Under Real Load

Your test suite mocks external services. The mocks return what you tell them to return — correct responses, on time, every time. Real external services have rate limits, timeouts, eventual consistency, and version-specific quirks that your mocks don't model.

The payment gateway call that your mock answers in 50ms takes 2.3 seconds when the real gateway is under load on Black Friday. Your 2-second timeout was chosen based on mock behavior, so production requests start timing out.

This requires testing with realistic response characteristics, not just realistic response data:

// Using WireMock with realistic timing
// Assumes the static DSL: import static com.github.tomakehurst.wiremock.client.WireMock.*;
// Both stubs match the same request, so register each one in its own test.

// Slow but successful: 1.5 s stays inside a 2-second client timeout
stubFor(post(urlEqualTo("/payments/charge"))
    .willReturn(aResponse()
        .withStatus(200)
        .withFixedDelay(1500)      // Delay in milliseconds
        .withBody("{\"status\":\"authorized\"}")));

// Too slow: 5 s exceeds a 2-second client timeout
stubFor(post(urlEqualTo("/payments/charge"))
    .willReturn(aResponse()
        .withStatus(200)
        .withFixedDelay(5000)));   // The client should give up before this arrives

// Assert that the client handles the timeout gracefully

For integrations that touch payment, notification, or identity providers, testing against a real (non-production) instance of the third-party service, including under load, is more valuable than testing against mocks.
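The client side of a timeout test can also be exercised without any mocking library. The following is a minimal sketch that uses the JDK's built-in `HttpServer` as a stand-in slow gateway and `java.net.http.HttpClient` with a per-request timeout. The class name, the returned status strings, and the scaled-down delays (1.5 seconds against a 300 ms timeout) are assumptions chosen to keep the example fast; it is not how any particular gateway client is implemented.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.net.http.HttpTimeoutException;
import java.nio.charset.StandardCharsets;
import java.time.Duration;
import java.util.concurrent.Executors;

final class SlowGatewayCheck {
    // Starts a local stand-in gateway that waits delayMillis before answering.
    static HttpServer startSlowGateway(long delayMillis) throws IOException {
        HttpServer server = HttpServer.create(new InetSocketAddress("127.0.0.1", 0), 0);
        server.createContext("/payments/charge", exchange -> {
            try {
                Thread.sleep(delayMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            byte[] body = "{\"status\":\"authorized\"}".getBytes(StandardCharsets.UTF_8);
            exchange.sendResponseHeaders(200, body.length);
            try (OutputStream out = exchange.getResponseBody()) {
                out.write(body);
            }
        });
        // Daemon threads so a still-sleeping handler never blocks JVM exit.
        server.setExecutor(Executors.newCachedThreadPool(r -> {
            Thread t = new Thread(r);
            t.setDaemon(true);
            return t;
        }));
        server.start();
        return server;
    }

    // Returns "authorized", "error", or "timeout" instead of letting the
    // timeout escape as an unhandled exception.
    static String chargeWithDelay(long gatewayDelayMillis, Duration clientTimeout) {
        HttpServer gateway = null;
        try {
            gateway = startSlowGateway(gatewayDelayMillis);
            int port = gateway.getAddress().getPort();
            HttpRequest request = HttpRequest
                .newBuilder(URI.create("http://127.0.0.1:" + port + "/payments/charge"))
                .timeout(clientTimeout)
                .POST(HttpRequest.BodyPublishers.ofString("{\"amount\":100}"))
                .build();
            HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
            return response.statusCode() == 200 ? "authorized" : "error";
        } catch (HttpTimeoutException e) {
            return "timeout";
        } catch (IOException e) {
            throw new IllegalStateException("gateway call failed", e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted", e);
        } finally {
            if (gateway != null) {
                gateway.stop(0);
            }
        }
    }

    public static void main(String[] args) {
        System.out.println(chargeWithDelay(20, Duration.ofSeconds(2)));
        System.out.println(chargeWithDelay(1500, Duration.ofMillis(300)));
    }
}
```

What matters in a test like this is that the slow case is mapped to an explicit outcome the caller can act on (retry, queue, or fail the order cleanly) rather than surfacing as a 500.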

Shape 4: Concurrency Behavior Changes at Scale

A race condition that surfaces regularly at 10,000 concurrent requests in production may never surface at the 5 concurrent requests a CI run generates, if CI attempts concurrent testing at all. Most pipelines don't. Unit tests are single-threaded by default. Integration tests often run against a test database through a single connection.

Bugs that manifest only under production concurrency levels don't show up in CI. They show up in production at peak traffic, when the stakes are highest.

The mitigation is stress tests and chaos tests that run on a schedule (not on every commit — they're too slow) against a staging environment sized to handle real load. These aren't CI tests; they're production readiness tests that catch what CI can't.
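The core of such a stress test is small: release many threads at the same instant and assert an invariant afterwards. The sketch below does this for a hypothetical `Inventory` class with an atomic check-and-decrement; the class, the thread count, and the invariant (never sell more than the stock) are illustrative assumptions, and a real test would point the same harness at a service in staging.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical component under test.
final class Inventory {
    private final AtomicInteger stock;

    Inventory(int initial) {
        this.stock = new AtomicInteger(initial);
    }

    // Atomic check-and-decrement: a plain "if (stock > 0) stock--" would
    // pass single-threaded tests and oversell under concurrency.
    boolean reserve() {
        while (true) {
            int current = stock.get();
            if (current <= 0) {
                return false;
            }
            if (stock.compareAndSet(current, current - 1)) {
                return true;
            }
        }
    }

    int remaining() {
        return stock.get();
    }
}

final class ConcurrencyStress {
    // Fires `threads` reservations as close to simultaneously as possible
    // and returns how many succeeded.
    static int hammer(Inventory inventory, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        CountDownLatch ready = new CountDownLatch(threads);
        CountDownLatch go = new CountDownLatch(1);
        AtomicInteger successes = new AtomicInteger();
        try {
            for (int i = 0; i < threads; i++) {
                pool.submit(() -> {
                    ready.countDown();
                    try {
                        go.await();              // wait for the starting gun
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                        return;
                    }
                    if (inventory.reserve()) {
                        successes.incrementAndGet();
                    }
                });
            }
            ready.await();                       // all workers are parked
            go.countDown();                      // release them together
            pool.shutdown();
            if (!pool.awaitTermination(30, TimeUnit.SECONDS)) {
                throw new IllegalStateException("workers did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted", e);
        } finally {
            pool.shutdownNow();
        }
        return successes.get();
    }

    public static void main(String[] args) {
        Inventory inventory = new Inventory(50);
        int sold = hammer(inventory, 200);
        if (sold != 50 || inventory.remaining() != 0) {
            throw new AssertionError("oversold or undersold: " + sold);
        }
        System.out.println("sold=" + sold + " remaining=" + inventory.remaining());
    }
}
```

The two latches are what make the test meaningful: without the shared start gate, thread startup staggers the calls and the contention window mostly disappears.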

The Practical Takeaway

Map your last five production incidents to the shape of the failure: was it data assumptions, environment drift, third-party behavior, or concurrency? The pattern tells you which gap in your test strategy to address first. Many teams find that data assumptions cause the majority, roughly 60%, of their CI-passes-prod-breaks incidents, and that a dedicated "bad data" test fixture library, maintained alongside the regular fixtures, eliminates most of them.
