Your Pipeline Is Flaky and That Is a Bigger Problem Than You Think

by Arif Ikhsanudin, Backend Developer

The Red Build Nobody Investigates

Your pipeline fails. A developer looks at the job name and the stage, pulls up the logs, sees "connection refused" on a Testcontainers startup, and clicks retry. Forty seconds later, it's green. They move on. This happened three times yesterday. Nobody filed a ticket.

This is the slow death of your CI system. It does not come through a catastrophic failure but through accumulated tolerance for failures that "don't count." The team has learned that some red builds are real (a code problem) and some are noise (an environment problem), and they tell the two apart by feel rather than by anything the pipeline reports. Once that distinction becomes learned behavior, your pipeline has stopped being a reliable safety net.

Why Flakiness Is a Trust Problem, Not a Time Problem

The obvious cost of a flaky test is the time spent on retries. A pipeline that runs 30 times a day with a 5% flake rate wastes 1.5 pipeline runs per day on retries — annoying but not catastrophic.

The hidden cost is that developers learn to discount red builds. Once the team accepts "that's probably just flakiness" as a valid response to a failure, every genuine failure has to compete with that assumption. How many times will a developer retry a real regression before concluding it's a flake? Once? Twice? The answer depends on how often retrying worked before — which is exactly what a high flake rate trains them to expect.

In high-flake environments, genuine regressions get merged. Not because developers are careless, but because the pipeline has taught them that red doesn't mean broken.

The Common Sources and Their Fixes

Time-dependent tests are the most common and most fixable. Any test that calls new Date(), System.currentTimeMillis(), or Instant.now() directly is potentially flaky if it asserts on timing or relies on a specific temporal state.

// Flaky: behavior changes based on when the test runs
@Test
void shouldRejectExpiredToken() {
    Token token = new Token(Instant.now().minusSeconds(5));
    assertTrue(token.isExpired()); // reads the real clock, so the result depends on when this line runs
}

// Stable: inject a controllable clock
@Test
void shouldRejectExpiredToken() {
    Clock fixed = Clock.fixed(Instant.parse("2026-04-25T10:00:00Z"), ZoneOffset.UTC);
    Token token = new Token(Instant.parse("2026-04-25T09:59:54Z"), fixed);
    assertTrue(token.isExpired()); // the token reads the injected clock, never the real one
}
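
For the stable version to work, the class under test has to accept the clock. The article does not show Token itself, so the following is only a minimal sketch under assumptions: the constructor takes the issue time plus a Clock, and the 5-second lifetime (TTL) is a hypothetical value picked to match the timestamps above.

```java
import java.time.Clock;
import java.time.Duration;
import java.time.Instant;
import java.time.ZoneOffset;

// Hypothetical Token: the real class is not shown in the article.
final class Token {
    // Assumed lifetime, chosen only for illustration.
    static final Duration TTL = Duration.ofSeconds(5);

    private final Instant issuedAt;
    private final Clock clock;

    Token(Instant issuedAt, Clock clock) {
        this.issuedAt = issuedAt;
        this.clock = clock;
    }

    // Expired once more than TTL has elapsed according to the injected clock.
    boolean isExpired() {
        Duration age = Duration.between(issuedAt, clock.instant());
        return age.compareTo(TTL) > 0;
    }
}

final class TokenClockDemo {
    public static void main(String[] args) {
        Clock fixed = Clock.fixed(Instant.parse("2026-04-25T10:00:00Z"), ZoneOffset.UTC);
        // Issued 6 seconds before the fixed "now": older than the TTL.
        System.out.println(new Token(Instant.parse("2026-04-25T09:59:54Z"), fixed).isExpired()); // true
        // Issued 4 seconds before the fixed "now": still valid.
        System.out.println(new Token(Instant.parse("2026-04-25T09:59:56Z"), fixed).isExpired()); // false
    }
}
```

Production code would pass Clock.systemUTC(), so only the tests ever see a frozen clock.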

External service dependencies are the second largest source. Tests that hit real HTTP endpoints, real databases, or real message brokers are subject to network variability, service availability, and rate limiting. Mock external services with WireMock for HTTP, use Testcontainers for databases but with properly configured startup health checks, and use in-memory implementations (like an embedded Kafka) for message brokers in unit tests.

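// Testcontainers with an explicit readiness wait.
// Wait.forHealthcheck() needs the image to define a Docker HEALTHCHECK, and the stock
// postgres image ships without one, so wait for the readiness log line instead.
@Container
static PostgreSQLContainer<?> postgres = new PostgreSQLContainer<>("postgres:16")
    // Don't just wait for the port: Postgres logs this line twice on first startup.
    // The timeout is set on the strategy itself so waitingFor() cannot discard it.
    .waitingFor(Wait.forLogMessage(".*database system is ready to accept connections.*\\s", 2)
        .withStartupTimeout(Duration.ofSeconds(60)));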
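
The HTTP half of that advice can be illustrated without extra dependencies. This is a rough sketch that uses the JDK's built-in com.sun.net.httpserver.HttpServer as a stand-in for WireMock (it is not WireMock's API); the /rates path, the JSON body, and the class and method names are all made up for the example.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

final class StubbedServiceDemo {
    // Starts a local stub that answers requests to /rates with a canned body.
    // Port 0 lets the OS choose a free port, so parallel jobs cannot collide.
    static HttpServer startStub(String body) throws IOException {
        HttpServer server = HttpServer.create(new InetSocketAddress("127.0.0.1", 0), 0);
        server.createContext("/rates", exchange -> {
            byte[] bytes = body.getBytes(StandardCharsets.UTF_8);
            exchange.getResponseHeaders().add("Content-Type", "application/json");
            exchange.sendResponseHeaders(200, bytes.length);
            try (OutputStream out = exchange.getResponseBody()) {
                out.write(bytes);
            }
        });
        server.start();
        return server;
    }

    // Stands in for the code under test, which would normally call the real service.
    static String fetch(int port) throws IOException, InterruptedException {
        HttpRequest request = HttpRequest
            .newBuilder(URI.create("http://127.0.0.1:" + port + "/rates"))
            .build();
        return HttpClient.newHttpClient()
            .send(request, HttpResponse.BodyHandlers.ofString())
            .body();
    }

    public static void main(String[] args) throws Exception {
        HttpServer stub = startStub("{\"EUR\":1.0}");
        try {
            System.out.println(fetch(stub.getAddress().getPort()));
        } finally {
            stub.stop(0);
        }
    }
}
```

The test now depends only on the loopback interface, so rate limits and outages on the real service cannot turn the build red.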

Shared mutable state between tests causes interference that depends on execution order. This is particularly common in Spring Boot integration tests that share an application context with mutable singletons or caches. Use @DirtiesContext to force context reload when state modification is unavoidable, or redesign tests to be state-independent.
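
The order dependence is easy to reproduce outside Spring. The sketch below uses a hypothetical InMemoryCache and two hand-rolled checks in place of real test methods; it only illustrates why a fresh fixture per test removes the interference.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical cache standing in for any mutable singleton.
final class InMemoryCache {
    private final Map<String, String> entries = new HashMap<>();

    void put(String key, String value) {
        entries.put(key, value);
    }

    int size() {
        return entries.size();
    }
}

final class StateIsolationDemo {
    // "Test" A: writes one entry and expects to see exactly one.
    static boolean writesOneEntry(InMemoryCache cache) {
        cache.put("user:1", "Ana");
        return cache.size() == 1;
    }

    // "Test" B: expects an empty cache.
    static boolean startsEmpty(InMemoryCache cache) {
        return cache.size() == 0;
    }

    public static void main(String[] args) {
        // Shared instance: B passes or fails depending on whether A ran first.
        InMemoryCache shared = new InMemoryCache();
        writesOneEntry(shared);
        System.out.println("shared, write ran first: " + startsEmpty(shared)); // false

        // Fresh instance per check: the result no longer depends on order.
        System.out.println("fresh instance: " + startsEmpty(new InMemoryCache())); // true
    }
}
```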

Resource contention on CI runners is another source: tests that bind to specific ports, write to specific file paths, or allocate more memory than the runner has available. Use random port allocation (server.port=0, or webEnvironment = RANDOM_PORT on @SpringBootTest), write to temp directories, and check runner memory specs against what your tests actually need.
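
Outside Spring, the same two ideas need nothing beyond the JDK. A minimal sketch follows; the class name, method names, and the "ci-test-" prefix are arbitrary choices for illustration.

```java
import java.io.IOException;
import java.net.ServerSocket;
import java.nio.file.Files;
import java.nio.file.Path;

final class IsolatedResources {
    // Binding to port 0 asks the OS for any free port. Keep the socket open
    // while the port is in use so nothing else can claim it in the meantime.
    static ServerSocket openOnFreePort() throws IOException {
        return new ServerSocket(0);
    }

    // Every call returns a new, uniquely named directory under the system temp dir.
    static Path newWorkDir() throws IOException {
        return Files.createTempDirectory("ci-test-");
    }

    public static void main(String[] args) throws IOException {
        Path workDir = newWorkDir();
        try (ServerSocket server = openOnFreePort()) {
            Path report = Files.writeString(workDir.resolve("result.txt"), "ok");
            System.out.println("listening on port " + server.getLocalPort());
            System.out.println("wrote " + report);
            Files.delete(report);
        } finally {
            Files.delete(workDir);
        }
    }
}
```

Two jobs running this on the same runner get different ports and different directories, so neither can break the other.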

Tracking Flakiness

You can't fix what you're not measuring. Set up flake tracking before optimizing:

# Simplified flake detection: runs that failed, then passed on retry
# Query your CI API for the last 30 days of runs
# Flag any run where: final_status == 'success' AND any_prior_attempt_status == 'failed'

flake_rate = flaky_runs / total_runs

# Per-test: if you export test results as JUnit XML, aggregate across runs
# A test that shows both PASS and FAIL in the last 100 runs is flaky
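
The two heuristics above might look like this once the CI data is loaded into memory. This is a sketch, not a finished tool: the record shapes (PipelineRun, TestResult) are assumptions, fetching runs from a CI API or parsing JUnit XML is left out, and records need Java 16 or later.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

final class FlakeReport {
    // One pipeline run with the outcome of each attempt, in order (true = passed).
    record PipelineRun(String id, List<Boolean> attemptsPassed) {
        // Flaky run: the final attempt passed and at least one earlier attempt failed.
        boolean passedOnlyAfterRetry() {
            int last = attemptsPassed.size() - 1;
            return last > 0
                && attemptsPassed.get(last)
                && attemptsPassed.subList(0, last).contains(false);
        }
    }

    // One test outcome taken from one run's JUnit XML.
    record TestResult(String runId, String testName, boolean passed) {}

    // flake_rate = flaky_runs / total_runs
    static double flakeRate(List<PipelineRun> runs) {
        if (runs.isEmpty()) {
            return 0.0;
        }
        long flakyRuns = runs.stream().filter(PipelineRun::passedOnlyAfterRetry).count();
        return (double) flakyRuns / runs.size();
    }

    // Names of tests that show both a pass and a failure in the given history.
    static Set<String> flakyTests(List<TestResult> history) {
        Map<String, Set<Boolean>> outcomes = new HashMap<>();
        for (TestResult result : history) {
            outcomes.computeIfAbsent(result.testName(), name -> new HashSet<>())
                    .add(result.passed());
        }
        Set<String> flaky = new TreeSet<>();
        outcomes.forEach((name, seen) -> {
            if (seen.size() == 2) {
                flaky.add(name);
            }
        });
        return flaky;
    }
}
```

Given four runs of which one failed and then passed on retry, flakeRate returns 0.25. The per-test check is most trustworthy when the history is restricted to runs of the same commit, since a pass and a failure on different commits can also be a real regression followed by its fix.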

Some CI platforms (CircleCI and Buildkite, for example) offer test insights that show flaky tests over time. If yours doesn't, export JUnit XML from your test runner and aggregate it externally.

Set a target, such as a flake rate of at most 1% across all pipeline runs, and treat exceeding it as a P2 incident. Not a someday cleanup task. An active incident with an owner and a resolution date.

The Policy That Accelerates Fixing

The most effective policy for eliminating flakiness is simple: any test that flakes twice in a week gets quarantined (moved to a non-blocking suite) within 24 hours, and gets fixed or deleted within two weeks. Quarantined means it still runs but doesn't block merging, so flakiness doesn't propagate into developer workflow while the fix is in progress.

This policy forces a decision: fix the test or delete it. Both are better than a flaky test in a blocking suite. The tests that "we should fix someday" never get fixed. The tests in a quarantine queue with a deadline do.
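
The rule is mechanical enough to automate. Here is a sketch of the decision logic only, assuming you already record a timestamp for each observed flake; it reads "twice in a week" as a rolling 7-day window and counts the two-week deadline from the moment of quarantine, both of which are interpretations rather than something the policy spells out.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.List;

final class QuarantinePolicy {
    static final int FLAKE_THRESHOLD = 2;                     // "flakes twice"
    static final Duration WINDOW = Duration.ofDays(7);        // "in a week"
    static final Duration FIX_DEADLINE = Duration.ofDays(14); // "within two weeks"

    // True when at least FLAKE_THRESHOLD flakes fall inside the window ending at now.
    static boolean shouldQuarantine(List<Instant> flakeTimes, Instant now) {
        Instant windowStart = now.minus(WINDOW);
        long recentFlakes = flakeTimes.stream()
            .filter(time -> !time.isBefore(windowStart) && !time.isAfter(now))
            .count();
        return recentFlakes >= FLAKE_THRESHOLD;
    }

    // Date by which a quarantined test must be fixed or deleted.
    static Instant fixOrDeleteBy(Instant quarantinedAt) {
        return quarantinedAt.plus(FIX_DEADLINE);
    }
}
```

A scheduled job could run shouldQuarantine over each test's flake history and open a ticket carrying the fixOrDeleteBy date, which gives the quarantine queue the owner and deadline the policy depends on.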
