Your Microservices Are Running. But Are They Healthy?

by Arif Ikhsanudin, Backend Developer

The gap between "running" and "healthy"

Your Kubernetes dashboard shows all pods green. Every service is running, zero restarts. A user calls your support team: "The app has been broken for twenty minutes." You check the logs. Order Service has been returning 503s because its database connection pool was exhausted — all threads blocked on a slow query. But the pod never crashed, so Kubernetes never restarted it, and your monitoring showed green pods.

A process that is running is not necessarily a process that is serving traffic correctly. Without health checks that test whether a service can actually do its job, "running" is a meaningless status indicator.

Liveness versus readiness: the critical distinction

Kubernetes provides two independent health check mechanisms with different semantics:

Liveness probe: "Is this process alive and worth keeping?" If a liveness probe fails, Kubernetes kills the pod and restarts it. Use this for stuck processes — deadlocks, infinite loops, unrecoverable errors. Liveness probes should be cheap and check only internal state. They should not check dependencies: if the database is down, the pod is not stuck — it's waiting. Killing and restarting it won't fix the database.

Readiness probe: "Is this process ready to serve traffic?" If a readiness probe fails, Kubernetes removes the pod from the service endpoint list — traffic stops being routed to it — but does not kill the pod. The pod stays running and can recover. Use this to indicate whether the service can currently handle requests, including dependency availability.

livenessProbe:
  httpGet:
    path: /actuator/health/liveness
    port: 8080
  initialDelaySeconds: 30
  periodSeconds: 10
  failureThreshold: 3

readinessProbe:
  httpGet:
    path: /actuator/health/readiness
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 5
  failureThreshold: 3
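
To make the two contracts concrete outside any framework, here is a minimal sketch using the JDK's built-in HTTP server. The class name `ProbeEndpoints`, the `alive` and `ready` suppliers, and the `probe` helper are illustrative assumptions, not part of Spring or Kubernetes. The kubelet treats any status from 200 to 399 as success, so the sketch answers 200 or 503.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.InetSocketAddress;
import java.net.URI;
import java.util.function.BooleanSupplier;

final class ProbeEndpoints {

    /** Starts a server whose two probe paths answer 200 or 503 based on the given checks. */
    static HttpServer start(int port, BooleanSupplier alive, BooleanSupplier ready) {
        try {
            HttpServer server = HttpServer.create(new InetSocketAddress(port), 0);
            register(server, "/actuator/health/liveness", alive);   // internal state only
            register(server, "/actuator/health/readiness", ready);  // may include dependencies
            server.start();
            return server;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    private static void register(HttpServer server, String path, BooleanSupplier check) {
        server.createContext(path, exchange -> {
            int status = check.getAsBoolean() ? 200 : 503;
            exchange.sendResponseHeaders(status, -1); // -1 means no response body
            exchange.close();
        });
    }

    /** Plays the kubelet's role: GET the path and return the HTTP status code. */
    static int probe(HttpServer server, String path) {
        try {
            int port = server.getAddress().getPort();
            URI uri = URI.create("http://127.0.0.1:" + port + path);
            HttpURLConnection conn = (HttpURLConnection) uri.toURL().openConnection();
            conn.setConnectTimeout(1000);
            conn.setReadTimeout(1000);
            try {
                return conn.getResponseCode();
            } finally {
                conn.disconnect();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

If the `ready` supplier starts returning false because the database is down, the readiness path answers 503 while the liveness path keeps answering 200. That is the behavior you want: traffic is withheld, but the pod is not restarted.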

What each probe should check

Liveness endpoint (/actuator/health/liveness): return 200 if the process is alive and able to handle requests in principle. Check nothing external. A Spring Boot application using Actuator exposes this endpoint automatically when it detects that it is running on Kubernetes (or when management.endpoint.health.probes.enabled=true is set). It reflects internal application state only: the application's LivenessState, reported through LivenessStateHealthIndicator. This endpoint should always be fast (< 50ms).

Readiness endpoint (/actuator/health/readiness): return 200 if the service can currently serve traffic correctly. Check:

  • Database connection pool: can you acquire and release a connection?
  • Required external services: can you reach critical dependencies (not all dependencies — only those you cannot function without)?
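  • Internal state: are required caches warmed? Are required resources loaded?

import java.sql.Connection;
import java.sql.SQLException;
import javax.sql.DataSource;
import com.zaxxer.hikari.HikariDataSource;
import com.zaxxer.hikari.HikariPoolMXBean;
import org.springframework.boot.actuate.health.Health;
import org.springframework.boot.actuate.health.HealthIndicator;
import org.springframework.stereotype.Component;

// To affect readiness, add it to the group: management.endpoint.health.group.readiness.include=readinessState,database
@Component
public class DatabaseHealthIndicator implements HealthIndicator {
    private final DataSource dataSource;

    public DatabaseHealthIndicator(DataSource dataSource) {
        this.dataSource = dataSource;
    }

    @Override
    public Health health() {
        try (Connection conn = dataSource.getConnection()) {
            // Validate connection is usable, not just that pool has connections.
            // isValid returns false rather than throwing, so the result must be checked.
            if (!conn.isValid(1)) { // 1 second timeout
                return Health.down().withDetail("error", "connection validation failed").build();
            }
            // Assumption: the pool is HikariCP (Spring Boot's default); adapt for other pools.
            HikariPoolMXBean pool = dataSource.unwrap(HikariDataSource.class).getHikariPoolMXBean();
            return Health.up()
                .withDetail("pool.active", pool.getActiveConnections())
                .withDetail("pool.idle", pool.getIdleConnections())
                .build();
        } catch (SQLException e) {
            return Health.down()
                .withDetail("error", e.getMessage())
                .build();
        }
    }
}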
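
One caveat with a connection-based check: when the pool is exhausted, acquiring a connection can itself block until the pool's own timeout expires, and the probe then hangs along with everything else. A minimal JDK-only sketch of bounding how long any single check may take is shown below. `BoundedCheck` and `passesWithin` are hypothetical names, and treating a timeout or an exception as "not ready" is an assumption you should confirm fits your service.

```java
import java.time.Duration;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.BooleanSupplier;

final class BoundedCheck {
    // Daemon threads so a hung check never prevents JVM shutdown.
    private static final ExecutorService POOL = Executors.newCachedThreadPool(runnable -> {
        Thread thread = new Thread(runnable, "health-check");
        thread.setDaemon(true);
        return thread;
    });

    /** Returns true only if the check reports success within the time limit. */
    static boolean passesWithin(BooleanSupplier check, Duration limit) {
        Callable<Boolean> task = check::getAsBoolean;
        Future<Boolean> result = POOL.submit(task);
        try {
            return result.get(limit.toMillis(), TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            result.cancel(true); // interrupt the stuck check
            return false;        // too slow counts as not ready
        } catch (ExecutionException e) {
            return false;        // the check threw: treat as failure
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```

Wrapping the database check this way keeps the readiness endpoint responsive even in the slow-query scenario from the introduction, which is exactly when you need it to answer.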

Dependency health: what to check and what to skip

The readiness probe should include critical dependencies — those without which the service cannot serve any meaningful traffic — and exclude non-critical ones.

If Order Service's primary database is unavailable, Order Service should be marked not-ready. It cannot process orders without it.

If Order Service's recommendation cache (Redis) is unavailable, Order Service should remain ready. Recommendations are a non-critical feature; the service degrades gracefully without them. Including Redis in the readiness check would cause Order Service pods to fail readiness when Redis has issues — unnecessarily removing healthy capacity from the load balancer.

Document which dependencies are critical for each service. This decision belongs in the service's architecture decision record, not in an undocumented health check configuration.
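
The critical versus non-critical split can be made explicit in code rather than left implicit in which checks happen to be wired up. The following is a sketch under assumptions: `ReadinessPolicy`, `Dependency`, and `Result` are hypothetical names, and the dependency names in the usage note are examples only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BooleanSupplier;

final class ReadinessPolicy {
    /** A named dependency check; `critical` decides whether a failure blocks readiness. */
    record Dependency(String name, boolean critical, BooleanSupplier check) {}

    /** Overall readiness, plus which critical and non-critical dependencies are down. */
    record Result(boolean ready, List<String> failedCritical, List<String> degraded) {}

    static Result evaluate(List<Dependency> dependencies) {
        List<String> failedCritical = new ArrayList<>();
        List<String> degraded = new ArrayList<>();
        for (Dependency dependency : dependencies) {
            if (dependency.check().getAsBoolean()) {
                continue; // healthy dependency, nothing to record
            }
            (dependency.critical() ? failedCritical : degraded).add(dependency.name());
        }
        // Only critical failures make the service not-ready.
        return new Result(failedCritical.isEmpty(), failedCritical, degraded);
    }
}
```

With an "orders-db" dependency marked critical and a "recommendation-cache" dependency marked non-critical, a Redis outage yields `ready == true` with the cache listed under `degraded`. You can log or export that list as a metric without pulling pods out of rotation.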

Startup probes for slow-starting services

Some services — particularly those with heavy initialization (loading large models, warming caches from a database) — take a long time to be ready. If your liveness probe starts checking too early, Kubernetes kills the pod before it finishes starting.

The startup probe delays liveness checking until startup is complete:

startupProbe:
  httpGet:
    path: /actuator/health/liveness
    port: 8080
  failureThreshold: 30      # try for up to 5 minutes (30 x 10s)
  periodSeconds: 10
  # After startup probe succeeds, liveness and readiness probes take over

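While the startup probe is running, Kubernetes holds off on liveness and readiness checks, so the pod is not killed for failing liveness. After the startup probe succeeds once, it stops, and the liveness and readiness probes take over normally. This gives a slow-starting service a budget of roughly failureThreshold × periodSeconds (here 30 × 10 = 300 seconds) to start. If it has still not passed by then, Kubernetes kills the container and applies the pod's restart policy, so size the budget for your worst-case startup time.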
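
To see why the probe settings matter for a slow starter, here is a deliberately simplified model of the probe loop. It assumes probes fire exactly at initialDelaySeconds and then every periodSeconds, and that the application becomes healthy at a fixed moment; the real kubelet adds timeouts and some timing variation. `ProbeSimulation` and `killedAt` are hypothetical names.

```java
final class ProbeSimulation {
    /**
     * Returns the second at which the container is killed, or -1 if a probe
     * succeeds first. The app is assumed healthy from startupSeconds onward.
     */
    static int killedAt(int startupSeconds, int initialDelaySeconds,
                        int periodSeconds, int failureThreshold) {
        int failures = 0;
        for (int t = initialDelaySeconds; ; t += periodSeconds) {
            if (t >= startupSeconds) {
                return -1; // probe succeeds before the threshold is reached
            }
            failures++;
            if (failures >= failureThreshold) {
                return t; // consecutive failures hit the threshold
            }
        }
    }
}
```

For a service that needs 120 seconds to start, the liveness settings shown earlier (delay 30, period 10, threshold 3) give `killedAt(120, 30, 10, 3) == 50`: the pod is killed at about 50 seconds and never finishes starting. With the startup probe settings (period 10, threshold 30), `killedAt(120, 0, 10, 30)` returns -1, so the pod survives.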

Beyond process health: synthetic monitoring

Kubernetes health checks verify that individual service processes are functional. They don't verify that end-to-end user flows are working. Synthetic monitoring — automated tests that exercise real user journeys through your production system — fills this gap.

A synthetic test for checkout: every five minutes, place a test order with a dedicated test user, verify that payment processes, verify that a confirmation event is published. If any step fails, alert. This catches integration failures (a bug in how Order Service and Payment Service interact) that health checks on individual services would never surface.

Implement synthetic monitoring for your three to five most critical user flows. Keep the tests simple, safe to run repeatedly, and self-cleaning (test data is removed automatically after every run), and route their alerts directly to your on-call rotation. A failed synthetic test is as urgent as a failed health check.
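
The shape of such a test can be sketched as a small runner: ordered steps, stop at the first failure, always clean up, alert on failure. Everything here is an assumption for illustration. `SyntheticCheck` is a hypothetical class, and the steps, cleanup, and alert hook are placeholders for calls into your own order, payment, and paging systems.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;
import java.util.function.BooleanSupplier;
import java.util.function.Consumer;

final class SyntheticCheck {
    private final Map<String, BooleanSupplier> steps = new LinkedHashMap<>(); // keeps step order
    private final Runnable cleanup;
    private final Consumer<String> alert;

    SyntheticCheck(Runnable cleanup, Consumer<String> alert) {
        this.cleanup = cleanup;
        this.alert = alert;
    }

    SyntheticCheck step(String name, BooleanSupplier action) {
        steps.put(name, action);
        return this;
    }

    /** Runs steps in order; returns the first failing step's name, or empty if all pass. */
    Optional<String> run() {
        try {
            for (Map.Entry<String, BooleanSupplier> step : steps.entrySet()) {
                if (!passes(step.getValue())) {
                    alert.accept("synthetic checkout failed at: " + step.getKey());
                    return Optional.of(step.getKey());
                }
            }
            return Optional.empty();
        } finally {
            cleanup.run(); // always remove test data, pass or fail
        }
    }

    private static boolean passes(BooleanSupplier action) {
        try {
            return action.getAsBoolean();
        } catch (RuntimeException e) {
            return false; // an exception inside a step counts as a failure
        }
    }
}
```

For the checkout flow, you would register three steps (place the test order, verify payment, verify the confirmation event) and call `run()` every five minutes, for example from a `ScheduledExecutorService.scheduleAtFixedRate` task. Because cleanup sits in a `finally` block, a failed run does not leave test orders behind for the next one.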
