Your Microservices Are Running. But Are They Healthy?

April 24, 2026

by Arif Ikhsanudin, Backend Developer

The gap between "running" and "healthy"

Your Kubernetes dashboard shows all pods green. Every service is running, zero restarts. A user calls your support team: "The app has been broken for twenty minutes." You check the logs. Order Service has been returning 503s because its database connection pool was exhausted — all threads blocked on a slow query. But the pod never crashed, so Kubernetes never restarted it, and your monitoring showed green pods.

A process that is running is not necessarily a process that is serving traffic correctly. Without health checks that test whether a service can actually do its job, "running" is a meaningless status indicator.

Liveness versus readiness: the critical distinction

Kubernetes provides two health check mechanisms for a running container, each with different semantics:

Liveness probe: "Is this process alive and worth keeping?" If a liveness probe fails failureThreshold times in a row, Kubernetes kills the container and restarts it according to the pod's restart policy. Use this for stuck processes: deadlocks, infinite loops, unrecoverable errors. Liveness probes should be cheap and check only internal state. They should not check dependencies: if the database is down, the process is not stuck, it is waiting. Killing and restarting it won't fix the database.

Readiness probe: "Is this process ready to serve traffic?" If a readiness probe fails, Kubernetes removes the pod from the service endpoint list — traffic stops being routed to it — but does not kill the pod. The pod stays running and can recover. Use this to indicate whether the service can currently handle requests, including dependency availability.

livenessProbe:
  httpGet:
    path: /actuator/health/liveness
    port: 8080
  initialDelaySeconds: 30
  periodSeconds: 10
  failureThreshold: 3

readinessProbe:
  httpGet:
    path: /actuator/health/readiness
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 5
  failureThreshold: 3
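To make the difference in consequences concrete, here is a minimal sketch of how the two probes react to the same run of failures. It is a simplified model, not the kubelet's actual implementation: ProbeKind and ProbeTracker are hypothetical names, and it assumes the default successThreshold of 1, so a single success resets the failure count.

```java
/** What a simplified kubelet would do once a probe crosses its failure threshold. */
enum ProbeKind {
    LIVENESS("restart container"),
    READINESS("remove pod from Service endpoints");

    final String actionOnFailure;

    ProbeKind(String actionOnFailure) {
        this.actionOnFailure = actionOnFailure;
    }
}

/** Hypothetical model: counts consecutive failures, the way failureThreshold is applied. */
final class ProbeTracker {
    private final ProbeKind kind;
    private final int failureThreshold;
    private int consecutiveFailures = 0;

    ProbeTracker(ProbeKind kind, int failureThreshold) {
        this.kind = kind;
        this.failureThreshold = failureThreshold;
    }

    /** Records one probe result; returns the action to take, or "none". */
    String record(boolean success) {
        if (success) {
            consecutiveFailures = 0; // assumes successThreshold = 1
            return "none";
        }
        consecutiveFailures++;
        return consecutiveFailures >= failureThreshold ? kind.actionOnFailure : "none";
    }
}
```

With the configuration above, a stuck process is restarted after roughly 30 seconds of consecutive liveness failures (3 failures, 10 seconds apart), while a pod that cannot serve traffic is taken out of rotation after roughly 15 seconds (3 failures, 5 seconds apart).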

What each probe should check

Liveness endpoint (/actuator/health/liveness): return 200 if the process is alive and able to handle requests in principle. Check nothing external. A Spring Boot application using Actuator exposes this endpoint automatically when it detects that it is running on Kubernetes (elsewhere, set management.endpoint.health.probes.enabled=true). It reports the application's internal LivenessState, which stays CORRECT unless the application has published an event marking its state as BROKEN. This endpoint should always be fast (< 50ms).

Readiness endpoint (/actuator/health/readiness): return 200 if the service can currently serve traffic correctly. Check:

Database connection pool: can you acquire and release a connection?
Required external services: can you reach critical dependencies (not all dependencies — only those you cannot function without)?
Internal state: are required caches warmed? Are required resources loaded?

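// Package names below are those of Spring Boot 2.x/3.x.
import java.sql.Connection;
import java.sql.SQLException;
import javax.sql.DataSource;
import org.springframework.boot.actuate.health.Health;
import org.springframework.boot.actuate.health.HealthIndicator;
import org.springframework.stereotype.Component;

// Registered under the name "database". To make it count toward readiness, add it to the group:
// management.endpoint.health.group.readiness.include=readinessState,database
@Component
public class DatabaseHealthIndicator implements HealthIndicator {
    private final DataSource dataSource;

    public DatabaseHealthIndicator(DataSource dataSource) {
        this.dataSource = dataSource;
    }

    @Override
    public Health health() {
        try (Connection conn = dataSource.getConnection()) {
            // Validate connection is usable, not just that pool has connections
            if (!conn.isValid(1)) { // 1 second timeout
                return Health.down().withDetail("error", "connection validation failed").build();
            }
            // getActiveConnections()/getIdleConnections() are helpers not shown here;
            // they would read the metrics your connection pool exposes.
            return Health.up()
                .withDetail("pool.active", getActiveConnections())
                .withDetail("pool.idle", getIdleConnections())
                .build();
        } catch (SQLException e) {
            return Health.down()
                .withDetail("error", e.getMessage())
                .build();
        }
    }
}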
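A readiness check that hangs is nearly as harmful as one that fails: acquiring a connection from an exhausted pool can block for the pool's full connection timeout, while the Kubernetes probe itself gives up after timeoutSeconds (1 second by default). One way to keep the endpoint responsive is to put an upper bound on every dependency check. The sketch below uses only the JDK; BoundedCheck is a hypothetical helper, not part of Spring Boot or any pool library.

```java
import java.time.Duration;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.BooleanSupplier;

/** Hypothetical helper: runs one readiness check with an upper bound on its duration. */
final class BoundedCheck {
    private BoundedCheck() {}

    /** Returns true only if the check finishes within the timeout and reports success. */
    static boolean passes(BooleanSupplier check, Duration timeout) {
        CompletableFuture<Boolean> result = CompletableFuture.supplyAsync(check::getAsBoolean);
        try {
            return result.get(timeout.toMillis(), TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // Marks the future as cancelled; it does not interrupt the running check.
            result.cancel(true);
            return false;
        } catch (ExecutionException e) {
            return false; // the check threw, so treat the dependency as unusable
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt for the caller
            return false;
        }
    }
}
```

A timed-out check keeps running in the background until it returns, so this bounds the probe's response time, not the work itself. The underlying call still needs its own timeout, such as the one passed to isValid above.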

Dependency health: what to check and what to skip

The readiness probe should include critical dependencies — those without which the service cannot serve any meaningful traffic — and exclude non-critical ones.

If Order Service's primary database is unavailable, Order Service should be marked not-ready. It cannot process orders without it.

If Order Service's recommendation cache (Redis) is unavailable, Order Service should remain ready. Recommendations are a non-critical feature; the service degrades gracefully without them. Including Redis in the readiness check would cause Order Service pods to fail readiness when Redis has issues — unnecessarily removing healthy capacity from the load balancer.

Document which dependencies are critical for each service. This decision belongs in the service's architecture decision record, not in an undocumented health check configuration.
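The critical versus non-critical split can also be made explicit in code, so the decision is visible where the check runs. This is a sketch under stated assumptions: ReadinessReport and the dependency names are hypothetical, and the checks are plain BooleanSupplier callbacks rather than any framework's health API.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BooleanSupplier;

/** Hypothetical readiness aggregator: only critical dependencies can fail readiness. */
final class ReadinessReport {
    private final Map<String, BooleanSupplier> criticalChecks = new LinkedHashMap<>();
    private final Map<String, BooleanSupplier> optionalChecks = new LinkedHashMap<>();

    ReadinessReport critical(String name, BooleanSupplier check) {
        criticalChecks.put(name, check);
        return this;
    }

    ReadinessReport optional(String name, BooleanSupplier check) {
        optionalChecks.put(name, check);
        return this;
    }

    /** Status for the readiness endpoint: 200 if every critical check passes, else 503. */
    int httpStatus() {
        return failing(criticalChecks).isEmpty() ? 200 : 503;
    }

    /** Non-critical dependencies that are down: report them, but never fail readiness. */
    List<String> degraded() {
        return failing(optionalChecks);
    }

    private static List<String> failing(Map<String, BooleanSupplier> checks) {
        List<String> names = new ArrayList<>();
        checks.forEach((name, check) -> {
            if (!check.getAsBoolean()) {
                names.add(name);
            }
        });
        return names;
    }
}
```

For the Order Service example, the primary database would be registered with critical(...) and the Redis recommendation cache with optional(...). A Redis outage then shows up in degraded(), where it can feed a log line or an alert, while the pod keeps answering 200 and stays in rotation.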

Startup probes for slow-starting services

Some services — particularly those with heavy initialization (loading large models, warming caches from a database) — take a long time to be ready. If your liveness probe starts checking too early, Kubernetes kills the pod before it finishes starting.

The startup probe delays liveness checking until startup is complete:

startupProbe:
  httpGet:
    path: /actuator/health/liveness
    port: 8080
  failureThreshold: 30      # try for up to 5 minutes (30 x 10s)
  periodSeconds: 10
  # After startup probe succeeds, liveness and readiness probes take over

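While the startup probe is running, liveness and readiness probes are disabled, so the container is not killed for failing liveness. Once the startup probe succeeds, it stops, and the liveness and readiness probes take over normally. The time allowed is bounded, not unlimited: if the probe is still failing after failureThreshold × periodSeconds (here 30 × 10 seconds, or 5 minutes), Kubernetes kills the container and applies the restart policy. Size the threshold to cover your worst-case startup time.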
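The startup budget is simple arithmetic, and it is worth computing it from a measured worst-case startup time instead of guessing. A minimal sketch, where StartupBudget is a hypothetical helper and the result is approximate because it ignores initialDelaySeconds and per-probe timeouts:

```java
import java.time.Duration;

/** Hypothetical helper for sizing a startup probe. */
final class StartupBudget {
    private StartupBudget() {}

    /** Approximate time a container gets to start: failureThreshold * periodSeconds. */
    static Duration budget(int failureThreshold, int periodSeconds) {
        return Duration.ofSeconds((long) failureThreshold * periodSeconds);
    }

    /** Smallest failureThreshold whose budget covers the worst-case startup time. */
    static int thresholdFor(Duration worstCaseStartup, int periodSeconds) {
        long periodMillis = periodSeconds * 1000L;
        long needed = (worstCaseStartup.toMillis() + periodMillis - 1) / periodMillis; // ceiling division
        return (int) Math.max(1, needed);
    }
}
```

For example, budget(30, 10) gives the 5 minutes configured above, and a service measured at 95 seconds in the worst case needs thresholdFor(Duration.ofSeconds(95), 10), which is 10. In practice you would add headroom on top of that.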

Beyond process health: synthetic monitoring

Kubernetes health checks verify that individual service processes are functional. They don't verify that end-to-end user flows are working. Synthetic monitoring — automated tests that exercise real user journeys through your production system — fills this gap.

A synthetic test for checkout: every five minutes, place a test order with a dedicated test user, verify that payment processes, verify that a confirmation event is published. If any step fails, alert. This catches integration failures (a bug in how Order Service and Payment Service interact) that health checks on individual services would never surface.

Implement synthetic monitoring for your three to five most critical user flows. Keep the tests simple and repeatable: each run should clean up its own test data so the next run starts from the same state. Alert directly to your on-call rotation. A failed synthetic test is as urgent as a failed health check.
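The shape of such a test is a short ordered list of steps that stops at the first failure and always cleans up. The sketch below shows that shape only: SyntheticCheck and the step names are hypothetical, and the real steps would call your own APIs with the dedicated test user.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;
import java.util.function.BooleanSupplier;

/** Hypothetical synthetic check: ordered steps, stop at the first failure, always clean up. */
final class SyntheticCheck {
    private final Map<String, BooleanSupplier> steps = new LinkedHashMap<>();
    private final Runnable cleanup;

    SyntheticCheck(Runnable cleanup) {
        this.cleanup = cleanup;
    }

    SyntheticCheck step(String name, BooleanSupplier action) {
        steps.put(name, action);
        return this;
    }

    /** Runs the journey once; returns the first failing step's name, or empty on success. */
    Optional<String> run() {
        try {
            for (Map.Entry<String, BooleanSupplier> step : steps.entrySet()) {
                boolean ok;
                try {
                    ok = step.getValue().getAsBoolean();
                } catch (RuntimeException e) {
                    ok = false; // an exception in a step counts as a failed step
                }
                if (!ok) {
                    return Optional.of(step.getKey());
                }
            }
            return Optional.empty();
        } finally {
            cleanup.run(); // remove test data whether or not the journey succeeded
        }
    }
}
```

For the checkout flow, the steps would be placing the order, verifying the payment, and verifying the confirmation event, in that order. A scheduler such as ScheduledExecutorService can call run() every five minutes and page on-call whenever the result is non-empty, with the failing step name in the alert.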
