Distributed Tracing: How to Find Where Your Request Actually Failed

by Arif Ikhsanudin, Backend Developer

The debugging experience without tracing

A user files a support ticket: "My checkout failed at 14:32 on Tuesday." You look in Order Service logs. You find an error, but the error message is "upstream service error." You ask the Inventory team to check their logs. They find a timeout in the DB query logs. Was that the cause? You check the database slow query log. The timestamps don't quite align. You're not sure if you're looking at the same request or a different one from around the same time.

Thirty minutes later, with input from three teams and five log files, you have a theory about what happened. You're not certain.

This is the debugging experience in microservices without distributed tracing — and it's the reason distributed tracing is not optional infrastructure. It is the minimum viable observability in a system where a single user request crosses multiple service boundaries.

How distributed tracing works

The core concept: every request gets a unique trace ID when it enters the system. That trace ID is propagated through every service-to-service call via HTTP headers. Each service records spans — timed operations within the service — tagged with the trace ID. A tracing backend collects these spans and assembles them into a complete trace: a timeline showing which services handled the request, in what order, and how long each step took.

The W3C Trace Context specification, a W3C Recommendation that defines the traceparent and tracestate HTTP headers, is the modern standard for trace ID propagation. OpenTelemetry is the standard instrumentation framework; its SDKs and agents propagate context in this format by default.
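To make the header format concrete, here is a minimal sketch of a traceparent value in plain Java. The TraceParent record and its method names are made up for illustration and are not part of OpenTelemetry, which handles all of this for you. The sketch assumes version 00 of the format only: a 32-hex-character trace ID, a 16-hex-character parent span ID, and a flags byte whose lowest bit means "sampled".

```java
import java.security.SecureRandom;
import java.util.HexFormat;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative model of a version-00 traceparent header value (hypothetical type). */
record TraceParent(String traceId, String parentId, boolean sampled) {

    // version "00", 32 hex trace ID, 16 hex parent span ID, 2 hex flags, all lowercase
    private static final Pattern FORMAT =
        Pattern.compile("00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})");
    private static final SecureRandom RANDOM = new SecureRandom();

    /** Parses a header value; empty if it is malformed or an ID is all zeros. */
    static Optional<TraceParent> parse(String header) {
        if (header == null) return Optional.empty();
        Matcher m = FORMAT.matcher(header.trim());
        if (!m.matches()) return Optional.empty();
        String traceId = m.group(1);
        String parentId = m.group(2);
        if (traceId.equals("0".repeat(32)) || parentId.equals("0".repeat(16))) {
            return Optional.empty();
        }
        boolean sampled = (Integer.parseInt(m.group(3), 16) & 0x01) == 1;
        return Optional.of(new TraceParent(traceId, parentId, sampled));
    }

    /** Starts a new trace, as an edge service would when no header arrives. */
    static TraceParent newRoot(boolean sampled) {
        return new TraceParent(randomHex(16), randomHex(8), sampled);
    }

    /** Same trace ID, fresh span ID: the value a service sends on an outgoing call. */
    TraceParent child() {
        return new TraceParent(traceId, randomHex(8), sampled);
    }

    String toHeader() {
        return "00-" + traceId + "-" + parentId + "-" + (sampled ? "01" : "00");
    }

    private static String randomHex(int bytes) {
        byte[] buf = new byte[bytes];
        RANDOM.nextBytes(buf);
        return HexFormat.of().formatHex(buf);
    }
}
```

The part worth noticing is child(): the trace ID never changes as the request moves between services, while each hop contributes its own span ID. That shared trace ID is what lets a backend stitch spans from different services into one trace.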

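// Spring Boot with OpenTelemetry auto-instrumentation
// No code changes needed: attach the agent at JVM startup

// In Dockerfile or deployment (the agent jar path is an example):
// JAVA_TOOL_OPTIONS="-javaagent:/otel-javaagent.jar"   (read by the JVM itself; JAVA_OPTS only works if your start script passes it on)
// OTEL_SERVICE_NAME=order-service
// OTEL_TRACES_EXPORTER=otlp
// OTEL_EXPORTER_OTLP_PROTOCOL=grpc                      (port 4317 is gRPC; agent 2.x defaults to http/protobuf on 4318)
// OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4317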

The OpenTelemetry Java agent auto-instruments Spring Boot, JDBC, Kafka clients, and HTTP clients — no manual span creation required for common operations. The agent injects and propagates trace context automatically.

What a trace shows you

In Jaeger or Grafana Tempo (two common backends for OpenTelemetry traces), a trace looks like a Gantt chart: horizontal bars representing spans, nested to show parent-child relationships, with timestamps and duration.

[Order Service] POST /orders                    0ms - 245ms
  [Order Service] validate request              0ms - 5ms
  [Order Service] HTTP GET /users/{id}          5ms - 45ms
    [User Service] GET /users/123               5ms - 45ms
      [User Service] DB SELECT users            8ms - 42ms  ← 34ms query
  [Order Service] HTTP POST /inventory/reserve  45ms - 240ms
    [Inventory Service] POST /reserve           45ms - 240ms
      [Inventory Service] DB UPDATE inventory   48ms - 238ms ← 190ms, lock wait

From this trace, you can see immediately that the UPDATE in Inventory Service took 190ms, most of it waiting on a lock, and that this one span accounts for the bulk of the 245ms total. Without the trace, you'd be looking at Order Service logs showing a 245ms request with no internal detail.
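The same reading can be done mechanically. The sketch below computes each span's self time (its duration minus the time spent in its direct children) and picks the largest one. SpanRecord and TraceAnalysis are hypothetical types for illustration, not a tracing backend's API, and the subtraction assumes a span's children run one after another rather than in parallel, which holds for the trace above.

```java
import java.util.Comparator;
import java.util.List;

/** Hypothetical minimal span model; times are milliseconds from trace start. */
record SpanRecord(String id, String parentId, String name, long startMs, long endMs) {
    long durationMs() { return endMs - startMs; }
}

final class TraceAnalysis {

    /** Self time = span duration minus the durations of its direct children. */
    static long selfTimeMs(SpanRecord span, List<SpanRecord> all) {
        long childTime = all.stream()
            .filter(s -> span.id().equals(s.parentId()))
            .mapToLong(SpanRecord::durationMs)
            .sum();
        return span.durationMs() - childTime;
    }

    /** The span where the most time was spent that no child span explains. */
    static SpanRecord slowestBySelfTime(List<SpanRecord> all) {
        return all.stream()
            .max(Comparator.comparingLong((SpanRecord s) -> selfTimeMs(s, all)))
            .orElseThrow();
    }

    /** The example trace from the text, with made-up span IDs. */
    static List<SpanRecord> exampleTrace() {
        return List.of(
            new SpanRecord("a", null, "POST /orders", 0, 245),
            new SpanRecord("b", "a", "validate request", 0, 5),
            new SpanRecord("c", "a", "HTTP GET /users/{id}", 5, 45),
            new SpanRecord("d", "c", "GET /users/123", 5, 45),
            new SpanRecord("e", "d", "DB SELECT users", 8, 42),
            new SpanRecord("f", "a", "HTTP POST /inventory/reserve", 45, 240),
            new SpanRecord("g", "f", "POST /reserve", 45, 240),
            new SpanRecord("h", "g", "DB UPDATE inventory", 48, 238));
    }
}
```

For the example trace, slowestBySelfTime returns the "DB UPDATE inventory" span with 190ms of self time, while the root span's own self time is only 5ms. A long total duration on a parent span says little by itself; the self time is what points at the culprit.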

Sampling strategy

Recording every span for every request in a high-traffic system is expensive. Sampling is the practice of recording only a fraction of traces.

Head-based sampling (decide at trace start): sample a fixed share of requests, for example 10%. It is simple, but failures, which are the traces you most want, are sampled at the same rate as successes, so a rare failure may never be captured.
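One common way to make a head-based decision is to derive it from the trace ID itself, so the result is repeatable for a given trace. The sketch below illustrates that idea under stated assumptions: HeadSampler is a hypothetical class, not the OpenTelemetry SDK's sampler, and it assumes trace IDs are 32 hex characters whose low 64 bits are effectively random.

```java
/** Illustrative ratio sampler keyed on the trace ID (hypothetical class). */
final class HeadSampler {
    private final double ratio;
    private final long threshold;

    HeadSampler(double ratio) {
        if (ratio < 0.0 || ratio > 1.0) {
            throw new IllegalArgumentException("ratio must be between 0 and 1");
        }
        this.ratio = ratio;
        // Share of the non-negative long range that falls below the cut-off
        this.threshold = (long) (ratio * Long.MAX_VALUE);
    }

    /** Same trace ID in, same decision out, with no randomness at call time. */
    boolean shouldSample(String traceIdHex) {
        if (traceIdHex == null || traceIdHex.length() != 32) {
            throw new IllegalArgumentException("expected a 32-character hex trace ID");
        }
        if (ratio >= 1.0) return true;
        // Low 64 bits of the trace ID, with the sign bit cleared
        long low = Long.parseUnsignedLong(traceIdHex.substring(16), 16) & Long.MAX_VALUE;
        return low < threshold;
    }
}
```

With new HeadSampler(0.10), roughly one trace in ten is kept. In practice the decision is made once at the edge and carried downstream in the sampled flag of the traceparent header, so downstream services follow it instead of deciding again.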

Tail-based sampling (decide after the request completes): keep 100% of traces with errors or high latency, and 1% of everything else. This captures exactly the cases you care about most. It requires a collector that can buffer a trace's spans and apply the decision after the fact, such as the OpenTelemetry Collector with its tail_sampling processor (shipped in the contrib distribution) or Grafana Alloy in front of Tempo. All spans of a trace must reach the same collector instance for the decision to be correct.

# OpenTelemetry Collector: tail-based sampling config
processors:
  tail_sampling:
    decision_wait: 10s
    policies:
    - name: errors-policy
      type: status_code
      status_code: {status_codes: [ERROR]}
    - name: slow-traces-policy
      type: latency
      latency: {threshold_ms: 1000}
    - name: probabilistic-policy
      type: probabilistic
      probabilistic: {sampling_percentage: 1}
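The logic this config expresses can be sketched in a few lines of Java. This is an illustration of the decision rule only, not the collector's implementation: FinishedSpan and TailSampler are hypothetical names, the trace duration is taken as earliest start to latest end, and the random draw is passed in so the behavior is easy to test.

```java
import java.util.List;

/** Hypothetical buffered span: times in milliseconds, plus an error flag. */
record FinishedSpan(String traceId, long startMs, long endMs, boolean error) {}

/** Illustrative tail-sampling rule: errors, slow traces, then a small random share. */
final class TailSampler {
    private final long latencyThresholdMs;
    private final double fallbackRatio;

    TailSampler(long latencyThresholdMs, double fallbackRatio) {
        this.latencyThresholdMs = latencyThresholdMs;
        this.fallbackRatio = fallbackRatio;
    }

    /**
     * Called once per trace after the decision wait has passed.
     * randomDraw is a value in [0, 1), e.g. from ThreadLocalRandom.
     */
    boolean keep(List<FinishedSpan> spans, double randomDraw) {
        if (spans.isEmpty()) return false;
        boolean anyError = spans.stream().anyMatch(FinishedSpan::error);
        long start = spans.stream().mapToLong(FinishedSpan::startMs).min().getAsLong();
        long end = spans.stream().mapToLong(FinishedSpan::endMs).max().getAsLong();
        boolean slow = (end - start) > latencyThresholdMs;
        // A trace is kept if any one of the three policies says so
        return anyError || slow || randomDraw < fallbackRatio;
    }
}
```

new TailSampler(1000, 0.01) mirrors the thresholds in the config above. The cost is visible in the signature: keep() needs the whole list of spans, which is why the collector has to hold every trace in memory for the length of decision_wait before it can decide.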

For most teams: start with 10% head-based sampling. Move to tail-based sampling once your tracing infrastructure is stable and you understand the data volume.

Adding custom spans and attributes

Auto-instrumentation covers framework-level operations. For business-logic-level visibility — "why did the inventory reservation fail for this specific item?" — add custom spans:

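import io.opentelemetry.api.GlobalOpenTelemetry;
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.StatusCode;
import io.opentelemetry.api.trace.Tracer;
import io.opentelemetry.context.Scope;

// Obtain once (for example as a field) and reuse; the name identifies the instrumentation scope
Tracer tracer = GlobalOpenTelemetry.getTracer("inventory-service");

// Inside the method that performs the reservation:
Span span = tracer.spanBuilder("inventory.reserve")
    .setAttribute("item.id", itemId)
    .setAttribute("requested.quantity", quantity)
    .startSpan();

// makeCurrent() puts the span in the current context, so spans created inside become its children
try (Scope scope = span.makeCurrent()) {
    int reserved = inventoryRepository.reserve(itemId, quantity);
    span.setAttribute("reserved.quantity", reserved);
    span.setAttribute("reservation.success", reserved >= quantity);
} catch (Exception e) {
    // Attach the exception as a span event and mark the span as failed
    span.recordException(e);
    span.setStatus(StatusCode.ERROR);
    throw e;
} finally {
    span.end(); // always end the span, otherwise it is never exported
}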

Custom attributes let you search traces by business attributes (for example, "show me all traces where item.id = sku-789 and reservation.success = false"), which turns debugging from timestamp archaeology into a targeted query.
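Conceptually, such a search is a filter over span attributes. The sketch below shows the idea on an in-memory list; AttributedSpan and TraceSearch are hypothetical names, and a real backend such as Jaeger or Tempo runs the equivalent through its own search UI or query language against indexed storage.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;

/** Hypothetical span carrying the custom attributes set in the service code. */
record AttributedSpan(String traceId, String name, Map<String, Object> attributes) {}

final class TraceSearch {

    /** IDs of traces with at least one span whose attributes include every wanted pair. */
    static Set<String> traceIdsMatching(List<AttributedSpan> spans, Map<String, Object> wanted) {
        return spans.stream()
            .filter(s -> s.attributes().entrySet().containsAll(wanted.entrySet()))
            .map(AttributedSpan::traceId)
            .collect(Collectors.toSet());
    }
}
```

Calling traceIdsMatching(spans, Map.of("item.id", "sku-789", "reservation.success", false)) returns only the traces where that specific reservation failed. The query is only as good as the attributes recorded, which is the argument for setting them consistently on the spans that matter.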

Start with tracing on your most critical user paths. Once the infrastructure is in place and teams are familiar with reading traces, expand coverage. The infrastructure investment is front-loaded; the debugging time savings accrue continuously.
