Distributed Tracing with OpenTelemetry - Microservice Troubleshooting

Can Kaya

Security Specialist

March 21, 2026 · 14 min read

In a microservice architecture, a single user request passes through multiple services, and latency or failure at any point affects the entire chain. Finding which service creates the bottleneck through traditional log analysis can take hours. With OpenTelemetry distributed tracing, you can track each request's journey across services end-to-end via trace IDs and pinpoint latency down to the millisecond. This guide covers everything from OpenTelemetry architecture to SDK integration, Jaeger setup to production best practices.

What Is Distributed Tracing and Why Do You Need It?

Distributed tracing is an observability method that tracks a request across all services in a distributed system. In a monolithic application, a single stack trace often reveals the problem. In microservices, a request may travel from the API gateway through the auth, order, payment, and notification services, and no single service sees the whole path.

Each service produces its own logs, but correlating these logs is difficult. Distributed tracing solves this with trace IDs and spans:

Concept              Description                                   Example
Trace                The end-to-end journey of a request           GET /api/orders/123 across all services
Span                 A single unit of work within a trace          order-service: DB query (45ms)
Trace ID             Unique identifier linking all spans together  4bf92f3577b34da6a3ce929d0e0e4736
Context Propagation  Carrying trace information between services   traceparent HTTP header
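
To make these terms concrete, here is a minimal sketch of how spans that share a trace ID can be grouped into a trace and searched for the likeliest bottleneck. The SpanRecord shape, the helper names, and the sample data are assumptions made for illustration; they are not the OpenTelemetry SDK types, and the self-time calculation assumes child spans run one after another.

```typescript
// Illustrative span shape (hypothetical, not the OTel SDK type).
interface SpanRecord {
  traceId: string;
  spanId: string;
  parentSpanId?: string; // undefined for the root span
  name: string;
  durationMs: number;
}

// Group spans reported by different services using their shared trace ID.
function groupByTrace(spans: SpanRecord[]): Record<string, SpanRecord[]> {
  const traces: Record<string, SpanRecord[]> = {};
  for (const span of spans) {
    if (!traces[span.traceId]) traces[span.traceId] = [];
    traces[span.traceId].push(span);
  }
  return traces;
}

// Self time = span duration minus the time spent in its direct children.
// Assumes children run sequentially; parallel children would need interval math.
function slowestBySelfTime(trace: SpanRecord[]): SpanRecord | undefined {
  let worst: SpanRecord | undefined;
  let worstSelf = -1;
  for (const span of trace) {
    let childTime = 0;
    for (const other of trace) {
      if (other.parentSpanId === span.spanId) childTime += other.durationMs;
    }
    const selfTime = span.durationMs - childTime;
    if (selfTime > worstSelf) {
      worstSelf = selfTime;
      worst = span;
    }
  }
  return worst;
}

// Made-up sample data for one request.
const sampleSpans: SpanRecord[] = [
  { traceId: 't1', spanId: 'a', name: 'api-gateway: GET /api/orders/123', durationMs: 120 },
  { traceId: 't1', spanId: 'b', parentSpanId: 'a', name: 'order-service: getOrder', durationMs: 80 },
  { traceId: 't1', spanId: 'c', parentSpanId: 'b', name: 'order-service: DB query', durationMs: 45 },
  { traceId: 't1', spanId: 'd', parentSpanId: 'b', name: 'payment-service: status', durationMs: 10 },
];
const bottleneck = slowestBySelfTime(groupByTrace(sampleSpans)['t1']);
```

In this sample, the DB query span has the largest self time, which is the kind of answer a tracing backend surfaces visually in its timeline view.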

OpenTelemetry Architecture

OpenTelemetry (OTel) is a vendor-agnostic observability framework hosted by the CNCF. It supports three signal types: traces, metrics, and logs. The previously separate OpenTracing and OpenCensus projects merged to form OpenTelemetry in 2019.

💡 Why OpenTelemetry? Send data to any backend - Jaeger, Zipkin, Datadog, Grafana Tempo, or AWS X-Ray - without vendor lock-in. SDKs are available in 11+ languages, and auto-instrumentation lets you start tracing without code changes.

Core Components

  • SDK (Software Development Kit): Library added to your application. Creates spans, handles context propagation, and sends telemetry data to the exporter.
  • OTel Collector: Standalone service that receives, processes, and exports telemetry data to backends. Consists of receiver, processor, and exporter pipelines.
  • Backend (Jaeger, Tempo, Zipkin): System that stores and visualizes trace data. Jaeger is an open-source, production-ready option.
  • Auto-Instrumentation: Libraries that automatically trace HTTP, gRPC, database, and message queue calls. No code changes required.

Jaeger + OTel Collector Docker Compose Setup

Jaeger is a distributed tracing backend originally developed at Uber and now a CNCF graduated project. Combined with the OTel Collector, it collects and visualizes trace data from your applications.

docker-compose.yml
version: "3.8"
services:
  jaeger:
    image: jaegertracing/all-in-one:1.54
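    ports:
      - "16686:16686"   # Jaeger UI
    # The OTLP ports (4317/4318) are not published on the host here: the
    # collector reaches Jaeger at jaeger:4317 over the tracing network, and
    # publishing them would clash with the otel-collector mappings below.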
    environment:
      - COLLECTOR_OTLP_ENABLED=true
    networks:
      - tracing

  otel-collector:
    image: otel/opentelemetry-collector-contrib:0.96.0
    command: ["--config=/etc/otel-collector-config.yaml"]
    volumes:
      - ./otel-collector-config.yaml:/etc/otel-collector-config.yaml
    ports:
      - "4317:4317"     # OTLP gRPC receiver
      - "4318:4318"     # OTLP HTTP receiver
      - "8889:8889"     # Prometheus exporter port (only served if a prometheus exporter is configured)
    depends_on:
      - jaeger
    networks:
      - tracing

networks:
  tracing:
    driver: bridge

The OTel Collector configuration file defines the receiver, processor, and exporter pipeline:

otel-collector-config.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch:
    timeout: 5s
    send_batch_size: 1024
  memory_limiter:
    check_interval: 1s
    limit_mib: 512
    spike_limit_mib: 128
  tail_sampling:
    decision_wait: 10s
    policies:
      - name: errors-policy
        type: status_code
        status_code: { status_codes: [ERROR] }
      - name: slow-traces
        type: latency
        latency: { threshold_ms: 2000 }
      - name: percentage-sample
        type: probabilistic
        probabilistic: { sampling_percentage: 10 }

exporters:
  otlp/jaeger:
    endpoint: jaeger:4317
    tls:
      insecure: true

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, tail_sampling, batch]
      exporters: [otlp/jaeger]

⚠️ Important: In production, use the tail_sampling processor to retain only errored, slow, or a percentage of traces. Storing all traces rapidly increases storage costs. The configuration above retains 100% of errored requests, requests slower than 2 seconds, and 10% of the rest.
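
The way these three policies combine can be sketched as a small decision function: a completed trace is kept if any policy matches. This is an illustration of the logic only, not the collector's implementation; the CompletedTrace and SamplingPolicy shapes and the shouldKeep name are hypothetical, and the random source is injectable so the behavior can be checked deterministically.

```typescript
// Hypothetical summary of a finished trace, as a tail sampler would see it.
interface CompletedTrace {
  hasError: boolean;  // at least one span ended with status ERROR
  durationMs: number; // end-to-end latency of the trace
}

// Mirrors the values in the configuration above (threshold_ms, sampling_percentage).
interface SamplingPolicy {
  latencyThresholdMs: number;
  samplingPercentage: number; // 0-100
}

// Keep the trace if ANY policy matches: error, slow, or the random percentage.
function shouldKeep(
  trace: CompletedTrace,
  policy: SamplingPolicy,
  random: () => number = Math.random,
): boolean {
  if (trace.hasError) return true;
  if (trace.durationMs > policy.latencyThresholdMs) return true;
  return random() * 100 < policy.samplingPercentage;
}

const examplePolicy: SamplingPolicy = { latencyThresholdMs: 2000, samplingPercentage: 10 };
const keepErrored = shouldKeep({ hasError: true, durationMs: 40 }, examplePolicy);
const keepSlow = shouldKeep({ hasError: false, durationMs: 3500 }, examplePolicy);
```

Because the decision needs the whole trace, the collector has to buffer spans for the decision_wait period before it can apply it, which is why memory_limiter sits first in the pipeline.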

OpenTelemetry Setup in Node.js Applications

The OpenTelemetry SDK for Node.js offers auto-instrumentation support. Express, Fastify, HTTP, gRPC, PostgreSQL, Redis, and many more libraries are automatically traced.

terminal
npm install @opentelemetry/sdk-node \
  @opentelemetry/auto-instrumentations-node \
  @opentelemetry/exporter-trace-otlp-grpc \
  @opentelemetry/resources \
  @opentelemetry/semantic-conventions

Load the tracing configuration at the very beginning of your application (before other imports):

tracing.ts
import { NodeSDK } from '@opentelemetry/sdk-node';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-grpc';
import { Resource } from '@opentelemetry/resources';
import { ATTR_SERVICE_NAME, ATTR_SERVICE_VERSION }
  from '@opentelemetry/semantic-conventions';

const sdk = new NodeSDK({
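  // SDK 1.x API. On SDK 2.x, build the resource with
  // resourceFromAttributes({...}) from '@opentelemetry/resources' instead.
  resource: new Resource({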
    [ATTR_SERVICE_NAME]: 'order-service',
    [ATTR_SERVICE_VERSION]: '1.2.0',
    'deployment.environment': process.env.NODE_ENV || 'development',
  }),
  traceExporter: new OTLPTraceExporter({
    url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT || 'http://localhost:4317',
  }),
  instrumentations: [
    getNodeAutoInstrumentations({
      '@opentelemetry/instrumentation-fs': { enabled: false },
    }),
  ],
});

sdk.start();
console.log('OpenTelemetry tracing initialized');

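process.on('SIGTERM', () => {
  // Flush pending spans before exiting; exit even if the flush fails
  sdk.shutdown()
    .catch((err) => console.error('Error shutting down tracing', err))
    .finally(() => process.exit(0));
});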

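Preload the compiled tracing file with --require so that it runs before any application code:

terminal
# Compile TypeScript first, then point --require at the compiled tracing file
# (the dist/ path assumes tracing.ts is built alongside main.ts; adjust as needed)
node --require ./dist/tracing.js dist/main.js

# Or with environment variables
OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4317 \
node --require ./dist/tracing.js dist/main.js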

OpenTelemetry Setup in Python Applications

The Python SDK provides auto-instrumentation for popular libraries including Flask, Django, FastAPI, SQLAlchemy, Redis, and requests.

terminal
pip install opentelemetry-distro opentelemetry-exporter-otlp
opentelemetry-bootstrap -a install

terminal
# Launch with zero-code instrumentation
OTEL_SERVICE_NAME=payment-service \
OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4317 \
opentelemetry-instrument python app.py

Creating Manual Spans and Enrichment

Auto-instrumentation automatically traces HTTP and database calls, but you need to create manual spans to trace business logic operations (payment validation, inventory checks, price calculations).

order.service.ts
import { trace, SpanStatusCode } from '@opentelemetry/api';
// inventoryService and paymentService are this application's own clients
// (their imports are omitted here).

const tracer = trace.getTracer('order-service');

async function processOrder(orderId: string) {
  return tracer.startActiveSpan('processOrder', async (span) => {
    try {
      // Add business logic info to span
      span.setAttribute('order.id', orderId);
      span.setAttribute('order.source', 'web');

      // Child span: inventory check
      const stock = await tracer.startActiveSpan(
        'checkInventory',
        async (childSpan) => {
          try {
            const result = await inventoryService.check(orderId);
            childSpan.setAttribute('inventory.available', result.available);
            return result;
          } finally {
            childSpan.end(); // end the span even if the check throws
          }
        }
      );

      // Child span: payment processing
      await tracer.startActiveSpan(
        'processPayment',
        async (paymentSpan) => {
          try {
            paymentSpan.setAttribute('payment.method', 'credit_card');
            await paymentService.charge(orderId);
          } finally {
            paymentSpan.end(); // end the span even if the charge fails
          }
        }
      );

      span.setStatus({ code: SpanStatusCode.OK });
      return stock;
    } catch (error) {
      // catch variables are typed unknown in strict TypeScript
      const err = error instanceof Error ? error : new Error(String(error));
      span.setStatus({ code: SpanStatusCode.ERROR, message: err.message });
      span.recordException(err);
      throw error;
    } finally {
      span.end();
    }
  });
}

💡 Best Practice: Never add sensitive data (credit card numbers, passwords, personal information) to span attributes. Only add business logic information that helps with troubleshooting (order ID, user type, payment method).
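
One way to enforce this rule is to pass attributes through a small filter before they reach the span. The sketch below is an assumption-laden example, not part of OpenTelemetry: the sanitizeAttributes helper, the key deny-list, and the 256-character limit are all hypothetical choices you would tune for your own data.

```typescript
type AttributeValue = string | number | boolean;

// Hypothetical deny-list: keys matching this pattern are dropped entirely.
const SENSITIVE_KEY = /(password|secret|token|card|cvv|ssn|email)/i;
// Hypothetical cap on string values so large payloads never become attributes.
const MAX_VALUE_LENGTH = 256;

function sanitizeAttributes(
  attrs: Record<string, AttributeValue>,
): Record<string, AttributeValue> {
  const safe: Record<string, AttributeValue> = {};
  for (const key of Object.keys(attrs)) {
    if (SENSITIVE_KEY.test(key)) continue; // never record sensitive fields
    const value = attrs[key];
    if (typeof value === 'string' && value.length > MAX_VALUE_LENGTH) {
      safe[key] = value.slice(0, MAX_VALUE_LENGTH) + '...'; // truncate long strings
    } else {
      safe[key] = value;
    }
  }
  return safe;
}

const cleaned = sanitizeAttributes({
  'order.id': '123',
  'payment.method': 'credit_card',
  'payment.card_number': '4111111111111111',
  'user.password': 'hunter2',
});
```

The result can then be attached in one call with span.setAttributes(cleaned). A key-based filter cannot catch sensitive values stored under innocent-looking keys, so it complements code review rather than replacing it.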

Context Propagation: Linking Traces Across Services

For distributed tracing to work, trace context must be carried between services. The W3C Trace Context standard achieves this via the traceparent HTTP header.

W3C traceparent header format
# Format: version-traceId-parentSpanId-traceFlags
traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01

# 00          = version (currently always 00)
# 4bf92f...   = 128-bit trace ID
# 00f067...   = 64-bit parent span ID
# 01          = trace flags (01 = sampled)

With instrumentation enabled, OpenTelemetry automatically adds this header to outgoing HTTP requests and reads it from incoming ones. Propagation is also supported via gRPC metadata, Kafka headers, and AMQP properties. It works between services written in different languages: when a Node.js service calls a Python service, the trace context is carried over automatically.
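
You rarely need to handle the header yourself, but parsing it by hand shows what the propagator does on each hop. The sketch below is a simplified illustration for version 00 headers only; the TraceContext shape and the parseTraceparent and formatTraceparent names are hypothetical helpers, not functions from the OpenTelemetry API.

```typescript
interface TraceContext {
  version: string;
  traceId: string;      // 32 lowercase hex chars (128 bits)
  parentSpanId: string; // 16 lowercase hex chars (64 bits)
  sampled: boolean;     // lowest bit of the trace flags
}

const TRACEPARENT = /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

// Returns null for anything that is not a well-formed four-field header.
function parseTraceparent(header: string): TraceContext | null {
  const match = TRACEPARENT.exec(header.trim());
  if (!match) return null;
  const version = match[1];
  const traceId = match[2];
  const parentSpanId = match[3];
  const flags = parseInt(match[4], 16);
  if (version === 'ff') return null; // reserved, invalid version
  if (/^0+$/.test(traceId) || /^0+$/.test(parentSpanId)) return null; // all-zero IDs are invalid
  return { version, traceId, parentSpanId, sampled: (flags & 1) === 1 };
}

// Build the header for an outgoing call: same trace ID, this service's span as parent.
function formatTraceparent(ctx: TraceContext, currentSpanId: string): string {
  return ['00', ctx.traceId, currentSpanId, ctx.sampled ? '01' : '00'].join('-');
}

const incoming = parseTraceparent('00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01');
const outgoing = incoming ? formatTraceparent(incoming, 'b7ad6b7169203331') : null;
```

The trace ID stays constant across every hop while the span ID field changes, which is what lets the backend stitch spans from different services into one tree.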

OTel Collector Deployment in Kubernetes

In Kubernetes, you can deploy the OTel Collector as a DaemonSet, as a sidecar, or as a standalone gateway Deployment. The DaemonSet approach runs one collector per node and collects telemetry from all pods on that node.

Deployment Model      Advantage                                   Disadvantage
DaemonSet             Resource efficient, centralized management  Single config per node
Sidecar               Per-service customization                   Higher resource consumption
Gateway (Deployment)  Centralized sampling and routing            Single point of failure risk

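Recommended production architecture: DaemonSet collectors gather data from nodes, a Gateway collector handles centralized sampling and routing, then forwards to Jaeger or Tempo. If you run more than one gateway replica to avoid a single point of failure, route spans by trace ID (for example, with the collector's loadbalancing exporter) so that tail sampling sees every span of a trace on the same instance.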

Production Best Practices

Sampling Strategy

Tracing every request in high-traffic systems is unsustainable from both performance and storage perspectives. Choosing the right sampling strategy is critical:

  • Head-based Sampling: Decision made at request start. Simple with low overhead, but may miss errored requests. Suitable for development environments.
  • Tail-based Sampling (Recommended): Decision made after trace completion. 100% of errored and slow requests are retained, with a percentage of successful requests sampled. Configured in the OTel Collector.
  • Rate Limiting: Limit the maximum number of traces per second. Keeps collector memory consumption under control during traffic spikes.

Minimizing Performance Impact

The OpenTelemetry SDK's impact on application performance is typically 1-3%, but misconfiguration can increase this. Key considerations:

  • Use the batch span processor (the NodeSDK default), which sends spans in batches rather than individually.
  • Disable fs instrumentation. File system operations generate too many spans and are usually unnecessary.
  • Keep attribute counts reasonable; 10-15 attributes per span is sufficient.
  • Avoid adding large string values (request body, SQL queries) as attributes.
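
The reason batching helps is that one network call carries many spans instead of one. The sketch below illustrates the idea with a size-triggered queue; the SpanBatcher class is hypothetical and much simpler than the SDK's batch span processor or the collector's batch processor, which also flush on a timer (the timeout: 5s setting in the collector configuration above) and bound the queue size.

```typescript
// Hypothetical, simplified batcher: collects items and exports them in groups.
class SpanBatcher<T> {
  private queue: T[] = [];
  private readonly maxBatchSize: number;
  private readonly exportBatch: (batch: T[]) => void;

  constructor(maxBatchSize: number, exportBatch: (batch: T[]) => void) {
    this.maxBatchSize = maxBatchSize;
    this.exportBatch = exportBatch;
  }

  // Queue a finished span; export as soon as the batch is full.
  add(span: T): void {
    this.queue.push(span);
    if (this.queue.length >= this.maxBatchSize) this.flush();
  }

  // Export whatever is queued. A real processor also calls this on a timer
  // and on shutdown so that partially filled batches are not lost.
  flush(): void {
    if (this.queue.length === 0) return;
    const batch = this.queue;
    this.queue = [];
    this.exportBatch(batch);
  }
}

const exportedBatches: string[][] = [];
const batcher = new SpanBatcher<string>(3, (batch) => { exportedBatches.push(batch); });
for (let i = 1; i <= 7; i++) batcher.add('span-' + i);
batcher.flush(); // stands in for the timer or shutdown hook
```

Seven spans leave the process in three export calls here instead of seven; with a batch size of 1024 the saving is correspondingly larger.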

For centralized log management, check our ELK Stack guide. For server metrics, see our Prometheus + Grafana guide. For container orchestration, review our Introduction to Kubernetes guide. The OpenTelemetry Official Documentation and Jaeger Documentation are useful additional resources.

Frequently Asked Questions

What is the difference between OpenTelemetry and Jaeger?

OpenTelemetry is a telemetry data collection and transmission framework (SDK + Collector). Jaeger is a backend that stores and visualizes this data. OTel produces the data, Jaeger consumes it. You can use Grafana Tempo, Zipkin, or Datadog instead of Jaeger.

How much does distributed tracing affect application performance?

A properly configured OpenTelemetry SDK typically adds 1-3% overhead. This impact is minimized with batch exporters, appropriate sampling rates, and disabling unnecessary instrumentations. On critical paths, head-based sampling can further reduce overhead.

Is distributed tracing necessary for monolithic applications?

The benefit of distributed tracing in a single monolithic application is limited. However, if your monolith communicates with databases, cache (Redis), and external APIs, tracing can be useful for monitoring the duration and errors of these calls. Early integration provides an advantage if you plan to migrate to microservices.

How long should I retain trace data?

Generally 7-14 days is sufficient. You can retain errored traces longer (30 days). Retention is set at the storage backend Jaeger uses, for example index cleanup for Elasticsearch or TTLs for Cassandra, so configure automatic deletion there. If you have compliance requirements (PCI-DSS, SOC 2), determine the duration according to the relevant standard.

Conclusion

Set up distributed tracing with OpenTelemetry to detect latency points and errors end-to-end in your microservice architecture. Build a vendor-agnostic telemetry pipeline with the OTel Collector, control storage costs with tail-based sampling, and visualize traces through the Jaeger UI.
