
Distributed Tracing with OpenTelemetry - Microservice Troubleshooting
Can Kaya
Security Specialist
In a microservice architecture, a single user request passes through multiple services, and latency or failure at any point affects the entire chain. Finding which service creates the bottleneck through traditional log analysis can take hours. With OpenTelemetry distributed tracing, you can track each request's journey across services end-to-end via trace IDs and pinpoint latency down to the millisecond. This guide covers everything from OpenTelemetry architecture to SDK integration, Jaeger setup to production best practices.
What Is Distributed Tracing and Why Do You Need It?
Distributed tracing is an observability method that tracks a request across all services in a distributed system. In a monolithic application a single stack trace usually reveals the problem, but in microservices a request travels from the API gateway through the auth, order, payment, and notification services, so no single stack trace covers the whole path.
Each service produces its own logs, but correlating these logs is difficult. Distributed tracing solves this with trace IDs and spans:
| Concept | Description | Example |
|---|---|---|
| Trace | The end-to-end journey of a request | GET /api/orders/123 across all services |
| Span | A single unit of work within a trace | order-service: DB query (45ms) |
| Trace ID | Unique identifier linking all spans together | 4bf92f3577b34da6a3ce929d0e0e4736 |
| Context Propagation | Carrying trace information between services | traceparent HTTP header |
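To make the relationship between these concepts concrete, here is a minimal sketch in plain Python (no OpenTelemetry dependency) that links spans sharing one trace ID into a tree. The `Span` class, the short span IDs, and the durations are illustrative assumptions, not the OpenTelemetry data model.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Span:
    trace_id: str
    span_id: str
    parent_id: Optional[str]  # None marks the root span of the trace
    name: str
    duration_ms: float
    children: List[Span] = field(default_factory=list)


def build_trace_tree(spans: List[Span]) -> Span:
    """Link spans of one trace into a tree via parent IDs and return the root."""
    by_id = {s.span_id: s for s in spans}
    root = None
    for s in spans:
        if s.parent_id is None:
            root = s
        else:
            by_id[s.parent_id].children.append(s)
    if root is None:
        raise ValueError("trace has no root span")
    return root


def render(span: Span, depth: int = 0) -> List[str]:
    """Return one indented line per span, parents before children."""
    lines = [f"{'  ' * depth}{span.name} ({span.duration_ms:g}ms)"]
    for child in span.children:
        lines.extend(render(child, depth + 1))
    return lines


# Hypothetical spans for a single request; all share the same trace ID.
TRACE_ID = "4bf92f3577b34da6a3ce929d0e0e4736"
spans = [
    Span(TRACE_ID, "a1", None, "api-gateway: GET /api/orders/123", 120),
    Span(TRACE_ID, "b2", "a1", "auth-service: verify token", 12),
    Span(TRACE_ID, "c3", "a1", "order-service: load order", 80),
    Span(TRACE_ID, "d4", "c3", "order-service: DB query", 45),
]
tree_lines = render(build_trace_tree(spans))
print("\n".join(tree_lines))
```

A tracing backend performs essentially this grouping when it draws the waterfall view of a trace.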
OpenTelemetry Architecture
OpenTelemetry (OTel) is a vendor-agnostic observability framework hosted by the CNCF. It supports three signal types: traces, metrics, and logs. The previously separate OpenTracing and OpenCensus projects were merged under OpenTelemetry in 2019.
💡 Why OpenTelemetry? Send data to any backend - Jaeger, Zipkin, Datadog, Grafana Tempo, or AWS X-Ray - without vendor lock-in. SDKs are available in 11+ languages, and auto-instrumentation lets you start tracing without code changes.
Core Components
- SDK (Software Development Kit): Library added to your application. Creates spans, handles context propagation, and sends telemetry data to the exporter.
- OTel Collector: Standalone service that receives, processes, and exports telemetry data to backends. Consists of receiver, processor, and exporter pipelines.
- Backend (Jaeger, Tempo, Zipkin): System that stores and visualizes trace data. Jaeger is an open-source, production-ready option.
- Auto-Instrumentation: Libraries that automatically trace HTTP, gRPC, database, and message queue calls. No code changes required.
Jaeger + OTel Collector Docker Compose Setup
Jaeger is a distributed tracing backend developed by Uber and a CNCF graduated project. Combined with the OTel Collector, it collects and visualizes trace data from your applications.
    version: "3.8"

    services:
      jaeger:
        image: jaegertracing/all-in-one:1.54
        ports:
          - "16686:16686" # Jaeger UI
        # The OTLP ports (4317/4318) are not published on the host here:
        # the collector below already binds them, and it reaches Jaeger
        # at jaeger:4317 over the shared "tracing" network.
        environment:
          - COLLECTOR_OTLP_ENABLED=true
        networks:
          - tracing

      otel-collector:
        image: otel/opentelemetry-collector-contrib:0.96.0
        command: ["--config=/etc/otel-collector-config.yaml"]
        volumes:
          - ./otel-collector-config.yaml:/etc/otel-collector-config.yaml
        ports:
          - "4317:4317" # OTLP gRPC receiver
          - "4318:4318" # OTLP HTTP receiver
          - "8889:8889" # Prometheus metrics
        depends_on:
          - jaeger
        networks:
          - tracing

    networks:
      tracing:
        driver: bridge
The OTel Collector configuration file defines the receiver, processor, and exporter pipeline:
    receivers:
      otlp:
        protocols:
          grpc:
            endpoint: 0.0.0.0:4317
          http:
            endpoint: 0.0.0.0:4318

    processors:
      batch:
        timeout: 5s
        send_batch_size: 1024
      memory_limiter:
        check_interval: 1s
        limit_mib: 512
        spike_limit_mib: 128
      tail_sampling:
        decision_wait: 10s
        policies:
          - name: errors-policy
            type: status_code
            status_code: { status_codes: [ERROR] }
          - name: slow-traces
            type: latency
            latency: { threshold_ms: 2000 }
          - name: percentage-sample
            type: probabilistic
            probabilistic: { sampling_percentage: 10 }

    exporters:
      otlp/jaeger:
        endpoint: jaeger:4317
        tls:
          insecure: true

    service:
      pipelines:
        traces:
          receivers: [otlp]
          processors: [memory_limiter, tail_sampling, batch]
          exporters: [otlp/jaeger]
⚠️ Important: In production, use the tail_sampling processor to retain only errored, slow, or a percentage of traces. Storing all traces rapidly increases storage costs. The configuration above retains 100% of errored requests, requests slower than 2 seconds, and 10% of the rest.
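The decision logic those three policies describe can be sketched in a few lines of plain Python. This is an illustration of the idea only, not the Collector's implementation; the `FinishedTrace` class and `keep_trace` function are hypothetical names, and the thresholds mirror the configuration above.

```python
import random
from dataclasses import dataclass


@dataclass
class FinishedTrace:
    has_error: bool      # any span in the trace ended with status ERROR
    duration_ms: float   # end-to-end duration of the whole trace


def keep_trace(
    trace: FinishedTrace,
    rng: random.Random,
    latency_threshold_ms: float = 2000.0,
    sampling_percentage: float = 10.0,
) -> bool:
    """Keep a trace if ANY policy matches, evaluated once the trace is complete."""
    if trace.has_error:                            # errors-policy
        return True
    if trace.duration_ms > latency_threshold_ms:   # slow-traces
        return True
    # percentage-sample: keep roughly sampling_percentage% of everything else
    return rng.random() * 100.0 < sampling_percentage


rng = random.Random(7)  # seeded only so the demo is repeatable
healthy = [FinishedTrace(has_error=False, duration_ms=120.0) for _ in range(1000)]
kept = sum(keep_trace(t, rng) for t in healthy)
print(f"kept {kept} of {len(healthy)} healthy traces")
print("errored trace kept:", keep_trace(FinishedTrace(True, 35.0), rng))
print("slow trace kept:", keep_trace(FinishedTrace(False, 2500.0), rng))
```

Because the decision needs the finished trace, the collector has to buffer spans for the `decision_wait` period, which is why the `memory_limiter` processor sits first in the pipeline.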
OpenTelemetry Setup in Node.js Applications
The OpenTelemetry SDK for Node.js offers auto-instrumentation support. Express, Fastify, HTTP, gRPC, PostgreSQL, Redis, and many more libraries are automatically traced.
    npm install @opentelemetry/sdk-node \
      @opentelemetry/auto-instrumentations-node \
      @opentelemetry/exporter-trace-otlp-grpc \
      @opentelemetry/resources \
      @opentelemetry/semantic-conventions
Load the tracing configuration at the very beginning of your application (before other imports):
    import { NodeSDK } from '@opentelemetry/sdk-node';
    import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
    import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-grpc';
    import { Resource } from '@opentelemetry/resources';
    import {
      ATTR_SERVICE_NAME,
      ATTR_SERVICE_VERSION,
    } from '@opentelemetry/semantic-conventions';

    const sdk = new NodeSDK({
      // Note: on SDK 2.x the Resource class is no longer exported; there,
      // build the resource with resourceFromAttributes({...}) instead.
      resource: new Resource({
        [ATTR_SERVICE_NAME]: 'order-service',
        [ATTR_SERVICE_VERSION]: '1.2.0',
        'deployment.environment': process.env.NODE_ENV || 'development',
      }),
      traceExporter: new OTLPTraceExporter({
        url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT || 'http://localhost:4317',
      }),
      instrumentations: [
        getNodeAutoInstrumentations({
          // File system spans are noisy and rarely useful
          '@opentelemetry/instrumentation-fs': { enabled: false },
        }),
      ],
    });

    sdk.start();
    console.log('OpenTelemetry tracing initialized');

    // Flush pending spans before exiting, even if shutdown reports an error
    process.on('SIGTERM', () => {
      sdk
        .shutdown()
        .catch((err) => console.error('Error shutting down OpenTelemetry', err))
        .finally(() => process.exit(0));
    });
Load the tracing file first when starting your application:
    # Preload the compiled tracing module (TypeScript projects: build first)
    node --require ./tracing.js dist/main.js

    # Or with environment variables
    OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4317 \
      node --require ./tracing.js dist/main.js
OpenTelemetry Setup in Python Applications
The Python SDK provides auto-instrumentation for popular libraries including Flask, Django, FastAPI, SQLAlchemy, Redis, and requests.
    pip install opentelemetry-distro opentelemetry-exporter-otlp
    opentelemetry-bootstrap -a install

    # Launch with zero-code instrumentation
    OTEL_SERVICE_NAME=payment-service \
    OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4317 \
      opentelemetry-instrument python app.py
Creating Manual Spans and Enrichment
Auto-instrumentation automatically traces HTTP and database calls, but you need to create manual spans to trace business logic operations (payment validation, inventory checks, price calculations).
    import { trace, SpanStatusCode } from '@opentelemetry/api';

    // inventoryService and paymentService are assumed to be this
    // application's own clients, defined elsewhere.
    const tracer = trace.getTracer('order-service');

    async function processOrder(orderId: string) {
      return tracer.startActiveSpan('processOrder', async (span) => {
        try {
          // Add business logic info to span
          span.setAttribute('order.id', orderId);
          span.setAttribute('order.source', 'web');

          // Child span: inventory check
          const stock = await tracer.startActiveSpan(
            'checkInventory',
            async (childSpan) => {
              try {
                const result = await inventoryService.check(orderId);
                childSpan.setAttribute('inventory.available', result.available);
                return result;
              } finally {
                childSpan.end(); // end the span even if the check throws
              }
            }
          );

          // Child span: payment processing
          await tracer.startActiveSpan(
            'processPayment',
            async (paymentSpan) => {
              try {
                paymentSpan.setAttribute('payment.method', 'credit_card');
                await paymentService.charge(orderId);
              } finally {
                paymentSpan.end(); // end the span even if the charge throws
              }
            }
          );

          span.setStatus({ code: SpanStatusCode.OK });
        } catch (error) {
          // A caught value is not guaranteed to be an Error instance
          const err = error instanceof Error ? error : new Error(String(error));
          span.setStatus({ code: SpanStatusCode.ERROR, message: err.message });
          span.recordException(err);
          throw error;
        } finally {
          span.end();
        }
      });
    }
💡 Best Practice: Never add sensitive data (credit card numbers, passwords, personal information) to span attributes. Only add business logic information that helps with troubleshooting (order ID, user type, payment method).
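One way to enforce this rule is to filter attributes before they reach a span. The following is a minimal sketch in plain Python; the `SENSITIVE_KEY_PARTS` list and the attribute names are hypothetical examples, and a real deny-list would need to reflect your own data model.

```python
# Substrings that mark an attribute key as sensitive (illustrative, not exhaustive).
SENSITIVE_KEY_PARTS = ("password", "card_number", "cvv", "secret", "token", "email")


def safe_attributes(attributes: dict) -> dict:
    """Return a copy of the attributes without keys that look sensitive."""
    clean = {}
    for key, value in attributes.items():
        lowered = key.lower()
        if any(part in lowered for part in SENSITIVE_KEY_PARTS):
            continue  # drop the attribute entirely rather than masking it
        clean[key] = value
    return clean


raw = {
    "order.id": "123",
    "payment.method": "credit_card",
    "payment.card_number": "4111111111111111",
    "user.password": "hunter2",
}
filtered = safe_attributes(raw)
print(filtered)
```

The filtered dictionary can then be passed to whatever call sets attributes on the span, so the check lives in one place instead of at every call site.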
Context Propagation: Linking Traces Across Services
For distributed tracing to work, trace context must be carried between services. The W3C Trace Context standard achieves this via the traceparent HTTP header.
    # Format: version-traceId-parentSpanId-traceFlags
    traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01

    # 00        = version (W3C spec v1)
    # 4bf92f... = 128-bit trace ID
    # 00f067... = 64-bit parent span ID
    # 01        = trace flags (01 = sampled)
OpenTelemetry SDKs automatically add and read this header in HTTP requests. Propagation is also supported via gRPC metadata, Kafka headers, and AMQP properties. It works seamlessly between services written in different languages - when a Node.js service calls a Python service, the trace context is automatically carried over.
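To show what the SDKs do with this header, here is a simplified sketch in plain Python that parses a version-00 `traceparent` value and builds the header for an outgoing call. The function and class names are hypothetical, and the parser is stricter than the W3C specification requires for future versions; in practice the SDK's propagator handles this for you.

```python
import re
from typing import NamedTuple, Optional

# version - 32 hex trace ID - 16 hex parent span ID - 2 hex flags (lowercase only)
_TRACEPARENT_RE = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


class TraceContext(NamedTuple):
    version: str
    trace_id: str
    parent_span_id: str
    sampled: bool


def parse_traceparent(header: str) -> Optional[TraceContext]:
    """Parse a traceparent header; return None if it is malformed."""
    match = _TRACEPARENT_RE.match(header.strip())
    if match is None:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # Version ff and all-zero IDs are invalid according to the specification.
    if version == "ff" or set(trace_id) == {"0"} or set(parent_id) == {"0"}:
        return None
    sampled = bool(int(flags, 16) & 0x01)  # lowest bit is the sampled flag
    return TraceContext(version, trace_id, parent_id, sampled)


def child_traceparent(ctx: TraceContext, new_span_id: str) -> str:
    """Build the header for an outgoing call: same trace ID, new parent span ID."""
    flags = "01" if ctx.sampled else "00"
    return f"00-{ctx.trace_id}-{new_span_id}-{flags}"


incoming = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
ctx = parse_traceparent(incoming)
outgoing = child_traceparent(ctx, "b7ad6b7169203331")  # span ID of the calling span
print(ctx)
print(outgoing)
```

The trace ID stays constant across every hop, while the parent span ID changes at each service; that is what lets the backend reassemble the spans into one trace.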
OTel Collector Deployment in Kubernetes
In Kubernetes, you can deploy the OTel Collector as a DaemonSet, a sidecar, or a gateway Deployment. The DaemonSet approach runs one collector per node and collects telemetry from all pods on that node.
| Deployment Model | Advantage | Disadvantage |
|---|---|---|
| DaemonSet | Resource efficient, centralized management | Single config per node |
| Sidecar | Per-service customization | Higher resource consumption |
| Gateway (Deployment) | Centralized sampling and routing | Single point of failure risk |
Recommended production architecture: DaemonSet collectors gather data from nodes, a Gateway collector handles centralized sampling and routing, then forwards to Jaeger or Tempo.
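One detail matters when the gateway tier runs more than one replica: tail-based sampling only works if every span of a trace reaches the same gateway instance. The sketch below illustrates the idea of routing by trace ID in plain Python; the gateway hostnames and the `pick_gateway` function are hypothetical, and the simple modulo scheme stands in for the consistent hashing a production load balancer would use.

```python
import hashlib
from typing import List


def pick_gateway(trace_id: str, gateways: List[str]) -> str:
    """Map a trace ID to one gateway so all spans of a trace land together."""
    digest = hashlib.sha256(trace_id.encode("ascii")).digest()
    index = int.from_bytes(digest[:8], "big") % len(gateways)
    return gateways[index]


# Hypothetical gateway collector replicas
GATEWAYS = ["otel-gateway-0:4317", "otel-gateway-1:4317", "otel-gateway-2:4317"]

trace_id = "4bf92f3577b34da6a3ce929d0e0e4736"
# Two node-level collectors that see different spans of the same trace
from_node_a = pick_gateway(trace_id, GATEWAYS)
from_node_b = pick_gateway(trace_id, GATEWAYS)
print(from_node_a, from_node_b)
```

The Collector contrib distribution includes a load-balancing exporter intended for this kind of trace-ID-based routing, so you would normally configure that rather than implement the routing yourself.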
Production Best Practices
Sampling Strategy
Tracing every request in high-traffic systems is unsustainable from both performance and storage perspectives. Choosing the right sampling strategy is critical:
- Head-based Sampling: Decision made at request start. Simple with low overhead, but may miss errored requests. Suitable for development environments.
- Tail-based Sampling (Recommended): Decision made after trace completion. 100% of errored and slow requests are retained, with a percentage of successful requests sampled. Configured in the OTel Collector.
- Rate Limiting: Limit the maximum number of traces per second. Keeps collector memory consumption under control during traffic spikes.
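Head-based sampling and rate limiting are both simple enough to sketch in plain Python. The `head_sample` function below derives the decision from the trace ID, so every service reaches the same verdict without coordination, and `TraceRateLimiter` is a basic token bucket. Both are illustrations under assumed names, not the samplers shipped with the OpenTelemetry SDKs.

```python
from typing import Optional


def head_sample(trace_id: str, ratio: float) -> bool:
    """Decide at request start, using only the trace ID (deterministic)."""
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be between 0.0 and 1.0")
    low_64_bits = int(trace_id[-16:], 16)   # last 16 hex characters
    return low_64_bits < int(ratio * (1 << 64))


class TraceRateLimiter:
    """Token bucket: allows at most `traces_per_second` on average."""

    def __init__(self, traces_per_second: float) -> None:
        self.rate = traces_per_second
        self.tokens = traces_per_second      # start with a full bucket
        self.last: Optional[float] = None    # timestamp of the previous call

    def allow(self, now: float) -> bool:
        if self.last is not None:
            # Refill in proportion to elapsed time, capped at the bucket size
            refill = (now - self.last) * self.rate
            self.tokens = min(self.rate, self.tokens + refill)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


trace_id = "4bf92f3577b34da6a3ce929d0e0e4736"
print("sampled at 70%:", head_sample(trace_id, 0.7))
print("sampled at 50%:", head_sample(trace_id, 0.5))

limiter = TraceRateLimiter(traces_per_second=2)
decisions = [limiter.allow(0.0), limiter.allow(0.0), limiter.allow(0.0), limiter.allow(0.5)]
print(decisions)
```

Passing the timestamp in explicitly keeps the limiter easy to test; in a service you would pass `time.monotonic()`.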
Minimizing Performance Impact
The OpenTelemetry SDK's impact on application performance is typically 1-3%, but misconfiguration can increase this. Key considerations:
- Use the batch exporter (default). Send spans in batches rather than individually.
- Disable fs instrumentation. File system operations generate too many spans and are usually unnecessary.
- Keep attribute counts reasonable; 10-15 attributes per span is sufficient.
- Avoid adding large string values (request body, SQL queries) as attributes.
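The attribute guidance can be enforced with a small helper, sketched here in plain Python. The limits of 15 attributes and 256 characters are assumptions chosen for illustration, and `trim_attributes` is a hypothetical name.

```python
MAX_ATTRIBUTES = 15      # assumed cap, in line with the 10-15 guideline
MAX_VALUE_LENGTH = 256   # assumed cap on string attribute values


def trim_attributes(
    attributes: dict,
    max_attributes: int = MAX_ATTRIBUTES,
    max_value_length: int = MAX_VALUE_LENGTH,
) -> dict:
    """Keep the first max_attributes entries and truncate long string values."""
    trimmed = {}
    for key, value in list(attributes.items())[:max_attributes]:
        if isinstance(value, str) and len(value) > max_value_length:
            value = value[:max_value_length]
        trimmed[key] = value
    return trimmed


# 20 small attributes plus one oversized value placed first
attrs = {"http.request.body": "x" * 10_000}
attrs.update({f"attr.{i}": i for i in range(20)})
result = trim_attributes(attrs)
print(len(result), len(result["http.request.body"]))
```

The SDKs also expose span limits through environment variables such as `OTEL_SPAN_ATTRIBUTE_COUNT_LIMIT` and `OTEL_SPAN_ATTRIBUTE_VALUE_LENGTH_LIMIT`, which apply the same kind of cap without custom code.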
For centralized log management, check our ELK Stack guide. For server metrics, see our Prometheus + Grafana guide. For container orchestration, review our Introduction to Kubernetes guide. The OpenTelemetry Official Documentation and Jaeger Documentation are useful additional resources.
Frequently Asked Questions
What is the difference between OpenTelemetry and Jaeger?
OpenTelemetry is a telemetry data collection and transmission framework (SDK + Collector). Jaeger is a backend that stores and visualizes this data. OTel produces the data, Jaeger consumes it. You can use Grafana Tempo, Zipkin, or Datadog instead of Jaeger.
How much does distributed tracing affect application performance?
A properly configured OpenTelemetry SDK typically adds 1-3% overhead. This impact is minimized with batch exporters, appropriate sampling rates, and disabling unnecessary instrumentations. On critical paths, head-based sampling can further reduce overhead.
Is distributed tracing necessary for monolithic applications?
The benefit of distributed tracing in a single monolithic application is limited. However, if your monolith communicates with databases, cache (Redis), and external APIs, tracing can be useful for monitoring the duration and errors of these calls. Early integration provides an advantage if you plan to migrate to microservices.
How long should I retain trace data?
Generally 7-14 days is sufficient. You can retain errored traces longer (30 days). Jaeger relies on its storage backend for retention, so configure automatic deletion there (for example, TTLs in Cassandra or scheduled index cleanup in Elasticsearch). If you have compliance requirements (PCI-DSS, SOC 2), determine the duration according to the relevant standard.
Conclusion
Set up distributed tracing with OpenTelemetry to detect latency points and errors end-to-end in your microservice architecture. Build a vendor-agnostic telemetry pipeline with the OTel Collector, control storage costs with tail-based sampling, and visualize traces through the Jaeger UI.
About the author: Can Kaya, Security Specialist. CISSP-certified security expert creating content on cybersecurity, DDoS protection, and server hardening.