OpenTelemetry: The Complete Developer's Guide to Distributed Tracing
You deploy a microservices architecture. A user reports that checkout is slow. You check the API gateway logs -- response time looks normal. You check the order service -- fine. The payment service -- fine. The inventory service -- also fine. But somehow the end-to-end request takes 4 seconds. Where did the time go?
This is the problem distributed tracing solves, and OpenTelemetry (OTel) is how you implement it without locking yourself into a specific vendor. It's the CNCF project that unified OpenTracing and OpenCensus into a single standard for telemetry data. Every major observability platform -- Datadog, Grafana, Honeycomb, New Relic, Jaeger -- now speaks OpenTelemetry.
The Three Pillars, Explained Simply
OpenTelemetry deals with three types of telemetry data. You've heard these called "the three pillars of observability," but that framing obscures how they actually work together.
Traces follow a single request across service boundaries. A trace is a tree of "spans" -- each span represents a unit of work (an HTTP request, a database query, a function call). When service A calls service B which calls service C, you get a trace showing exactly how long each step took and where failures occurred.
Metrics are aggregated measurements over time -- request count, error rate, response time percentiles, queue depth. Unlike traces (which capture individual requests), metrics summarize behavior across all requests. They're cheap to collect and ideal for dashboards and alerting.
Logs are timestamped text records of events. The key insight from OTel: logs become much more useful when they're correlated with traces. Instead of searching through millions of log lines, you find the trace for a slow request and see exactly which log entries belong to it.
Core Concepts
Before writing code, you need to understand a few OTel primitives.
Spans and Traces
A span has:
- A name (e.g., "HTTP GET /api/orders")
- A start time and duration
- A parent span ID (except for the root span)
- Attributes (key-value metadata)
- Events (timestamped annotations within the span)
- A status (OK, ERROR, UNSET)
A trace is the entire tree of spans that originates from a single root span. The trace ID propagates across service boundaries via HTTP headers (typically traceparent from the W3C Trace Context standard).
Trace: abc123
├── [210ms] HTTP GET /checkout (api-gateway)
│ ├── [12ms] HTTP POST /orders (order-service)
│ │ └── [8ms] INSERT INTO orders (postgres)
│ ├── [180ms] HTTP POST /payment (payment-service) <-- slow!
│ │ ├── [150ms] Stripe API call (external) <-- root cause
│ │ └── [3ms] UPDATE orders (postgres)
│ └── [5ms] HTTP POST /inventory (inventory-service)
│ └── [2ms] UPDATE stock (postgres)
Looking at this trace, you immediately see the Stripe API call is responsible for the slow checkout. Without tracing, you'd be guessing.
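The span model above can be made concrete with a toy sketch. This is not OTel's API -- the Span shape here is invented for illustration, and the durations loosely mirror the example trace -- but it shows how a trace tree pinpoints a bottleneck via "self time" (a span's duration minus the time spent in its children):

```typescript
// Simplified span shape for illustration; a real OTel span also carries
// attributes, events, and a status (see the list above).
interface Span {
  name: string;
  durationMs: number;
  children: Span[];
}

// A trace tree loosely mirroring the checkout example above
const checkoutTrace: Span = {
  name: 'HTTP GET /checkout', durationMs: 210, children: [
    { name: 'HTTP POST /orders', durationMs: 12, children: [
      { name: 'INSERT INTO orders', durationMs: 8, children: [] },
    ]},
    { name: 'HTTP POST /payment', durationMs: 180, children: [
      { name: 'Stripe API call', durationMs: 150, children: [] },
      { name: 'UPDATE orders', durationMs: 3, children: [] },
    ]},
    { name: 'HTTP POST /inventory', durationMs: 5, children: [
      { name: 'UPDATE stock', durationMs: 2, children: [] },
    ]},
  ],
};

// "Self time" = a span's duration minus its children's durations;
// the span with the largest self time is where the request actually waited.
function slowestBySelfTime(span: Span): { name: string; selfMs: number } {
  const childTotal = span.children.reduce((sum, c) => sum + c.durationMs, 0);
  let worst = { name: span.name, selfMs: span.durationMs - childTotal };
  for (const child of span.children) {
    const candidate = slowestBySelfTime(child);
    if (candidate.selfMs > worst.selfMs) worst = candidate;
  }
  return worst;
}

console.log(slowestBySelfTime(checkoutTrace));
// → { name: 'Stripe API call', selfMs: 150 }
```

This is exactly the reasoning a trace viewer's flame graph does visually: the widest bar with no children underneath it is your culprit.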
The OTel Collector
The OpenTelemetry Collector is a vendor-agnostic proxy that receives, processes, and exports telemetry data. You deploy it as a sidecar or standalone service, and your applications send telemetry to it instead of directly to your backend.
App → OTel Collector → Jaeger (traces)
                     → Prometheus (metrics)
                     → Loki (logs)
This architecture means you can switch observability backends without changing application code. The Collector handles batching, retry, sampling, and data transformation.
Instrumenting a Node.js Application
Here's how to add OpenTelemetry to an Express application. The process has three parts: install the SDK, configure providers, and add instrumentation.
Installation
npm install @opentelemetry/sdk-node \
  @opentelemetry/api \
  @opentelemetry/auto-instrumentations-node \
  @opentelemetry/exporter-trace-otlp-http \
  @opentelemetry/exporter-metrics-otlp-http
Configuration
Create a tracing.ts file that initializes OTel before your app starts:
import { NodeSDK } from '@opentelemetry/sdk-node';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';
import { OTLPMetricExporter } from '@opentelemetry/exporter-metrics-otlp-http';
import { PeriodicExportingMetricReader } from '@opentelemetry/sdk-metrics';
import { Resource } from '@opentelemetry/resources';
import {
  ATTR_SERVICE_NAME,
  ATTR_SERVICE_VERSION,
} from '@opentelemetry/semantic-conventions';

const sdk = new NodeSDK({
  resource: new Resource({
    [ATTR_SERVICE_NAME]: 'order-service',
    [ATTR_SERVICE_VERSION]: '1.4.2',
  }),
  traceExporter: new OTLPTraceExporter({
    url: 'http://otel-collector:4318/v1/traces',
  }),
  metricReader: new PeriodicExportingMetricReader({
    exporter: new OTLPMetricExporter({
      url: 'http://otel-collector:4318/v1/metrics',
    }),
    exportIntervalMillis: 15000,
  }),
  instrumentations: [
    getNodeAutoInstrumentations({
      // Disable noisy fs instrumentation
      '@opentelemetry/instrumentation-fs': { enabled: false },
    }),
  ],
});

sdk.start();

process.on('SIGTERM', () => {
  sdk.shutdown().then(() => process.exit(0));
});
Load this before your application code:
node --require ./tracing.js ./server.js
# Or with ts-node:
node --require ts-node/register --require ./tracing.ts ./server.ts
Auto-Instrumentation vs Manual Spans
The auto-instrumentation package automatically creates spans for HTTP requests, database queries, gRPC calls, and many other libraries. This gives you 80% of the value with zero code changes.
For the remaining 20%, add manual spans around business logic:
import { trace, SpanStatusCode } from '@opentelemetry/api';
const tracer = trace.getTracer('order-service');
async function processOrder(orderId: string, items: OrderItem[]) {
  return tracer.startActiveSpan('processOrder', async (span) => {
    span.setAttribute('order.id', orderId);
    span.setAttribute('order.item_count', items.length);
    try {
      // Validate inventory
      await tracer.startActiveSpan('validateInventory', async (childSpan) => {
        try {
          for (const item of items) {
            const available = await checkStock(item.sku, item.quantity);
            if (!available) {
              childSpan.addEvent('insufficient_stock', {
                'item.sku': item.sku,
                'item.requested': item.quantity,
              });
              childSpan.setStatus({ code: SpanStatusCode.ERROR });
              throw new Error(`Insufficient stock for ${item.sku}`);
            }
          }
        } finally {
          // End the child span even when validation throws,
          // so it isn't leaked as an unfinished span
          childSpan.end();
        }
      });

      // Process payment
      const paymentResult = await processPayment(orderId, calculateTotal(items));
      span.setAttribute('payment.transaction_id', paymentResult.transactionId);
      span.setStatus({ code: SpanStatusCode.OK });
      return { success: true, transactionId: paymentResult.transactionId };
    } catch (error) {
      span.setStatus({
        code: SpanStatusCode.ERROR,
        message: (error as Error).message,
      });
      span.recordException(error as Error);
      throw error;
    } finally {
      span.end();
    }
  });
}
Instrumenting Python
Python's OTel SDK follows the same pattern. Auto-instrumentation covers Flask, Django, FastAPI, SQLAlchemy, and most common libraries.
pip install opentelemetry-api \
  opentelemetry-sdk \
  opentelemetry-exporter-otlp \
  opentelemetry-instrumentation-flask \
  opentelemetry-instrumentation-sqlalchemy \
  opentelemetry-instrumentation-requests
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import Resource
from opentelemetry.instrumentation.flask import FlaskInstrumentor
from opentelemetry.instrumentation.sqlalchemy import SQLAlchemyInstrumentor
# Configure the tracer
resource = Resource.create({
    "service.name": "user-service",
    "service.version": "2.1.0",
})

provider = TracerProvider(resource=resource)
processor = BatchSpanProcessor(
    OTLPSpanExporter(endpoint="http://otel-collector:4317")
)
provider.add_span_processor(processor)
trace.set_tracer_provider(provider)

# Auto-instrument Flask and SQLAlchemy
FlaskInstrumentor().instrument()
SQLAlchemyInstrumentor().instrument()

tracer = trace.get_tracer(__name__)

@app.route('/users/<user_id>')
def get_user(user_id):
    with tracer.start_as_current_span("fetch_user_profile") as span:
        span.set_attribute("user.id", user_id)
        user = db.session.query(User).get(user_id)
        if not user:
            span.set_attribute("user.found", False)
            abort(404)
        span.set_attribute("user.found", True)
        return jsonify(user.to_dict())
Setting Up the OTel Collector
The Collector configuration is YAML-based. Here's a production-oriented config:
# otel-collector-config.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch:
    timeout: 5s
    send_batch_size: 1024
  # Head-based 10% sampling, as an alternative; not wired into the
  # pipelines below, because tail_sampling's probabilistic policy covers it
  probabilistic_sampler:
    sampling_percentage: 10
  # Always keep traces with errors and slow requests,
  # sample 10% of everything else
  tail_sampling:
    decision_wait: 10s
    policies:
      - name: errors
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: slow-requests
        type: latency
        latency:
          threshold_ms: 1000
      - name: probabilistic
        type: probabilistic
        probabilistic:
          sampling_percentage: 10
  memory_limiter:
    check_interval: 1s
    limit_mib: 512
    spike_limit_mib: 128

exporters:
  otlp/jaeger:
    endpoint: jaeger:4317
    tls:
      insecure: true
  prometheus:
    endpoint: 0.0.0.0:8889
  loki:
    endpoint: http://loki:3100/loki/api/v1/push

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, tail_sampling, batch]
      exporters: [otlp/jaeger]
    metrics:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [prometheus]
    logs:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [loki]
Deploy with Docker Compose:
services:
  otel-collector:
    image: otel/opentelemetry-collector-contrib:latest
    command: ["--config=/etc/otel-collector-config.yaml"]
    volumes:
      - ./otel-collector-config.yaml:/etc/otel-collector-config.yaml
    ports:
      - "4317:4317"   # OTLP gRPC
      - "4318:4318"   # OTLP HTTP
      - "8889:8889"   # Prometheus metrics
Sampling Strategies
In production, you can't trace every request -- the data volume and cost would be enormous. Sampling strategies let you collect enough data to debug problems while keeping costs manageable.
| Strategy | How It Works | Best For |
|---|---|---|
| Head-based (probabilistic) | Decide at the start of a trace whether to sample it | High-throughput services with uniform traffic |
| Tail-based | Decide after the trace completes, based on outcomes | Keeping all errors and slow requests |
| Rate-limiting | Sample up to N traces per second | Controlling exact ingestion volume |
| Always-on (debug) | Sample 100% | Development and staging environments |
Tail-based sampling is the most useful for production debugging because it guarantees you capture the traces that matter -- errors and high-latency requests -- while sampling down the happy path.
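Head-based sampling is typically deterministic on the trace ID, so that every service in a call chain makes the same keep/drop decision for the same trace. The sketch below shows the core idea; it is not the SDK's exact algorithm (OTel's TraceIdRatioBased sampler has its own precise definition), just an illustration of deterministic sampling:

```typescript
// Decide whether to sample a trace from its ID, deterministically:
// interpret the last 8 hex digits of the 128-bit trace ID as a 32-bit
// number and compare it against the sampling ratio. Any service applying
// the same rule to the same trace ID reaches the same decision.
function shouldSample(traceId: string, ratio: number): boolean {
  const low32 = parseInt(traceId.slice(-8), 16);  // 0 .. 2^32 - 1
  return low32 < ratio * 0x100000000;
}

// The example trace ID from the traceparent section
shouldSample('0af7651916cd43dd8448eb211c80319c', 0.10);  // → false
shouldSample('0af7651916cd43dd8448eb211c80319c', 0.20);  // → true
```

The payoff of determinism: no coordination between services is needed, yet you never get a trace where some services sampled their spans and others dropped theirs.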
Context Propagation: How Traces Cross Service Boundaries
The magic of distributed tracing is that a single trace follows a request across multiple services. This works through context propagation -- trace context is serialized into HTTP headers (or gRPC metadata) and deserialized by the receiving service.
The W3C Trace Context standard defines two headers, traceparent and tracestate. The traceparent header carries four hyphen-separated fields:

traceparent: 00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01

- 00 -- version
- 0af7651916cd43dd8448eb211c80319c -- trace-id (128 bits as 32 hex chars)
- b7ad6b7169203331 -- parent-id, the span ID of the calling span (64 bits as 16 hex chars)
- 01 -- trace-flags (low bit set = sampled)
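To make the format concrete, here is a hand-rolled parser for illustration only -- in real code the SDK's W3C propagator handles this, and you should never parse these headers yourself:

```typescript
// Parse a W3C traceparent header into its four hyphen-separated fields.
// For illustration only; use the SDK's W3CTraceContextPropagator in practice.
interface TraceParent {
  version: string;
  traceId: string;    // 32 hex chars (128 bits)
  parentId: string;   // 16 hex chars (64 bits)
  sampled: boolean;   // low bit of the trace-flags field
}

function parseTraceparent(header: string): TraceParent | null {
  const match = /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/.exec(header);
  if (!match) return null;  // malformed header: ignore rather than guess
  const [, version, traceId, parentId, flags] = match;
  return { version, traceId, parentId, sampled: (parseInt(flags, 16) & 1) === 1 };
}

parseTraceparent('00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01');
// → version '00', the trace and parent IDs, and sampled: true
```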
When service A calls service B, the OTel SDK automatically:
- Injects the current trace context into outgoing request headers
- Extracts the trace context from incoming request headers
- Creates child spans that reference the parent span ID
If you're using auto-instrumentation, this happens transparently. For custom HTTP clients or message queues, you might need to propagate context manually:
import { propagation, context } from '@opentelemetry/api';
// Inject context into outgoing request headers
const headers = {};
propagation.inject(context.active(), headers);
await fetch('http://other-service/api', { headers });
// Extract context from incoming request
const parentContext = propagation.extract(context.active(), req.headers);
context.with(parentContext, () => {
  // Spans created here will be children of the incoming trace
});
Custom Metrics
Beyond traces, OTel's metrics API lets you define application-specific metrics:
import { metrics } from '@opentelemetry/api';
const meter = metrics.getMeter('order-service');
// Counter: things that only go up
const ordersCreated = meter.createCounter('orders.created', {
  description: 'Total number of orders created',
});

// Histogram: distribution of values
const orderValue = meter.createHistogram('orders.value', {
  description: 'Order value in cents',
  unit: 'cents',
});

// Up-down counter: things that go up and down
const activeConnections = meter.createUpDownCounter('db.connections.active', {
  description: 'Number of active database connections',
});
// Usage
ordersCreated.add(1, { 'order.type': 'subscription', 'order.region': 'us-west' });
orderValue.record(4999, { 'payment.method': 'card' });
activeConnections.add(1);
// ... later
activeConnections.add(-1);
Debugging Common OTel Issues
Traces aren't showing up: Check that your exporter endpoint is reachable. The most common mistake is using localhost:4317 when the Collector is running in a different container. Use the service name (otel-collector:4317) in Docker Compose.
Spans are disconnected: Context propagation is broken somewhere. Check that auto-instrumentation covers the HTTP client library you're using. For async operations, ensure you're not losing context across setTimeout or event emitter boundaries.
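On Node, the OTel context manager is built on AsyncLocalStorage, which is also a useful mental model for why context gets lost: anything that runs outside the storage's run() callback can't see the store. A minimal sketch in plain Node, with no OTel dependency:

```typescript
import { AsyncLocalStorage } from 'node:async_hooks';

// OTel's Node context manager uses AsyncLocalStorage under the hood:
// context.with(ctx, fn) is conceptually als.run(ctx, fn).
const als = new AsyncLocalStorage<{ traceId: string }>();

function currentTraceId(): string | undefined {
  return als.getStore()?.traceId;
}

als.run({ traceId: 'abc123' }, () => {
  // Inside run(), the store is visible -- a span created here would
  // attach to trace abc123, including across awaits started here.
  console.log(currentTraceId());  // → 'abc123'
});

// Outside run(), the store is gone -- a span created here would start
// a brand-new, disconnected trace.
console.log(currentTraceId());  // → undefined
```

Callbacks registered outside the active context (a listener attached at module load, a queue consumer, a raw setTimeout captured earlier) run with no store, which is exactly when you see orphaned spans.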
High memory usage in Collector: The batch processor buffers spans before exporting. Reduce send_batch_size or increase export frequency. Add the memory_limiter processor (you should always have this).
Missing attributes: Semantic conventions define standard attribute names. Use http.request.method instead of method or httpMethod. Consistent naming lets your observability backend correlate data across services.
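To illustrate why this matters: if each service invents its own key, the backend can't aggregate across them. A hypothetical normalization shim -- the legacy keys on the left are made up for this example, while the targets are the stable OTel HTTP semantic conventions:

```typescript
// Map ad-hoc attribute keys to their semantic-convention equivalents.
// The legacy keys here are hypothetical; the target names come from the
// stable OTel HTTP semantic conventions.
const SEMCONV_ALIASES: Record<string, string> = {
  method: 'http.request.method',
  httpMethod: 'http.request.method',
  status: 'http.response.status_code',
  url: 'url.full',
};

function normalizeAttributes(attrs: Record<string, unknown>): Record<string, unknown> {
  const out: Record<string, unknown> = {};
  for (const [key, value] of Object.entries(attrs)) {
    out[SEMCONV_ALIASES[key] ?? key] = value;  // unknown keys pass through
  }
  return out;
}

normalizeAttributes({ httpMethod: 'GET', status: 200 });
// → { 'http.request.method': 'GET', 'http.response.status_code': 200 }
```

In practice, do this kind of cleanup in the Collector (the transform processor) rather than in every service; better still, use the constants from @opentelemetry/semantic-conventions so the wrong names never get emitted.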
Where to Start
If you're adding observability to an existing system, start here:
- Deploy the OTel Collector as a central telemetry pipeline. Even if you only have one service, this separates your app from your backend choice.
- Add auto-instrumentation to your most critical service. This gives you HTTP and database traces with minimal effort.
- Export to Jaeger (open source, free) for local development and initial exploration.
- Add manual spans around business logic that auto-instrumentation doesn't cover.
- Implement tail-based sampling before going to production -- you want all error traces but don't need every successful health check.
The investment in instrumentation pays for itself the first time you debug a cross-service latency issue in minutes instead of hours.