Getting Started with OpenTelemetry for Application Tracing

Testing 2026-02-14 · 7 min read opentelemetry tracing observability instrumentation monitoring distributed-systems

Getting Started with OpenTelemetry for Application Tracing

OpenTelemetry trace visualization showing distributed request flow

A user reports that the checkout page is slow. You check the application logs -- nothing unusual. You check the database metrics -- queries are fast. You check the server CPU -- normal. The problem is somewhere in the chain of service calls between the user's click and the response, but you have no way to see that chain. You are debugging a distributed system with single-service tools.

OpenTelemetry (OTel) solves this. It is a vendor-neutral standard for collecting traces, metrics, and logs from your applications. Traces show you the full path of a request across services, with timing for each step. Metrics give you aggregate measurements (request count, error rate, latency percentiles). Logs provide event-level detail. Together, they give you the observability you need to understand what your system is actually doing.

OpenTelemetry is not a monitoring platform. It is the instrumentation layer -- the data collection part. You send the data to whatever backend you choose: Jaeger, Grafana Tempo, Honeycomb, Datadog, New Relic, or any other platform that accepts the OpenTelemetry Protocol (OTLP).

Core Concepts

Traces and Spans

A trace represents a single request's journey through your system. It has a unique trace ID that follows the request across service boundaries.

A span is a single operation within a trace. A trace is a tree of spans. For example, a checkout request might create this trace:

checkout-request (trace root, 450ms)
├── validate-cart (50ms)
├── charge-payment (300ms)
│   ├── create-payment-intent (200ms)
│   └── confirm-payment (80ms)
├── update-inventory (60ms)
└── send-confirmation-email (40ms)

Each span has a name, start time, duration, status, and optional attributes (key-value pairs with additional context).

Metrics

Metrics are aggregate numerical measurements:

Counter: Monotonically increasing value (total requests, total errors)
Histogram: Distribution of values (request duration, response size)
Gauge: Current value that can go up or down (active connections, queue depth)

Logs

OpenTelemetry logs are structured log events correlated with traces. When a log line is emitted during a traced request, it includes the trace ID and span ID, letting you jump from a log entry to the full trace.

Setting Up: Node.js / TypeScript

Install Dependencies

npm install @opentelemetry/sdk-node \
  @opentelemetry/api \
  @opentelemetry/auto-instrumentations-node \
  @opentelemetry/exporter-trace-otlp-http \
  @opentelemetry/exporter-metrics-otlp-http

Initialize the SDK

Create tracing.ts -- this must run before any other imports:

import { NodeSDK } from "@opentelemetry/sdk-node";
import { getNodeAutoInstrumentations } from "@opentelemetry/auto-instrumentations-node";
import { OTLPTraceExporter } from "@opentelemetry/exporter-trace-otlp-http";
import { OTLPMetricExporter } from "@opentelemetry/exporter-metrics-otlp-http";
import { PeriodicExportingMetricReader } from "@opentelemetry/sdk-metrics";
import { Resource } from "@opentelemetry/resources";
import {
  ATTR_SERVICE_NAME,
  ATTR_SERVICE_VERSION,
} from "@opentelemetry/semantic-conventions";

const sdk = new NodeSDK({
  resource: new Resource({
    [ATTR_SERVICE_NAME]: "checkout-service",
    [ATTR_SERVICE_VERSION]: "1.4.2",
  }),
  traceExporter: new OTLPTraceExporter({
    url: "http://localhost:4318/v1/traces",
  }),
  metricReader: new PeriodicExportingMetricReader({
    exporter: new OTLPMetricExporter({
      url: "http://localhost:4318/v1/metrics",
    }),
    exportIntervalMillis: 15000,
  }),
  instrumentations: [
    getNodeAutoInstrumentations({
      // Disable noisy instrumentations
      "@opentelemetry/instrumentation-fs": { enabled: false },
    }),
  ],
});

sdk.start();

process.on("SIGTERM", () => {
  sdk.shutdown().then(() => process.exit(0));
});

Load Tracing Before Your App

// index.ts
import "./tracing"; // Must be first!
import { app } from "./app";

app.listen(3000, () => {
  console.log("Server running on port 3000");
});

The auto-instrumentations-node package automatically instruments HTTP requests, database calls (pg, mysql, mongodb), Express/Fastify routes, and dozens of other libraries. Without writing any manual instrumentation code, you get traces for every inbound and outbound request.

Adding Custom Spans

Auto-instrumentation covers the framework layer. For application-level visibility, add custom spans:

import { trace } from "@opentelemetry/api";

const tracer = trace.getTracer("checkout-service");

async function processCheckout(cart: Cart, user: User): Promise<Order> {
  return tracer.startActiveSpan("process-checkout", async (span) => {
    try {
      span.setAttribute("user.id", user.id);
      span.setAttribute("cart.item_count", cart.items.length);
      span.setAttribute("cart.total_cents", cart.totalCents);

      // Each of these creates a child span automatically
      // if the called functions also create spans
      const validation = await validateCart(cart);
      const payment = await chargePayment(cart, user);
      await updateInventory(cart);
      const order = await createOrder(cart, user, payment);
      await sendConfirmationEmail(user, order);

      span.setAttribute("order.id", order.id);
      span.setStatus({ code: SpanStatusCode.OK });
      return order;
    } catch (error) {
      span.setStatus({
        code: SpanStatusCode.ERROR,
        message: error.message,
      });
      span.recordException(error);
      throw error;
    } finally {
      span.end();
    }
  });
}

Recording Events

Spans can contain events -- timestamped annotations within the span's lifetime:

async function chargePayment(cart: Cart, user: User): Promise<Payment> {
  return tracer.startActiveSpan("charge-payment", async (span) => {
    span.addEvent("payment.initiated", {
      "payment.provider": "stripe",
      "payment.amount_cents": cart.totalCents,
    });

    const result = await stripe.paymentIntents.create({
      amount: cart.totalCents,
      currency: "usd",
    });

    span.addEvent("payment.completed", {
      "payment.intent_id": result.id,
      "payment.status": result.status,
    });

    span.end();
    return result;
  });
}

Custom Metrics

import { metrics } from "@opentelemetry/api";

const meter = metrics.getMeter("checkout-service");

// Counter: total checkouts
const checkoutCounter = meter.createCounter("checkouts.total", {
  description: "Total checkout attempts",
});

// Histogram: checkout duration
const checkoutDuration = meter.createHistogram("checkouts.duration_ms", {
  description: "Checkout processing time in milliseconds",
  unit: "ms",
});

// Gauge: active checkouts
const activeCheckouts = meter.createUpDownCounter("checkouts.active", {
  description: "Currently processing checkouts",
});

async function processCheckout(cart: Cart): Promise<Order> {
  const startTime = Date.now();
  activeCheckouts.add(1);

  try {
    const order = await doCheckout(cart);
    checkoutCounter.add(1, { status: "success" });
    return order;
  } catch (error) {
    checkoutCounter.add(1, { status: "error", error_type: error.name });
    throw error;
  } finally {
    activeCheckouts.add(-1);
    checkoutDuration.record(Date.now() - startTime);
  }
}

The OpenTelemetry Collector

In production, you do not send telemetry directly from your application to a backend. Instead, you run the OpenTelemetry Collector as a sidecar or standalone service. The Collector receives data from your applications, processes it (batching, filtering, sampling), and exports it to one or more backends.

Why Use a Collector?

Decouple apps from backends: Change your monitoring vendor without redeploying applications.
Process data: Sample high-volume traces, drop noisy spans, add attributes.
Fan out: Send traces to Jaeger AND Honeycomb simultaneously.
Buffer: The Collector handles retries and backpressure so your app does not.

Collector Configuration

# otel-collector-config.yaml
receivers:
  otlp:
    protocols:
      http:
        endpoint: 0.0.0.0:4318
      grpc:
        endpoint: 0.0.0.0:4317

processors:
  batch:
    timeout: 5s
    send_batch_size: 1000
  memory_limiter:
    check_interval: 1s
    limit_mib: 512
  tail_sampling:
    decision_wait: 10s
    policies:
      - name: errors-always
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: slow-requests
        type: latency
        latency:
          threshold_ms: 1000
      - name: sample-rest
        type: probabilistic
        probabilistic:
          sampling_percentage: 10

exporters:
  otlp/jaeger:
    endpoint: jaeger:4317
    tls:
      insecure: true
  otlp/honeycomb:
    endpoint: api.honeycomb.io:443
    headers:
      x-honeycomb-team: ${HONEYCOMB_API_KEY}

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, tail_sampling, batch]
      exporters: [otlp/jaeger, otlp/honeycomb]
    metrics:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlp/jaeger]

Run with Docker Compose

# docker-compose.yaml
services:
  otel-collector:
    image: otel/opentelemetry-collector-contrib:latest
    command: ["--config", "/etc/otel/config.yaml"]
    volumes:
      - ./otel-collector-config.yaml:/etc/otel/config.yaml
    ports:
      - "4317:4317"  # OTLP gRPC
      - "4318:4318"  # OTLP HTTP

  jaeger:
    image: jaegertracing/all-in-one:latest
    ports:
      - "16686:16686"  # Jaeger UI
      - "4317"         # OTLP gRPC (internal)

Start everything with docker compose up, point your application's exporter to http://localhost:4318, and open http://localhost:16686 to see traces in Jaeger.

OpenTelemetry Collector pipeline showing receivers, processors, and exporters

Context Propagation

For traces to work across service boundaries, the trace ID must travel with the request. This is called context propagation. OpenTelemetry uses the W3C Trace Context standard by default.

When your instrumented service makes an HTTP request, the SDK automatically adds these headers:

traceparent: 00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01
tracestate: vendor=value

The receiving service reads these headers and continues the trace. If both services use OpenTelemetry auto-instrumentation, this happens automatically.

For message queues and async workflows, you need to propagate context manually:

import { context, propagation } from "@opentelemetry/api";

// When publishing a message
function publishMessage(queue: string, payload: unknown) {
  const carrier: Record<string, string> = {};
  propagation.inject(context.active(), carrier);

  queue.publish({
    body: JSON.stringify(payload),
    headers: carrier, // Trace context travels with the message
  });
}

// When consuming a message
function consumeMessage(message: QueueMessage) {
  const parentContext = propagation.extract(
    context.active(),
    message.headers
  );

  context.with(parentContext, () => {
    tracer.startActiveSpan("process-message", (span) => {
      // This span is a child of the publishing span
      processPayload(message.body);
      span.end();
    });
  });
}

Sampling Strategies

In production, tracing every request generates enormous data volumes. Sampling reduces this to a manageable level.

Head sampling: Decide at the start of a trace whether to record it. Simple and low-overhead, but you might miss interesting traces.

import { TraceIdRatioBasedSampler } from "@opentelemetry/sdk-trace-node";

// Sample 10% of traces
const sampler = new TraceIdRatioBasedSampler(0.1);

Tail sampling (via Collector): Decide after the trace completes. This lets you keep 100% of error traces and slow traces while sampling routine traffic. Requires the Collector to buffer traces temporarily.

Adaptive sampling: Adjust the sampling rate based on traffic volume. High-traffic services sample less; low-traffic services sample everything.

The practical recommendation: start with 100% sampling in development and staging. In production, use tail sampling via the Collector to keep all errors and slow requests, and probabilistically sample 5-20% of everything else.

Choosing a Backend

OpenTelemetry's vendor-neutral design means you pick the backend that fits your needs:

Jaeger (free, self-hosted): Good for teams that want full control. Stores traces in Elasticsearch, Cassandra, or Badger. The UI is functional but basic. Best for teams already running Elasticsearch.

Grafana Tempo (free, self-hosted): Trace backend that integrates with the Grafana stack. If you already use Grafana for dashboards, Tempo is the natural choice. Uses object storage (S3, GCS) which keeps costs low at scale.

Honeycomb (SaaS): The best query experience for traces. Their "BubbleUp" feature automatically surfaces anomalies. Expensive at scale but exceptional for debugging complex distributed systems.

Datadog/New Relic (SaaS): Full-platform observability with traces, metrics, logs, and more. The advantage is everything in one place. The disadvantage is cost and vendor lock-in.

SigNoz (free, self-hosted or SaaS): Open-source alternative to Datadog. Supports traces, metrics, and logs with a unified UI. Good middle ground between Jaeger and commercial platforms.

Common Pitfalls

Forgetting to end spans: An unended span leaks memory and produces malformed traces. Always use try/finally or the callback form of startActiveSpan.

Too many spans: Instrumenting every function creates noise. Focus on operations that cross boundaries (HTTP calls, database queries, message queue operations) and significant business operations (checkout, payment, notification).

Missing service.name: Without a service.name resource attribute, your traces show up as "unknown_service" in every backend. Always set it.

Not propagating context through async boundaries: If you use worker threads, message queues, or setTimeout, the trace context is lost unless you explicitly propagate it.

Ignoring the Collector: Sending telemetry directly from applications to backends works in development but creates tight coupling in production. Use the Collector.

Conclusion

OpenTelemetry is the standard for application instrumentation. It replaced a fragmented landscape of vendor-specific agents (Datadog APM, New Relic agents, Jaeger clients) with a single, vendor-neutral SDK that works with every backend. The investment you make in OpenTelemetry instrumentation is portable -- switch backends without changing application code.

Start with auto-instrumentation, add custom spans to your critical business paths, run the Collector for processing and routing, and pick a backend that fits your budget. You will wonder how you ever debugged distributed systems without traces.