
Observability for Developers: Logging, Metrics, and Tracing

CI/CD · 2026-02-09 · 6 min read · observability · logging · metrics · grafana · prometheus · opentelemetry


"Add more logging" is the default response when something breaks in production. But unstructured log lines piped to stdout are the least useful form of observability. Modern applications need structured logs, time-series metrics, and distributed traces -- the three pillars -- working together to answer what went wrong and why.

This guide covers the tools and practices that matter for application developers, not platform teams running Kubernetes clusters.

The Three Pillars

Logs are timestamped records of discrete events. They tell you what happened. A request failed, a user signed up, a background job timed out.

Metrics are numeric measurements collected over time. They tell you how the system is performing. Request latency at the 95th percentile, error rate per endpoint, memory usage per service. Metrics are cheap to store and fast to query.

Traces follow a single request as it moves through multiple services. They tell you where time is being spent. A trace is a tree of spans, where each span represents a unit of work (an HTTP request, a database query, a cache lookup).

You need all three. Logs without metrics mean you can't spot trends. Metrics without traces mean you can't diagnose root causes. Traces without logs mean you can't understand the business context of a failure.
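
To make the distinction concrete, here is the same failed checkout request as each pillar records it (all names and numbers below are illustrative):

# Log -- what happened
{"level":"error","msg":"checkout failed","orderId":"o_123","traceId":"a1b2c3"}

# Metric -- how often and how slow
http_requests_total{route="/checkout",status="500"} 27

# Trace -- where the time went
POST /checkout (812 ms)
  ├─ SELECT orders (14 ms)
  └─ POST payments/charge (790 ms, error)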

The Grafana + Prometheus Stack

Prometheus collects and stores metrics. Grafana visualizes them. Together they form the backbone of most self-hosted observability setups.

Prometheus

Prometheus is a pull-based metrics system. Your application exposes a /metrics endpoint, and Prometheus scrapes it on a schedule. Instrument a Node.js application with prom-client:

import express from "express";
import { Registry, Counter, Histogram, collectDefaultMetrics } from "prom-client";

const register = new Registry();
collectDefaultMetrics({ register });

const httpRequestsTotal = new Counter({
  name: "http_requests_total",
  help: "Total HTTP requests",
  labelNames: ["method", "route", "status"],
  registers: [register],
});

const httpRequestDuration = new Histogram({
  name: "http_request_duration_seconds",
  help: "HTTP request duration in seconds",
  labelNames: ["method", "route"],
  buckets: [0.01, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5],
  registers: [register],
});

const app = express();

// Record a duration and a count for every request once the response finishes
app.use((req, res, next) => {
  const end = httpRequestDuration.startTimer({ method: req.method, route: req.path });
  res.on("finish", () => {
    httpRequestsTotal.inc({ method: req.method, route: req.path, status: res.statusCode });
    end();
  });
  next();
});

// Prometheus scrapes this endpoint on its configured interval
app.get("/metrics", async (req, res) => {
  res.set("Content-Type", register.contentType);
  res.end(await register.metrics());
});
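
On the Prometheus server side, a scrape job tells it where to pull from and how often. A minimal prometheus.yml sketch, assuming the service above is reachable at my-api:3000 (job name, target, and interval are placeholders to adapt):

scrape_configs:
  - job_name: "my-api"
    scrape_interval: 15s        # how often Prometheus pulls /metrics
    metrics_path: /metrics      # the default, shown for clarity
    static_configs:
      - targets: ["my-api:3000"]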

Grafana Dashboards

Grafana connects to Prometheus as a data source. Here is a trimmed dashboard config covering three of the four golden signals -- request rate, error rate, and latency (saturation is handled separately below):

{
  "dashboard": {
    "title": "Web Service Overview",
    "panels": [
      {
        "title": "Request Rate (req/s)",
        "type": "timeseries",
        "targets": [{ "expr": "rate(http_requests_total[5m])" }]
      },
      {
        "title": "Latency p95",
        "type": "timeseries",
        "targets": [{ "expr": "histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))" }]
      },
      {
        "title": "Error Rate (%)",
        "type": "stat",
        "targets": [{ "expr": "sum(rate(http_requests_total{status=~\"5..\"}[5m])) / sum(rate(http_requests_total[5m])) * 100" }]
      }
    ]
  }
}
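
Saturation, the fourth golden signal, has no single canonical query. For a Node.js service, one option is to panel the runtime metrics that collectDefaultMetrics already exposes -- a sketch of two extra panels to append to the panels array above, assuming the default prom-client metric names (they can differ slightly between versions):

{
  "title": "Saturation: event loop lag (s)",
  "type": "timeseries",
  "targets": [{ "expr": "nodejs_eventloop_lag_seconds" }]
},
{
  "title": "Saturation: CPU (cores)",
  "type": "timeseries",
  "targets": [{ "expr": "rate(process_cpu_seconds_total[5m])" }]
}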

OpenTelemetry

OpenTelemetry (OTel) is a vendor-neutral standard for collecting traces, metrics, and logs. It replaces the older OpenTracing and OpenCensus projects. The key insight: you instrument once with OTel and export to any compatible backend -- Jaeger, Grafana Tempo, Datadog, Honeycomb, or any OTLP-compatible service. No vendor lock-in.

Instrumenting a Node.js Application

npm install @opentelemetry/sdk-node @opentelemetry/auto-instrumentations-node \
  @opentelemetry/exporter-trace-otlp-http @opentelemetry/exporter-metrics-otlp-http

// tracing.ts -- load before your app with: node --import ./tracing.js dist/index.js
import { NodeSDK } from "@opentelemetry/sdk-node";
import { getNodeAutoInstrumentations } from "@opentelemetry/auto-instrumentations-node";
import { OTLPTraceExporter } from "@opentelemetry/exporter-trace-otlp-http";
import { OTLPMetricExporter } from "@opentelemetry/exporter-metrics-otlp-http";
import { PeriodicExportingMetricReader } from "@opentelemetry/sdk-metrics";

const sdk = new NodeSDK({
  serviceName: "my-api",
  traceExporter: new OTLPTraceExporter({ url: "http://otel-collector:4318/v1/traces" }),
  metricReader: new PeriodicExportingMetricReader({
    exporter: new OTLPMetricExporter({ url: "http://otel-collector:4318/v1/metrics" }),
    exportIntervalMillis: 30000,
  }),
  instrumentations: [
    getNodeAutoInstrumentations({ "@opentelemetry/instrumentation-fs": { enabled: false } }),
  ],
});

sdk.start();
process.on("SIGTERM", () => sdk.shutdown());

The auto-instrumentation library automatically traces HTTP requests, database queries (pg, mysql, mongodb), Redis calls, and gRPC -- without changing application code.
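
The exporter URLs above point at an OpenTelemetry Collector rather than straight at a backend, so you can change backends without touching application code. A minimal collector config sketch for the traces path, assuming a Tempo instance at tempo:4317 (endpoint and exporter choice are illustrative):

receivers:
  otlp:
    protocols:
      http:    # the SDK above sends OTLP over HTTP on port 4318
      grpc:

processors:
  batch: {}

exporters:
  otlp:
    endpoint: tempo:4317
    tls:
      insecure: true

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp]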

Adding Custom Spans

import { trace, SpanStatusCode } from "@opentelemetry/api";

const tracer = trace.getTracer("my-api");

async function processOrder(orderId: string) {
  return tracer.startActiveSpan("process-order", async (span) => {
    span.setAttribute("order.id", orderId);
    try {
      const result = await validatePayment(orderId);
      span.setAttribute("payment.status", result.status);
      return result;
    } catch (error) {
      span.recordException(error as Error);
      span.setStatus({ code: SpanStatusCode.ERROR, message: (error as Error).message });
      throw error;
    } finally {
      span.end();
    }
  });
}

Structured Logging

Structured logging means writing logs as JSON objects instead of formatted strings. This makes logs machine-parseable, searchable, and correlatable with traces.

Pino (Recommended)

Pino is the fastest Node.js logger. It outputs JSON by default and has minimal overhead.

import pino from "pino";
import { trace } from "@opentelemetry/api";

const logger = pino({
  level: process.env.LOG_LEVEL || "info",
  // Correlate logs with OpenTelemetry traces
  mixin() {
    const span = trace.getActiveSpan();
    if (span) {
      const ctx = span.spanContext();
      return { traceId: ctx.traceId, spanId: ctx.spanId };
    }
    return {};
  },
});

logger.info({ userId: "abc123", action: "login" }, "User logged in");
// Inside a catch block -- pino's default `err` serializer expands the error object
logger.error({ err, orderId: "xyz789" }, "Payment processing failed");
// Output (abbreviated): {"level":30,"traceId":"a1b2c3d4...","spanId":"e5f6...","userId":"abc123","msg":"User logged in"}

For local development, pipe through pino-pretty for human-readable output.
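
For example, assuming the dist/index.js entry point used earlier:

node dist/index.js | npx pino-pretty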

Winston

Winston is the most popular Node.js logger -- more configurable than Pino but 5-10x slower in benchmarks.

import winston from "winston";

const logger = winston.createLogger({
  level: "info",
  format: winston.format.combine(winston.format.timestamp(), winston.format.json()),
  defaultMeta: { service: "my-api" },
  transports: [
    new winston.transports.Console(),
    new winston.transports.File({ filename: "error.log", level: "error" }),
  ],
});

Pick Pino unless you specifically need Winston's transport ecosystem for writing logs to multiple destinations. For a hot path like request logging, the performance difference matters.

Error Tracking

Error tracking tools capture exceptions, group them by root cause, and alert you. They're distinct from logging -- an error tracker gives you stack traces, affected user counts, and release regression detection.

Sentry

Sentry is the industry standard. The self-hosted version is free; SaaS starts at $26/month.

import * as Sentry from "@sentry/node";

Sentry.init({
  dsn: "https://[email protected]/0",
  environment: process.env.NODE_ENV,
  release: process.env.GIT_SHA,
  tracesSampleRate: 0.1,
});
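
Sentry's default integrations report uncaught exceptions and unhandled rejections on their own; for errors you catch and handle, capture them explicitly and attach context. A short sketch (chargeCustomer and the tag values are placeholders):

try {
  await chargeCustomer(orderId);
} catch (error) {
  Sentry.captureException(error, { tags: { feature: "checkout" } });
  throw error;
}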

Alternatives

GlitchTip is the main open-source alternative: it is compatible with Sentry's SDKs and lighter to self-host, though with fewer features. Commercial options such as Rollbar and Bugsnag cover similar ground.

Recommendation: Start with Sentry SaaS. Self-host GlitchTip if budget is a hard constraint.

Self-Hosted vs. SaaS

| Concern | Self-Hosted | SaaS |
| --- | --- | --- |
| Cost at small scale | Higher (server costs, maintenance) | Lower (free tiers, pay-as-you-go) |
| Cost at large scale | Lower (fixed infrastructure) | Higher (per-seat, per-event pricing) |
| Setup effort | Significant | Minutes |
| Data retention | You control it | Vendor-determined (often 15-30 days) |
| Data privacy | Full control | Data leaves your network |
| Reliability | You're the on-call team | Vendor handles uptime |

Self-host when: regulatory requirements prevent data leaving your network, volume exceeds 100M events/month, or you have a platform team to maintain the infrastructure.

Use SaaS when: You're a small team, you need to move fast, or your observability spend is under $500/month (the maintenance time costs more than the subscription).

Grafana Cloud's free tier includes 10,000 series for metrics, 50 GB of logs, and 50 GB of traces per month -- enough for a small production service.

Tool Comparison

| Tool | Type | Self-Hosted | SaaS | Best For |
| --- | --- | --- | --- | --- |
| Prometheus | Metrics | Yes | Grafana Cloud | Time-series metrics collection |
| Grafana | Visualization | Yes | Grafana Cloud | Dashboards for any data source |
| Loki | Logs | Yes | Grafana Cloud | Log aggregation (pairs with Grafana) |
| Tempo | Traces | Yes | Grafana Cloud | Trace storage (pairs with Grafana) |
| OpenTelemetry | Instrumentation | N/A | N/A | Vendor-neutral instrumentation |
| Jaeger | Traces | Yes | No | Trace visualization and analysis |
| Sentry | Errors | Yes | Yes | Error tracking and alerting |
| Datadog | All-in-one | No | Yes | Full-stack observability (expensive) |
| Honeycomb | Traces | No | Yes | High-cardinality trace analysis |
| New Relic | All-in-one | No | Yes | Full-stack APM (generous free tier) |

A Practical Starting Stack

For a team getting started, adopt in this order:

  1. Structured logging with Pino -- Replace console.log with structured JSON logs. Takes an hour, immediately makes logs searchable.
  2. Sentry for error tracking -- Set up in 10 minutes. Catches exceptions you didn't know were happening.
  3. OpenTelemetry auto-instrumentation -- Distributed tracing and basic metrics without changing application code.
  4. Prometheus + Grafana -- Self-hosted via Docker Compose (a minimal Compose sketch follows this list) or the Grafana Cloud free tier. Dashboard the four golden signals.
  5. Grafana Loki for log aggregation -- Ship structured logs to Loki for search and trace correlation.
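
For step 4 self-hosted, a minimal Docker Compose sketch (image tags, ports, and the prometheus.yml path are assumptions to adapt):

services:
  prometheus:
    image: prom/prometheus:latest
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml   # the scrape config from earlier
    ports:
      - "9090:9090"

  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"   # first login is admin/admin; Grafana prompts for a new password
    depends_on:
      - prometheus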

Skip Datadog and New Relic unless your company is already paying for them. The Grafana stack gives you equivalent functionality with more control.

The Bottom Line

Observability is not optional for production applications. Start with structured logging -- it's the easiest win and the foundation everything else builds on. Add Sentry for error tracking (it pays for itself the first time it catches a bug before users report it). Use OpenTelemetry for instrumentation so you're never locked into a single vendor. Build your dashboards around the four golden signals -- request rate, error rate, latency, and saturation. The Grafana + Prometheus stack is the industry standard for good reason: open source, battle-tested, and scales from a single Docker Compose file to multi-cluster deployments.