Log Management Tools: ELK Stack, Loki, and Cloud Alternatives
Logs are the most expensive part of most observability stacks. Not because the tools are expensive (though they can be), but because logs are verbose by nature. Every HTTP request, every database query, every error, every debug statement -- they all generate log lines. At scale, you are storing and indexing terabytes of text data, most of which nobody will ever read.
The challenge of log management is not "how do I collect logs" -- that part is straightforward. The challenge is "how do I make logs searchable without going bankrupt, retain them long enough to be useful, and actually find the needle in the haystack when something breaks at 3 AM."
This guide compares the major log management solutions and covers the practices that matter more than any specific tool.
Structured Logging: The Foundation
Before choosing a log management tool, fix your logging. Unstructured log lines are the single biggest source of log management pain:
# Bad: unstructured log line
[2026-02-09 14:32:01] ERROR: Failed to process order 12345 for user [email protected] - connection timeout after 30s
# Good: structured log (JSON)
{"timestamp":"2026-02-09T14:32:01.234Z","level":"error","message":"Failed to process order","order_id":"12345","user_email":"[email protected]","error":"connection timeout","timeout_seconds":30,"service":"order-processor","trace_id":"abc123"}
Structured logs are machine-parseable. You can filter by order_id, aggregate by error type, correlate by trace_id, and alert on level: error without writing regex. Every log management tool works better with structured logs.
Implementing Structured Logging
In Node.js/TypeScript, use pino (fastest) or winston (most popular):
import express from "express";
import pino from "pino";
const logger = pino({
level: process.env.LOG_LEVEL || "info",
formatters: {
level: (label) => ({ level: label }),
},
timestamp: pino.stdTimeFunctions.isoTime,
});
const app = express();
app.use(express.json()); // parse JSON bodies so req.body is available in the handlers

// Use child loggers for request context
app.use((req, res, next) => {
req.log = logger.child({
request_id: req.headers["x-request-id"],
method: req.method,
path: req.path,
user_id: req.user?.id,
});
next();
});
// Structured logging in handlers
app.post("/orders", async (req, res) => {
req.log.info({ order_total: req.body.total }, "Processing order");
try {
const order = await processOrder(req.body);
req.log.info({ order_id: order.id }, "Order processed successfully");
res.json(order);
} catch (err) {
req.log.error({ err, order_data: req.body }, "Order processing failed");
res.status(500).json({ error: "Internal server error" });
}
});
In Python, use structlog:
import structlog
structlog.configure(
processors=[
structlog.processors.TimeStamper(fmt="iso"),
structlog.processors.add_log_level,
structlog.processors.JSONRenderer(),
]
)
logger = structlog.get_logger()
def process_order(order_id: str, user_id: str):
log = logger.bind(order_id=order_id, user_id=user_id)
log.info("processing_order")
try:
result = do_process(order_id)
log.info("order_processed", total=result.total)
except Exception as e:
log.error("order_failed", error=str(e))
raise
Log Levels: Use Them Consistently
Define what each level means for your team and enforce it:
- ERROR: Something is broken. A user-visible failure occurred. This should trigger an alert.
- WARN: Something unexpected happened, but the system handled it. Worth investigating if frequent.
- INFO: Normal operations. Request processed, job completed, service started. The "audit trail" level.
- DEBUG: Detailed diagnostic information. Disabled in production by default. Enabled per-service when investigating issues.
A common mistake is logging too much at INFO level. If your INFO logs generate more than a few hundred lines per minute per service, you are probably logging things that should be DEBUG.
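As a rough sketch of how those definitions translate into code, using the same pino setup shown earlier (the specific events and fields are illustrative):

import pino from "pino";

const logger = pino({ level: process.env.LOG_LEVEL || "info" });

// ERROR: user-visible failure, should trigger an alert
logger.error({ err: new Error("card declined"), order_id: "12345" }, "Payment capture failed");

// WARN: unexpected but handled; investigate if it becomes frequent
logger.warn({ retry_attempt: 2, upstream: "inventory-service" }, "Retrying after upstream timeout");

// INFO: normal lifecycle events, the audit trail
logger.info({ port: 8080 }, "Service started");

// DEBUG: diagnostic detail, disabled in production by default
logger.debug({ cache_key: "user:42", hit: false }, "Cache miss");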
The ELK Stack
ELK -- Elasticsearch, Logstash, Kibana -- is the classic self-hosted log management stack. Elasticsearch indexes and stores logs. Logstash (or Filebeat/Fluentd) collects and ships them. Kibana provides the search UI and dashboards.
Architecture
Application -> Filebeat (collection) -> Logstash (optional transforms) -> Elasticsearch (storage + index) -> Kibana (search UI)
In modern ELK deployments, Filebeat replaces Logstash for log collection (it is lighter and more reliable), and Logstash is used only when you need complex log transformations.
Setup with Docker Compose
version: "3.8"
services:
elasticsearch:
image: docker.elastic.co/elasticsearch/elasticsearch:8.12.0
environment:
- discovery.type=single-node
- xpack.security.enabled=false
- "ES_JAVA_OPTS=-Xms1g -Xmx1g"
volumes:
- es-data:/usr/share/elasticsearch/data
ports:
- "9200:9200"
kibana:
image: docker.elastic.co/kibana/kibana:8.12.0
ports:
- "5601:5601"
environment:
- ELASTICSEARCH_HOSTS=http://elasticsearch:9200
filebeat:
image: docker.elastic.co/beats/filebeat:8.12.0
volumes:
- ./filebeat.yml:/usr/share/filebeat/filebeat.yml:ro
- /var/lib/docker/containers:/var/lib/docker/containers:ro
- /var/run/docker.sock:/var/run/docker.sock:ro
depends_on:
- elasticsearch
volumes:
es-data:
# filebeat.yml
filebeat.inputs:
- type: container
paths:
- /var/lib/docker/containers/*/*.log
processors:
- decode_json_fields:
fields: ["message"]
target: ""
overwrite_keys: true
output.elasticsearch:
  hosts: ["elasticsearch:9200"]
  index: "logs-%{+yyyy.MM.dd}"

# A custom index name requires matching template settings,
# and Filebeat's default ILM-managed index must be disabled
setup.template.name: "logs"
setup.template.pattern: "logs-*"
setup.ilm.enabled: false
Strengths
- Full-text search: Elasticsearch is the gold standard for text search. Complex queries across millions of log lines return in milliseconds.
- Kibana: Rich visualization, dashboards, and the Discover view for ad-hoc log exploration.
- Ecosystem: Beats (Filebeat, Metricbeat, Heartbeat) cover every collection scenario.
- Flexibility: Index any JSON structure. No schema required upfront.
Weaknesses
- Resource hungry: Elasticsearch needs significant memory and storage. A production cluster typically requires 3+ nodes with 16+ GB RAM each.
- Operational complexity: Managing Elasticsearch clusters (shard allocation, index lifecycle, JVM tuning) is a specialized skill. Many teams underestimate this.
- Cost at scale: Storage and compute costs grow linearly with log volume. At high volumes, ELK is one of the most expensive self-hosted options.
- License changes: Elastic moved from Apache 2.0 to SSPL/Elastic License, which led to the OpenSearch fork. Know which license you are running.
Cost Management Tips
ELK costs are driven by storage and indexing. Reduce them with:
- Index lifecycle management (ILM): Automatically move old indices to cheaper storage tiers and delete them after retention expires.
- Data streams: Use time-based data streams instead of manually named indices.
- Field mapping limits: Avoid mapping explosions by being deliberate about which fields are indexed.
- Sampling: For high-volume, low-value logs (access logs, health checks), sample 10% instead of indexing everything (see the sketch after the ILM policy below).
// ILM policy: hot for 7 days, warm for 23 days, delete after 30 days
{
"policy": {
"phases": {
"hot": {
"actions": {
"rollover": { "max_size": "50gb", "max_age": "7d" }
}
},
"warm": {
"min_age": "7d",
"actions": {
"shrink": { "number_of_shards": 1 },
"forcemerge": { "max_num_segments": 1 }
}
},
"delete": {
"min_age": "30d",
"actions": { "delete": {} }
}
}
}
}
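Sampling is usually done in the shipper or an ingest pipeline, but the idea is simple enough to sketch at the application level. Here is a minimal sketch with pino, assuming a hypothetical LOG_SAMPLE_RATE environment variable; it keeps every error while emitting only a fraction of a high-volume, low-value category such as health-check logs:

import pino from "pino";

const logger = pino({ level: "info" });

// Fraction of low-value logs to keep; 0.1 means roughly 10%
const SAMPLE_RATE = Number(process.env.LOG_SAMPLE_RATE ?? "0.1");

// Emit low-value events only some of the time, and record the sample rate
// so counts can be scaled back up during analysis.
function infoSampled(fields: Record<string, unknown>, message: string): void {
  if (Math.random() < SAMPLE_RATE) {
    logger.info({ ...fields, sampled: true, sample_rate: SAMPLE_RATE }, message);
  }
}

// High-volume, low-value: sampled
infoSampled({ path: "/healthz", status: 200 }, "health check ok");

// Errors are never sampled away
logger.error({ path: "/orders", status: 500 }, "order processing failed");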
Grafana Loki
Loki takes a fundamentally different approach to log management. Where Elasticsearch indexes the full content of every log line, Loki indexes only the metadata (labels) and stores log content as compressed chunks. This makes it dramatically cheaper to run but with different query trade-offs.
The Key Insight
Loki's philosophy is "like Prometheus, but for logs." Instead of full-text indexing, it uses labels to identify log streams:
{service="api-server", environment="production", level="error"}
When you query Loki, it first narrows down to the relevant log streams using labels, then does a brute-force search through the compressed log content. This is slower than Elasticsearch for arbitrary text search but fast enough for most operational use cases -- and vastly cheaper.
Setup
Loki integrates naturally with the Grafana stack:
# docker-compose.yml
version: "3.8"
services:
loki:
image: grafana/loki:2.9.0
ports:
- "3100:3100"
volumes:
- loki-data:/loki
- ./loki-config.yaml:/etc/loki/local-config.yaml
command: -config.file=/etc/loki/local-config.yaml
promtail:
image: grafana/promtail:2.9.0
volumes:
- /var/log:/var/log
- ./promtail-config.yaml:/etc/promtail/config.yml
command: -config.file=/etc/promtail/config.yml
grafana:
image: grafana/grafana:10.3.0
ports:
- "3000:3000"
environment:
- GF_AUTH_ANONYMOUS_ENABLED=true
- GF_AUTH_ANONYMOUS_ORG_ROLE=Admin
volumes:
loki-data:
# loki-config.yaml
auth_enabled: false
server:
http_listen_port: 3100
common:
path_prefix: /loki
storage:
filesystem:
chunks_directory: /loki/chunks
rules_directory: /loki/rules
replication_factor: 1
ring:
kvstore:
store: inmemory
schema_config:
configs:
- from: 2020-10-24
store: tsdb
object_store: filesystem
schema: v13
index:
prefix: index_
period: 24h
limits_config:
retention_period: 30d
compactor:
working_directory: /loki/retention
retention_enabled: true
LogQL: Loki's Query Language
LogQL looks like PromQL with log-specific extensions:
# Find error logs from the API server
{service="api-server"} |= "error"
# Parse JSON logs and filter by field
{service="api-server"} | json | level="error" | order_id != ""
# Count errors per service over the last hour
sum by (service) (count_over_time({level="error"}[1h]))
# Find slow requests (parse duration from structured logs)
{service="api-server"} | json | duration > 5s
# Top 10 error messages over the last hour
topk(10, sum by (message) (count_over_time({service="api-server"} | json | level="error" [1h])))
# Pattern matching for unstructured logs
{service="legacy-app"} |~ "timeout|connection refused|ECONNRESET"
Strengths
- Cost-effective: 10-100x cheaper than Elasticsearch for the same log volume because it does not index full text.
- Simple to operate: Single binary, minimal configuration. No JVM tuning, no shard management.
- Grafana integration: Logs, metrics, and traces in the same UI. Click from a metric spike to the relevant logs.
- Object storage: Production Loki stores chunks in S3/GCS/MinIO, which is extremely cheap.
Weaknesses
- Slower ad-hoc search: Searching for an arbitrary string across all logs is slower than Elasticsearch because Loki must scan compressed chunks.
- Label cardinality limits: Too many unique label values (like user IDs as labels) will kill performance. This is the most common mistake with Loki; see the sketch after this list.
- Less mature: Fewer features than Elasticsearch, smaller community, fewer integrations.
- No full-text search index: If your primary use case is "search for this string across all logs from all services from the last 30 days," Elasticsearch is faster.
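A common way to stay inside those cardinality limits is to keep stream labels to a small, fixed set (service, environment, level) and put high-cardinality values such as user or request IDs inside the structured log body, where LogQL's | json parser can still filter on them. A minimal sketch with pino, assuming the shipping agent (for example Promtail) attaches the stream labels:

import pino from "pino";

// Low-cardinality identity of this service; a shipper such as Promtail
// would typically attach service/environment as Loki stream labels.
const logger = pino({
  base: { service: "api-server", environment: "production" },
});

// High-cardinality values stay in the log body, not in labels.
// They remain queryable: {service="api-server"} | json | user_id="42"
logger.info(
  { user_id: "42", request_id: "req-8f3a", order_id: "12345" },
  "Order processed"
);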
Datadog Logs
Datadog is the dominant SaaS observability platform. Datadog Logs provides log collection, indexing, search, and alerting as a managed service with deep integration into the rest of the Datadog platform (APM, metrics, infrastructure monitoring).
Setup
Datadog uses an agent for log collection:
# datadog-agent config: /etc/datadog-agent/datadog.yaml
api_key: YOUR_API_KEY
logs_enabled: true
# /etc/datadog-agent/conf.d/myapp.d/conf.yaml
logs:
- type: file
path: /var/log/myapp/*.log
service: myapp
source: nodejs
sourcecategory: application
Or pipe logs directly from your application:
import { createLogger, format, transports } from "winston";
const logger = createLogger({
format: format.json(),
transports: [
new transports.Http({
host: "http-intake.logs.datadoghq.com",
path: `/api/v2/logs?dd-api-key=${process.env.DD_API_KEY}`,
ssl: true,
}),
],
});
Strengths
- No infrastructure to manage: Datadog handles storage, indexing, scaling, and retention.
- Unified platform: Logs, APM traces, metrics, and dashboards in one place. Correlate a log line with the trace that produced it and the host metrics at that time.
- Log patterns: Datadog automatically clusters similar log lines into patterns, helping you spot anomalies.
- Flexible retention: Rehydrate archived logs on demand. Store everything cheaply in archives, index only what you need.
Weaknesses
- Expensive: Datadog charges separately for ingestion (per GB) and for indexing (per million log events, priced by retention period). At scale (100+ GB/day), costs can be eye-watering -- for typical log sizes the all-in cost often works out to a few dollars per indexed GB, before longer retention.
- Vendor lock-in: Once your dashboards, alerts, and workflows are in Datadog, switching is painful.
- Complexity: The platform has so many features that configuration and cost optimization become full-time concerns.
Cost Management
Datadog's pricing model rewards careful log management:
- Exclusion filters: Drop noisy, low-value logs before they are indexed. Health check logs, debug logs in production, and known-benign errors are common candidates.
- Custom pipelines: Parse and enrich logs at ingestion time. Extract structured fields so you can filter and aggregate efficiently.
- Index management: Create multiple indexes with different retention periods. High-value logs (errors, security events) get 30-day retention. Low-value logs (access logs) get 3-day retention.
- Log archives: Send all logs to S3/GCS for long-term storage at object storage prices. Rehydrate specific time ranges when you need them.
Cloud-Native Options
AWS CloudWatch Logs
CloudWatch Logs is the default for AWS workloads. Lambda, ECS, EKS, and EC2 all ship logs to CloudWatch with minimal configuration.
Pros: Zero setup for AWS services. Integrated with CloudWatch Alarms and dashboards. Logs Insights provides a purpose-built, pipe-based query language. Pay per GB ingested ($0.50/GB) and stored ($0.03/GB/month).
Cons: The query language is limited compared to Elasticsearch or LogQL. The UI is functional but not pleasant. Cross-account and cross-region log aggregation is cumbersome. CloudWatch Logs Insights queries can be slow on large datasets.
Best for: AWS-heavy teams that want simplicity and do not need advanced search. Good enough for most small-to-medium workloads.
Google Cloud Logging
Cloud Logging (formerly Stackdriver) integrates deeply with GCP services. Logs are automatically collected from GKE, Cloud Run, Cloud Functions, and Compute Engine.
Pros: Automatic collection from GCP services. Powerful query syntax. Log-based metrics (turn log patterns into time-series metrics). Integrated with Cloud Monitoring for alerting.
Cons: Pricing is complex (free allotment, then per-GB). The UI can be slow. Log routing and exclusion configuration is not intuitive.
Azure Monitor Logs
Azure's log management is built on Log Analytics workspaces and uses KQL (Kusto Query Language) for queries.
Pros: Deep Azure integration. KQL is a genuinely powerful query language. Application Insights for application-level logging and tracing.
Cons: KQL has a steep learning curve. Pricing is per-GB ingested. The portal experience is cluttered.
Logs vs. Metrics vs. Traces: When to Use What
Logs are not always the answer. Many things people log should be metrics or traces instead.
Use metrics when you want to track a number over time. "How many requests per second?" "What is the 95th percentile latency?" "How much memory is the service using?" Metrics are cheap to store, fast to query, and ideal for dashboards and alerts.
Use traces when you want to understand the flow of a single request through multiple services. "Why was this request slow?" "Which downstream service caused the timeout?" Traces are request-scoped and show you the call graph.
Use logs when you need the full context of a specific event. "What was the exact error message?" "What was the request payload that caused the failure?" "What happened in the 30 seconds before the crash?" Logs are event-scoped and provide the detail that metrics and traces cannot.
The most common mistake is logging metrics. Do not write logger.info("Request took 234ms") -- emit a histogram metric instead. Do not write logger.info("Queue depth: 42") -- expose that as a gauge. Logs that contain numbers you want to aggregate over time should be metrics.
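As one illustration, using the prom-client library for Node.js (one option among several; the metric names here are made up), the request duration becomes a histogram and the queue depth becomes a gauge:

import client from "prom-client";

// Instead of logger.info("Request took 234ms"): a latency histogram
const requestDuration = new client.Histogram({
  name: "http_request_duration_seconds",
  help: "HTTP request duration in seconds",
  labelNames: ["route", "method"],
  buckets: [0.05, 0.1, 0.25, 0.5, 1, 2.5, 5],
});
requestDuration.observe({ route: "/orders", method: "POST" }, 0.234);

// Instead of logger.info("Queue depth: 42"): a gauge
const queueDepth = new client.Gauge({
  name: "job_queue_depth",
  help: "Number of jobs waiting in the queue",
});
queueDepth.set(42);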
Choosing a Solution
ELK Stack: Choose when you need powerful full-text search, have a team that can operate Elasticsearch, and want to self-host. Best for medium-to-large teams with dedicated platform engineering.
Grafana Loki: Choose when you want cost-effective log management, already use Grafana for metrics, and can live with label-based (rather than full-text) search. Best for teams that prioritize operational simplicity and cost.
Datadog Logs: Choose when you want a fully managed solution, are already using Datadog for other observability, and have the budget. Best for teams that value integration and are willing to pay for convenience.
Cloud-native (CloudWatch/Cloud Logging/Azure Monitor): Choose when you are all-in on a single cloud provider and want zero-setup log collection. Best for small teams and simple architectures.
Regardless of which tool you choose, the practices matter more: structure your logs as JSON, use log levels consistently, set retention policies aggressively, and always ask "should this be a metric instead?" before adding a new log line. The cheapest log is the one you never generate.