Developer Metrics and Engineering Intelligence Tools
Measuring developer productivity is a minefield. Get it right and you have early warning signals for systemic problems, evidence for resourcing decisions, and data to improve your engineering processes. Get it wrong and you have a surveillance system that destroys trust, incentivizes gaming, and measures activity instead of outcomes.
The core principle: measure the system, not the individuals. Good metrics tell you whether your engineering organization is healthy and improving. They do not tell you which developer is "performing" and which is "slacking." If your metrics can be used to rank individual developers, you are measuring the wrong things.
DORA Metrics: The Gold Standard
The DORA (DevOps Research and Assessment) metrics come from years of research by the team behind the "Accelerate" book and the annual State of DevOps reports. They measure software delivery performance through four metrics that correlate with organizational performance:
Deployment Frequency
How often does your team deploy to production? High-performing teams deploy on demand (multiple times per day). Low performers deploy between once per month and once every six months.
This metric reflects your CI/CD pipeline maturity, your confidence in automated testing, and your ability to ship small changes. Low deployment frequency usually means large, risky deployments -- which means more things break and take longer to fix.
Lead Time for Changes
How long does it take from code commit to code running in production? High performers measure this in under a day. Low performers measure it in months.
Lead time includes everything: code review wait time, CI pipeline duration, approval gates, staging validation, and deployment mechanics. A long lead time usually points to bottlenecks in code review (PRs sitting for days), slow CI (30-minute test suites), or heavyweight approval processes.
Change Failure Rate
What percentage of deployments cause a failure in production? High performers have a change failure rate of 0-15%. Low performers are above 46%.
This metric counterbalances deployment frequency. Deploying ten times a day is not impressive if half those deployments break something. Change failure rate measures the quality of your testing, code review, and deployment practices.
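If you log each production deploy somewhere queryable, the arithmetic is trivial. A minimal sketch, assuming a hypothetical deploys.json where each entry carries a caused_incident flag set by your incident process:
# deploys.json (hypothetical): [{"sha": "abc123", "caused_incident": false}, ...]
# Change failure rate = deploys that caused a failure / total deploys
jq '{total: length,
     failed: (map(select(.caused_incident)) | length),
     change_failure_rate_pct: ((map(select(.caused_incident)) | length) / length * 100)}' deploys.json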
Mean Time to Recovery (MTTR)
When a deployment does cause a failure, how long does it take to restore service? High performers recover in under an hour. Low performers take between one week and one month.
MTTR depends on your monitoring (how fast you detect the problem), your deployment pipeline (how fast you can deploy a fix or rollback), and your incident response process (how fast humans can diagnose and act).
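Measuring it only requires incident start and resolution timestamps. A rough sketch, assuming your incident tool can export a hypothetical incidents.json with started_at and resolved_at fields:
# incidents.json (hypothetical): [{"started_at": "2024-05-01T10:00:00Z",
#                                  "resolved_at": "2024-05-01T10:42:00Z"}, ...]
# MTTR = mean of (resolved_at - started_at) across incidents, reported in minutes
jq 'map(((.resolved_at | fromdateiso8601) - (.started_at | fromdateiso8601)) / 60)
    | {incidents: length, mttr_minutes: (add / length)}' incidents.json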
Why DORA Works
The four metrics are deliberately balanced. You cannot game them by optimizing one at the expense of others:
- Deploying more frequently without increasing change failure rate means your testing and review processes are genuinely good.
- Low lead time without high failure rates means your pipeline is both fast and safe.
- Low MTTR means your organization can respond to problems, not just prevent them.
No single metric is useful in isolation. A team that deploys once a quarter with zero failures is not high-performing -- they are probably batching risk and getting lucky.
Measuring DORA Metrics
From Your CI/CD Pipeline
The most reliable source of DORA data is your CI/CD system. Deployment frequency and lead time can be derived from deployment events and commit timestamps:
# Deployment frequency: count deploys per week (date -d requires GNU date)
gh api repos/myorg/myapp/deployments \
  --paginate \
  --jq '.[].created_at' | \
while read -r date; do date -d "$date" +%Y-%W; done | \
sort | uniq -c
# Lead time: time from first commit in PR to deployment
# This requires correlating PR merge times with deployment times
gh pr list --state merged --json mergedAt,commits --limit 100 | \
jq '.[] | {merged: .mergedAt, first_commit: .commits[0].committedDate}'
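One way to do that correlation, sketched under the assumption that you have exported merged PRs (with their numbers and merge times) to a hypothetical prs.json and production deployments to a hypothetical deploys.json: treat each PR's lead time as the gap between its merge and the first deployment after it.
# prs.json (hypothetical): [{"number": 123, "merged": "2024-05-01T09:00:00Z"}, ...]
# deploys.json (hypothetical): [{"created_at": "2024-05-01T11:30:00Z"}, ...]
# Lead time per PR = merge time -> first deployment after the merge
jq -n --slurpfile prs prs.json --slurpfile deploys deploys.json '
  ($deploys[0] | map(.created_at | fromdateiso8601) | sort) as $d
  | $prs[0][]
  | (.merged | fromdateiso8601) as $m
  | (($d | map(select(. >= $m)) | first) // empty) as $deployed
  | {pr: .number, lead_time_hours: (($deployed - $m) / 3600)}'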
Purpose-Built Tools
Several tools automate DORA measurement:
Sleuth connects to your GitHub/GitLab, CI/CD, and incident management tools to calculate DORA metrics automatically. It tracks deployments, correlates them with incidents, and produces dashboards showing your four metrics over time.
# Sleuth integration example -- track deployments via webhook
# POST to the Sleuth API after each deployment
curl -X POST "https://app.sleuth.io/api/1/deployments/myorg/myapp/register_deploy" \
  -H "Authorization: apikey $SLEUTH_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "sha": "'"$GIT_SHA"'",
    "environment": "production",
    "date": "'"$(date -u +%Y-%m-%dT%H:%M:%SZ)"'"
  }'
Swarmia focuses on engineering effectiveness, combining DORA metrics with working agreement tracking and investment analysis (how much time goes to features vs. bugs vs. tech debt).
Faros is an open-source option that aggregates data from multiple tools (Jira, GitHub, CI/CD, PagerDuty) into a unified data model and produces DORA dashboards.
Beyond DORA: Developer Experience Metrics
DORA metrics measure the output of your engineering system. Developer experience (DX) metrics measure the input -- how productive and satisfied your developers feel. Both matter.
DX Surveys
The most direct way to measure developer experience is to ask developers. DX surveys (popularized by the DX research group led by Margaret-Anne Storey, Nicole Forsgren, and Michaela Greiler) use validated survey instruments to measure:
- Developer satisfaction: How satisfied are you with your development tools, processes, and environment?
- Perceived productivity: How productive do you feel on a typical day?
- Friction points: What slows you down most? (Build times, code review wait, unclear requirements, flaky tests)
- Flow state: How often can you enter and sustain a state of focused work?
Run these quarterly. Track trends, not absolute numbers. A 10-point drop in "I can get into flow state" is a stronger signal than most quantitative metrics.
Cycle Time Breakdown
Cycle time is the total time from "work started" to "work deployed." Breaking it down reveals where time is actually spent:
- Coding time: Time from first commit to PR opened. Measures task complexity and developer velocity.
- Pickup time: Time from PR opened to first review. Measures reviewer availability and team norms.
- Review time: Time from first review to approval. Measures review thoroughness and back-and-forth.
- Deploy time: Time from PR merged to deployed in production. Measures CI/CD pipeline efficiency.
Total cycle time: 4.5 days
Coding: 1.5 days (33%)
Pickup: 1.0 day (22%) <-- bottleneck
Review: 1.5 days (33%)
Deploy: 0.5 days (11%)
In this example, pickup time is a full day -- meaning PRs sit for a day before anyone looks at them. This is a common bottleneck that teams can fix with simple working agreements ("review PRs within 4 hours").
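Pickup time in particular is easy to approximate straight from your Git host. A sketch using the GitHub CLI (the repo name is a placeholder, and it assumes reviews are returned in submission order):
# Pickup time per merged PR: opened -> first review submitted, in hours
gh pr list --repo myorg/myapp --state merged --limit 50 \
  --json number,createdAt,reviews | \
jq '.[] | select(.reviews | length > 0) |
    {pr: .number,
     pickup_hours: (((.reviews[0].submittedAt | fromdateiso8601)
                     - (.createdAt | fromdateiso8601)) / 3600)}'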
Build and CI Metrics
Slow builds kill productivity. Flaky tests destroy trust in the test suite. Measure:
- Build time: p50 and p95 CI pipeline duration. Track weekly trends.
- Flaky test rate: Percentage of test runs that fail non-deterministically. Anything above 2% is corrosive.
- Build success rate: Percentage of CI runs that pass. Below 80% means developers are committing broken code too often.
# GitHub Actions: p50 duration per workflow over the last 100 runs
# (updated_at - created_at approximates duration, including queue time)
gh api "repos/myorg/myapp/actions/runs?per_page=100" \
  --jq '.workflow_runs[] |
    select(.conclusion == "success") |
    {duration: ((.updated_at | fromdateiso8601) - (.created_at | fromdateiso8601)), name: .name}' | \
jq -s 'group_by(.name) | .[] |
  {workflow: .[0].name, p50: (sort_by(.duration) | .[(length/2 | floor)].duration), count: length}'
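Build success rate can come from the same endpoint; a sketch over the most recent completed runs (repo name is a placeholder):
# Build success rate over the last 100 completed workflow runs
gh api "repos/myorg/myapp/actions/runs?status=completed&per_page=100" \
  --jq '.workflow_runs | map(.conclusion)' | \
jq '{runs: length,
     success_rate_pct: ((map(select(. == "success")) | length) / length * 100)}'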
Engineering Intelligence Platforms
Several platforms aggregate metrics from your development tools into unified dashboards.
LinearB
LinearB connects to GitHub/GitLab, Jira/Linear, and CI/CD systems. It provides:
- Cycle time breakdown: Coding, pickup, review, and deploy times per team and per PR.
- DORA metrics: Calculated automatically from your pipeline data.
- Working agreements: Set targets ("review PRs within 4 hours") and track compliance.
- Investment allocation: How much time goes to new features, bugs, tech debt, and operational work.
LinearB is opinionated about what good looks like, which is both its strength and its limitation: the benchmarks and recommendations are useful for teams that do not know where to start, but they can feel prescriptive for teams with domain-specific constraints.
Jellyfish
Jellyfish targets engineering leadership -- VPs and CTOs who need to connect engineering work to business outcomes. It maps engineering investment to product initiatives and business metrics. Where LinearB focuses on developer-level flow metrics, Jellyfish focuses on portfolio-level allocation and ROI.
If your question is "are our teams healthy and shipping efficiently?" -- LinearB. If your question is "are we investing engineering resources in the right things?" -- Jellyfish.
Haystack
Haystack is a newer entrant focused specifically on developer experience. It emphasizes PR-level insights -- highlighting PRs that are stuck, reviews that are bottlenecked, and contributors who might be overloaded. The UI is cleaner than most competitors, and the focus on actionable signals (rather than dashboards with dozens of charts) is a strength.
Propel (Open Source)
Propel is an open-source engineering metrics platform that provides DORA metrics, cycle time analysis, and contributor insights. If you want metrics without sending your data to a third-party SaaS, Propel is worth evaluating.
Useful vs. Harmful Metrics
Metrics That Help
- DORA metrics (at the team level): Measure system performance, not individual performance.
- Cycle time breakdown: Reveals process bottlenecks that teams can fix.
- Build time trends: Highlights when CI is degrading before it becomes painful.
- Flaky test rate: Surfaces a real productivity killer.
- Developer satisfaction surveys: Catches problems that quantitative metrics miss.
- Incident frequency and MTTR: Measures operational health.
Metrics That Harm
- Lines of code: Incentivizes verbose code. A developer who deletes 500 lines of dead code is more productive than one who adds 500 lines of new code.
- Number of commits: Incentivizes small, meaningless commits. Some work legitimately requires one large commit.
- Number of PRs: Incentivizes splitting work into trivially small PRs. Some features need large PRs.
- Individual cycle time: Comparing developer A's cycle time to developer B's ignores task complexity, seniority, context switching, and mentoring load.
- Hours worked / activity tracking: Measures presence, not output. Developers who spend 6 focused hours are more productive than those who spend 10 distracted hours. Surveillance tools like keystroke loggers, screenshot capture, or "time active in IDE" are toxic. They destroy trust, encourage performative busyness, and drive away your best developers.
The litmus test: if a metric can be improved by gaming behavior rather than actually improving, it is a bad metric. Lines of code can be gamed by being verbose. Deployment frequency cannot be gamed without actually shipping code more often (which is the goal).
Implementing Metrics Pragmatically
Start Small
Do not deploy a full engineering intelligence platform on day one. Start with:
- Deployment frequency: Count your deploys per week. You probably already know this number.
- Lead time: Measure time from PR merge to deploy. Your CI system has this data.
- One survey question: "On a scale of 1-10, how productive did you feel this week?" Ask it in your weekly retro.
These three data points, tracked over time, tell you whether things are getting better or worse.
Automate Collection
Manual metric collection does not scale and gets abandoned. Automate everything:
# GitHub Action: report deployment to your metrics system
# (METRICS_API and METRICS_TOKEN are placeholders for your own endpoint and secret)
name: Track Deployment
on:
  workflow_run:
    workflows: ["Deploy to Production"]
    types: [completed]
jobs:
  track:
    if: github.event.workflow_run.conclusion == 'success'
    runs-on: ubuntu-latest
    steps:
      - name: Record deployment
        env:
          METRICS_API: ${{ vars.METRICS_API }}
          METRICS_TOKEN: ${{ secrets.METRICS_TOKEN }}
        run: |
          curl -X POST "$METRICS_API/deployments" \
            -H "Authorization: Bearer $METRICS_TOKEN" \
            -H "Content-Type: application/json" \
            -d '{
              "service": "myapp",
              "sha": "${{ github.event.workflow_run.head_sha }}",
              "timestamp": "${{ github.event.workflow_run.updated_at }}",
              "environment": "production"
            }'
Review Regularly, Act on Findings
Metrics that nobody looks at are waste. Review your metrics monthly with the team:
- Are trends improving or degrading?
- What is the biggest bottleneck in cycle time?
- Are any teams consistently below the baseline?
- What one thing could we change to improve the most impactful metric?
The value of metrics is not the numbers -- it is the conversations they trigger and the actions they prompt. A dashboard that nobody discusses is just a decoration.
The Human Side
The best engineering teams use metrics as a mirror, not a leash. They measure their processes to identify systemic problems, not to evaluate individual performance. They combine quantitative data with qualitative feedback (surveys, retros, 1:1s) because numbers alone miss the full picture.
If you are an engineering leader adopting metrics for the first time, involve your team in choosing what to measure. Explain why you are measuring it. Share the data openly. Use it to improve processes, not to punish people. The moment developers feel surveilled rather than supported, the metrics become worse than useless -- they become actively harmful to the culture you are trying to build.
Measure the system. Trust the people.