Profiling and Benchmarking Tools for Developers
"Make it work, make it right, make it fast" — but when it's time for "fast," you need tools that show you where the time actually goes. Guessing at performance bottlenecks is reliably wrong. Profiling tools show you the truth.
This guide covers practical profiling and benchmarking tools across languages and use cases.
Flamegraphs
Flamegraphs are the single most useful performance visualization. They show where your program spends time as a stack of function calls — wide bars mean more time.
Generating Flamegraphs
Node.js:
# Record a CPU profile while running your app
node --cpu-prof --cpu-prof-interval=100 app.js
# Or with clinic.js (more user-friendly)
npx clinic flame -- node app.js
clinic.js automatically generates an interactive flamegraph HTML file. It's the fastest path from "my Node app is slow" to "here's the function causing the problem."
Python:
pip install py-spy
# Profile a running Python process
py-spy record -o profile.svg --pid 12345
# Profile a command
py-spy record -o profile.svg -- python my_script.py
# Live top-like view of a running process
py-spy top --pid 12345
py-spy works without restarting your application or modifying code. It attaches to a running process and samples the call stack.
Go:
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof/* handlers on the default mux
)

func main() {
	// Expose pprof endpoints on a separate local port
	go func() {
		log.Println(http.ListenAndServe("localhost:6060", nil))
	}()
	// ... rest of your application
}
# Capture a 30-second CPU profile interactively
go tool pprof "http://localhost:6060/debug/pprof/profile?seconds=30"
# Open the pprof web UI (includes a flamegraph view) on port 8080
go tool pprof -http=:8080 "http://localhost:6060/debug/pprof/profile?seconds=30"
Go's built-in pprof produces production-safe profiles with minimal overhead.
Rust:
# Using cargo-flamegraph (wraps perf on Linux, dtrace on macOS)
cargo install flamegraph
cargo flamegraph --bin my-app
# Or using samply for interactive profiling
cargo install samply
samply record ./target/release/my-app
Reading Flamegraphs
The x-axis is not time — it's sorted alphabetically. Width is what matters: wider bars = more time in that function. Look for:
- Wide plateaus at the top: Leaf functions consuming the most CPU
- Wide bars deep in the stack: Framework or library code you can't change (but might be calling too often)
- Unexpected functions: Why is JSON parsing taking 30% of request time?
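To make that last bullet concrete, here's a contrived sketch of a handler where JSON.parse would dominate the flamegraph because a large payload is re-parsed on every request; hoisting the parse to startup flattens that bar. The config file name is hypothetical.
// hot-json.ts: a contrived hot path. Re-parsing a large JSON blob on every request
// shows up as a wide JSON.parse bar in a flamegraph.
import { createServer } from "node:http";
import { readFileSync } from "node:fs";

const rawConfig = readFileSync("./big-config.json", "utf8"); // hypothetical large file

createServer((_req, res) => {
  const config = JSON.parse(rawConfig); // the fix: move this parse to startup
  res.end(`loaded ${Object.keys(config).length} config keys`);
}).listen(3000);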
HTTP Load Testing
hey
hey is a simple HTTP load testing tool that has largely replaced ApacheBench (ab) for quick, single-endpoint tests.
brew install hey
# Send 10,000 requests with 100 concurrent workers
hey -n 10000 -c 100 http://localhost:3000/api/users
# With custom headers and POST body
hey -n 1000 -c 50 \
-H "Authorization: Bearer token123" \
-H "Content-Type: application/json" \
-m POST \
-d '{"name":"test"}' \
http://localhost:3000/api/users
hey gives you latency distribution (p50, p90, p99), requests/second, and error rates. It's the quickest way to answer "how many requests per second can this endpoint handle?"
k6
k6 is a modern load testing tool that uses JavaScript for test scripts. It's far more capable than hey for complex scenarios.
// load-test.js
import http from 'k6/http';
import { check, sleep } from 'k6';
export const options = {
stages: [
{ duration: '30s', target: 20 }, // Ramp up to 20 users
{ duration: '1m', target: 20 }, // Stay at 20 users
{ duration: '10s', target: 0 }, // Ramp down
],
thresholds: {
http_req_duration: ['p(95)<200'], // 95% of requests under 200ms
http_req_failed: ['rate<0.01'], // Less than 1% failure rate
},
};
export default function () {
const loginRes = http.post('http://localhost:3000/api/login', JSON.stringify({
email: '[email protected]',
password: 'password123',
}), { headers: { 'Content-Type': 'application/json' } });
check(loginRes, { 'login succeeded': (r) => r.status === 200 });
const token = loginRes.json('token');
const usersRes = http.get('http://localhost:3000/api/users', {
headers: { Authorization: `Bearer ${token}` },
});
check(usersRes, {
'status is 200': (r) => r.status === 200,
'has users': (r) => r.json('users').length > 0,
});
sleep(1);
}
k6 run load-test.js
k6 can simulate realistic user flows (login, browse, checkout), define pass/fail thresholds, and output results to Grafana, Datadog, or other dashboards.
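For CI, a lightweight option is k6's handleSummary hook, which receives the end-of-test summary and lets you write it to a file (note that defining it replaces the default console summary). A minimal sketch to append to load-test.js:
// Write the end-of-test summary to a JSON file that CI can archive or diff.
export function handleSummary(data) {
  return {
    'summary.json': JSON.stringify(data, null, 2),
  };
}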
oha
oha is a Rust-based HTTP load tester with a beautiful terminal UI that shows real-time latency distribution.
brew install oha
# Basic load test with TUI
oha -n 10000 -c 100 http://localhost:3000/api/health
# JSON output for scripting
oha -n 5000 -c 50 --json http://localhost:3000/api/users
The real-time histogram showing latency distribution as the test runs is genuinely useful — you can see if latency degrades over time.
Microbenchmarking
hyperfine
hyperfine benchmarks shell commands with statistical rigor.
brew install hyperfine
# Compare two commands
hyperfine 'node build.js' 'bun build.ts'
# With warmup runs
hyperfine --warmup 3 'python process.py' 'python process_optimized.py'
# Export results
hyperfine --export-markdown results.md 'grep -r pattern .' 'rg pattern .'
hyperfine handles warmup runs, statistical outlier detection, and comparative analysis. It's the right tool for "which is faster, A or B?"
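hyperfine can also write machine-readable results with --export-json, which makes it easy to fail CI on a regression. Below is a rough TypeScript sketch of such a gate; the results.json name and the 10% threshold are arbitrary choices, and the command order matches the order you pass commands to hyperfine.
// compare-bench.ts: fail if the second command's mean time regressed by more than 10%
// versus the first, using hyperfine's --export-json output (results[].mean, in seconds).
import { readFileSync } from "node:fs";

interface HyperfineResult { command: string; mean: number; }
interface HyperfineExport { results: HyperfineResult[]; }

const report: HyperfineExport = JSON.parse(readFileSync("results.json", "utf8"));
const [baseline, candidate] = report.results;

const ratio = candidate.mean / baseline.mean;
console.log(`${candidate.command}: ${ratio.toFixed(2)}x the baseline mean`);

if (ratio > 1.1) {
  console.error("Regression: candidate is more than 10% slower than baseline");
  process.exit(1);
}
Run hyperfine with --export-json results.json on the two commands, then execute the script with tsx or bun.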
JavaScript Benchmarks with mitata
Bun doesn't ship a built-in benchmark runner; its own docs recommend the mitata library for microbenchmarks.
// bench.ts
import { bench, run } from "mitata";
const data = Array.from({ length: 10000 }, (_, i) => ({ id: i, name: `user-${i}` }));
bench("JSON.stringify", () => {
  JSON.stringify(data);
});
bench("structuredClone", () => {
  structuredClone(data);
});
await run();
bun add mitata
bun bench.ts
Go Benchmarks
Go has built-in benchmarking in its testing framework:
// sort_bench_test.go
package sortdemo // use the name of the package under test

import (
	"math/rand"
	"sort"
	"testing"
)

func BenchmarkSort(b *testing.B) {
	data := make([]int, 10000)
	for i := range data {
		data[i] = rand.Intn(10000)
	}
	b.ResetTimer() // exclude setup from the measurement
	for i := 0; i < b.N; i++ {
		sorted := make([]int, len(data))
		copy(sorted, data)
		sort.Ints(sorted)
	}
}
go test -bench=. -benchmem ./...
# Compare benchmarks between commits
go install golang.org/x/perf/cmd/benchstat@latest
git stash && go test -bench=. -count=10 > old.txt
git stash pop && go test -bench=. -count=10 > new.txt
benchstat old.txt new.txt
benchstat shows whether performance changes are statistically significant, which prevents false conclusions from noisy benchmarks.
Memory Profiling
Node.js
# Generate a heap snapshot
node --inspect app.js
# Then open chrome://inspect and take a heap snapshot
# Or use clinic.js's heap profiler
npx clinic heapprofiler -- node app.js
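What you're usually looking for is memory that grows between two snapshots taken before and after exercising the app. Here's a contrived TypeScript sketch of the classic culprit, an unbounded in-memory cache; the names and sizes are made up.
// leaky-cache.ts: every request retains 64 KB in a Map that is never pruned.
// Take a heap snapshot, hammer the endpoint, take another, and diff the retained size.
import { createServer } from "node:http";

const responseCache = new Map<string, Buffer>(); // no TTL, no max size

createServer((req, res) => {
  const key = `${req.url}:${Date.now()}`;          // unique key on every hit defeats reuse
  responseCache.set(key, Buffer.alloc(64 * 1024)); // 64 KB retained per request
  res.end(`cached entries: ${responseCache.size}`);
}).listen(3000);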
Valgrind (C/C++/Rust)
# Memory leak detection
valgrind --leak-check=full ./my-program
# Cache profiling (find cache-unfriendly code)
valgrind --tool=cachegrind ./my-program
Go Memory Profiling
# Generate memory profile
go tool pprof http://localhost:6060/debug/pprof/heap
# Show allocations
go tool pprof -alloc_space http://localhost:6060/debug/pprof/heap
Continuous Profiling
For production systems, continuous profiling captures low-overhead profiles over time so you can analyze performance without reproducing issues locally.
Grafana Pyroscope (the open-source Pyroscope project, now part of Grafana) collects profiles from running services and lets you compare performance across deployments, find regressions, and correlate profiles with metrics.
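As an illustration, instrumenting a Node service takes only a few lines with the @pyroscope/nodejs client. This is a minimal sketch: the server address and app name are placeholders, and the client's current docs have the full set of options.
// profiling.ts: start continuous profiling and ship samples to a Pyroscope server
import Pyroscope from "@pyroscope/nodejs";

Pyroscope.init({
  serverAddress: "http://pyroscope:4040", // placeholder: your Pyroscope / Grafana endpoint
  appName: "checkout-service",            // placeholder service name shown in the UI
});

Pyroscope.start(); // begins sampling and uploading profiles in the background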
Workflow Recommendations
- Start with the question: "Is this endpoint slow?" → Load test it. "What's slow about it?" → Profile it.
- Measure before optimizing: Run benchmarks, save the results, then optimize and compare.
- Profile in production-like conditions: A profiler running locally with 1 user doesn't catch the same issues as 1000 concurrent users.
- Use flamegraphs first: They answer "where does the time go?" faster than any other tool.
- Track performance over time: Add benchmarks to CI. Tools like benchstat and k6 thresholds catch regressions before they ship.