Regex and Text Processing Tools for Developers
Text processing is the unglamorous core of developer work. Parsing logs, reshaping JSON, extracting fields from CSV, cleaning up data before import -- you do it constantly, and the right tool turns a 30-minute script into a 10-second one-liner.
This guide covers the tools that matter, the patterns you'll reach for most often, and clear recommendations for which tool to use when.
Regex Testing: regex101
regex101.com is the best regex testing tool. It supports PCRE2, Python, JavaScript, Go, Java, and .NET flavors. The features that make it indispensable:
- Real-time match highlighting as you type the pattern
- Explanation panel that breaks down your regex into plain English
- Substitution testing so you can verify replacements before running them
- Saved patterns with shareable permalinks (great for code reviews)
When you're building anything beyond a simple pattern, open regex101 first. Write and verify the regex there, then paste it into your code. RegExr (regexr.com) is a decent alternative with a community pattern library, and Debuggex generates railroad diagrams for visualizing complex patterns. But regex101 handles all the major flavors, and the explanation panel alone makes it the default.
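As a concrete example of that workflow, here's a hypothetical pattern for pulling ISO-style dates out of a log -- verify it on regex101 in the PCRE flavor, then drop it straight into GNU grep:
# Extract unique YYYY-MM-DD dates (requires GNU grep for -P)
grep -oP '\d{4}-\d{2}-\d{2}' app.log | sort -u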
jq: JSON Processing on the Command Line
jq is the single most useful text processing tool for modern development. APIs return JSON, config files are JSON, logs are often JSON.
# Pretty-print JSON
curl -s https://api.example.com/data | jq .
# Extract fields (nested or not)
jq '.name' data.json
jq '.config.database.host' settings.json
# Iterate over arrays and extract fields
jq '.users[] | .name' data.json
# Filter array elements
jq '.events[] | select(.type == "error")' logs.json
# Build new objects from existing data
jq '.users[] | {name: .name, email: .contact.email}' data.json
# Count, sort, unique
jq '.results | length' response.json
jq '[.logs[].level] | unique' app.json
jq '.items | sort_by(.date) | reverse' data.json
Real-World jq Recipes
# Parse AWS CLI output -- get running instance IDs
aws ec2 describe-instances | jq -r \
'.Reservations[].Instances[] | select(.State.Name == "running") | .InstanceId'
# Convert JSON array to CSV
jq -r '.users[] | [.name, .email, .role] | @csv' users.json
# Merge two JSON files
jq -s '.[0] * .[1]' base.json overrides.json
# Group and count by field
jq 'group_by(.status) | map({status: .[0].status, count: length})' orders.json
The -r flag (raw output) strips quotes from strings -- essential when piping jq output to other commands.
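A quick sketch of the difference, assuming a hypothetical names.json with a top-level users array:
# Without -r the string keeps its quotes: "alice"
jq '.users[0].name' names.json
# With -r you get the bare string alice, safe to feed into xargs or a while-read loop
jq -r '.users[0].name' names.json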
yq: jq but for YAML
yq applies jq-like syntax to YAML. Essential if you work with Kubernetes manifests, GitHub Actions workflows, or Docker Compose files.
There are two tools called yq -- Mike Farah's Go version and a Python wrapper. The Go version is the one you want. It's faster, standalone, and more actively maintained.
brew install yq # Mike Farah's Go version
yq '.metadata.name' deployment.yaml # Read a field
yq -i '.spec.replicas = 3' deployment.yaml # Update in-place
yq -o=json eval '.' config.yaml # Convert YAML to JSON
yq eval-all 'select(fileIndex == 0) * select(fileIndex == 1)' base.yaml overlay.yaml # Merge
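If you're not sure which yq is already on your machine, the version string is a rough tell: Mike Farah's Go build identifies itself with a github.com/mikefarah/yq URL, while the Python wrapper prints only a bare version number (exact output varies by release).
yq --version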
sed: Practical Patterns Only
sed has a reputation for being cryptic, but used for what it's good at -- find-and-replace across files -- it's straightforward.
# Replace all occurrences (in-place)
sed -i 's/old/new/g' file.txt
sed -i '' 's/old/new/g' file.txt # macOS (requires empty backup suffix)
# Delete lines matching a pattern
sed '/^#/d' config.txt # Remove comment lines
sed '/^$/d' file.txt # Remove blank lines
# Replace on specific lines
sed '10,20s/old/new/g' file.txt # Lines 10-20 only
# Multiple operations
sed -e 's/foo/bar/g' -e 's/baz/qux/g' file.txt
# Replace across all TypeScript files (fd + sed)
fd -e ts --exec sed -i 's/oldFunction/newFunction/g'
When to skip sed: If your replacement involves complex logic, conditionals, or multi-line patterns, switch to awk or a real script. Fighting sed to do things it wasn't designed for wastes time.
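For example, a conditional edit like "change the second column, but only on lines whose first column is timeout" is awkward in sed but a one-liner in awk (hypothetical config.txt with whitespace-separated columns):
# Rewrite column 2 only when column 1 matches; pass every other line through untouched
awk '$1 == "timeout" {$2 = 60} {print}' config.txt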
awk: One-Liners That Are Actually Useful
awk processes text line by line, splitting each line into fields ($1, $2, etc.) separated by whitespace. You only need 5% of the language.
# Print specific columns
awk '{print $1, $3}' data.txt
# Custom field separator
awk -F',' '{print $1, $2}' data.csv
awk -F':' '{print $1}' /etc/passwd
# Filter by condition
awk '$3 > 100 {print $1, $3}' sales.txt
# Sum a column
awk '{sum += $2} END {print sum}' numbers.txt
# Count matches
awk '/ERROR/ {count++} END {print count}' app.log
# Print unique values in a column (deduplicate)
awk '!seen[$1]++ {print $1}' data.txt
# Print lines longer than 80 characters
awk 'length > 80' code.txt
# Print last field on each line (useful for paths)
awk -F'/' '{print $NF}' paths.txt
# Summarize HTTP status codes from access log
awk '{print $9}' access.log | sort | uniq -c | sort -rn
xsv and csvkit: CSV Done Right
Parsing CSV with awk seems easy until you hit quoted fields containing commas. Use a proper CSV tool.
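To see the failure mode, take a hypothetical one-line contacts.csv; awk splits on every comma, including the one inside the quoted name:
# contacts.csv contains:  "Smith, John",jsmith@example.com
awk -F',' '{print $2}' contacts.csv   # prints ' John"' -- not the email address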
xsv (Rust, fast) handles large files and basic operations:
xsv table data.csv | head -20 # Aligned column view
xsv select name,email users.csv # Select columns
xsv search -s status "active" users.csv # Filter rows
xsv sort -s revenue -N -R sales.csv # Sort descending (numeric)
xsv stats data.csv | xsv table # Column statistics
xsv join id users.csv user_id orders.csv # Join on shared column
csvkit (Python) adds format conversion and SQL:
in2csv data.xlsx > data.csv # Excel to CSV
csvsql --query "SELECT name, SUM(amount) FROM orders GROUP BY name" orders.csv
csvlook data.csv # Pretty-print
Use xsv for speed. Use csvkit when you need SQL queries or format conversion.
Miller (mlr): Format-Aware Data Processing
Miller handles CSV, TSV, JSON, and other structured formats with a single tool. It's what awk would be if awk understood data formats natively.
mlr --icsv --ojson cat data.csv # CSV to JSON
mlr --ijson --ocsv cat data.json # JSON to CSV
mlr --csv filter '$age > 30' people.csv # Filter records
mlr --csv put '$total = $price * $quantity' orders.csv # Computed fields
mlr --csv stats1 -a sum -f revenue -g region sales.csv # Group-by aggregation
mlr --csv sort -nr revenue data.csv # Sort descending
Miller's real power is chaining operations and converting between formats in a single pipeline.
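A sketch of what that looks like, assuming a hypothetical orders.csv with status, price, and quantity columns:
# Filter, add a computed field, sort, and convert to JSON in one pass
mlr --icsv --ojson filter '$status == "paid"' then put '$total = $price * $quantity' then sort -nr total orders.csv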
fx and gron: JSON Exploration
When you don't know the structure of a JSON blob and need to explore it:
fx gives you an interactive terminal UI. Arrow keys expand and collapse nodes -- great for unfamiliar API responses.
curl -s https://api.example.com/data | fx
gron flattens JSON into discrete assignments, making it greppable:
gron data.json
# json.name = "test";
# json.items[0].id = 1;
gron data.json | grep "name" # Find fields by name
gron data.json | grep "items" | gron --ungron # Unflatten back to JSON
gron answers "where in this giant JSON blob is the field I'm looking for?" Flatten, grep, done.
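A handy follow-on (a workflow habit, not a gron feature): the path gron prints is, minus the leading json, usually a valid jq path, so you can grep for a field and paste the path straight into jq. Hypothetical field names below:
gron response.json | grep "access_token"
# json.data.credentials.access_token = "abc1234";
jq '.data.credentials.access_token' response.json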
Patterns You'll Actually Use
These come up weekly in real development work.
Log Analysis
# Count errors by type
grep -oP 'ERROR: \K\w+' app.log | sort | uniq -c | sort -rn
# Busiest hour in access logs
awk '{print $4}' access.log | cut -d: -f2 | sort | uniq -c | sort -rn
# All unique IP addresses
grep -oP '\d+\.\d+\.\d+\.\d+' access.log | sort -u
# Tail JSON logs with pretty-printing
tail -f app.log | jq .
Data Transformation
# JSON to CSV
jq -r '.records[] | [.id, .name, .email] | @csv' data.json > output.csv
# CSV to JSON
mlr --icsv --ojson cat data.csv > data.json
# TSV to CSV
mlr --itsv --ocsv cat data.tsv > data.csv
Quick Data Inspection
head -5 data.csv | column -t -s',' # Peek at structure
awk -F',' '{print $3}' data.csv | sort -u | wc -l # Unique values in col 3
awk -F',' '{print $3}' data.csv | sort | uniq -c | sort -rn | head -10 # Most common
Tool Recommendations by Use Case
| Task | Best Tool | Runner-Up |
|---|---|---|
| Test/debug a regex | regex101 | RegExr |
| Parse JSON from APIs | jq | fx (for exploration) |
| Edit YAML config files | yq | sed (simple replacements) |
| Find-and-replace across files | sed + fd | ripgrep --replace |
| Columnar text processing | awk | cut (trivial cases) |
| CSV operations | xsv | csvkit (SQL or conversions) |
| Format conversion (CSV/JSON/TSV) | miller | jq + csvkit |
| Explore unknown JSON | gron | fx |
| Log analysis | awk + grep | jq (JSON logs) |
The Bottom Line
Start with jq -- it covers the most common modern use case and the skills transfer to yq for YAML. Add sed and awk patterns to your muscle memory for general text wrangling. Pick up xsv or miller when you're doing serious CSV or data format work.
The key insight is knowing which tool to reach for. Don't write a Python script to extract a field from JSON when jq '.field' does it. Don't fight sed into doing multi-line transformations when awk handles it cleanly. Match the tool to the task and you'll spend less time processing text and more time on the work that actually matters.