Code Search at Scale: Sourcegraph, OpenGrok, and Alternatives
At some point, grep -r stops being enough. Maybe your codebase spans 50 repositories. Maybe you need to find every caller of a deprecated function across all of them. Maybe you are onboarding onto a new team and need to understand how a library is actually used, not just how its README says it should be used.
Code search tools index your entire codebase and let you query it instantly -- regex, literal, structural, or semantic. The difference between grepping locally and having a proper code search system is the difference between searching one bookshelf and searching the entire library. Once you have it, you will wonder how you worked without it.
Why grep Is Not Enough
Let's be clear: for a single repository, grep and its modern replacements (ripgrep, ag) are excellent. Ripgrep typically searches even a large codebase in well under a second. If you are working in one repo and know roughly where to look, you do not need anything else.
Code search tools solve different problems:
- Cross-repository search: Find every usage of parseJSON across 200 microservice repos.
- Historical search: Search across all branches, tags, and commit history.
- Structural search: Find if err != nil { return err } patterns regardless of whitespace or variable names.
- Code intelligence: Jump to definition, find references, and understand type hierarchies across repo boundaries.
- Persistent queries: Save searches, share them with teammates, link to specific results.
- Non-local search: Search code you have not cloned. Search code you do not even have access to clone (in a shared Sourcegraph instance).
If your organization has more than 10 repositories, or more than a few hundred thousand lines of code, a dedicated code search tool usually pays for itself quickly.
Sourcegraph
Sourcegraph is the most capable code search tool available. It indexes your repositories (GitHub, GitLab, Bitbucket, or any Git host), provides instant regex and structural search, and layers on code intelligence features like go-to-definition and find-references that work across repositories.
Getting Started
Sourcegraph can run as a single Docker container for evaluation:
docker run --publish 7080:7080 \
  --publish 127.0.0.1:3370:3370 \
  --volume sourcegraph-data:/var/opt/sourcegraph \
  sourcegraph/server:5.3.0
Connect it to your code host:
- Navigate to http://localhost:7080
- Go to Site Admin > Repositories > Manage code hosts
- Add your GitHub/GitLab connection with an access token
- Sourcegraph clones and indexes your repositories
For production, Sourcegraph offers a Kubernetes deployment (self-hosted) and a fully managed cloud version.
Search Syntax
Sourcegraph's search language is powerful. A few examples:
# Literal search across all repos
parseJSON
# Regex search
repo:myorg/.* file:\.go$ func\s+Handle\w+
# Search specific repos
repo:^github\.com/myorg/api-server$ TODO
# Filter by language
lang:typescript useEffect
# Search in specific file paths
file:src/middleware/ auth
# Exclude patterns
-file:test -file:vendor lang:go error
# Search commit messages
type:commit fix auth
# Search diffs (what changed)
type:diff repo:myorg/api-server parseJSON
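When queries like these end up in scripts or saved dashboards, it can help to assemble them from parts rather than concatenating strings by hand. The following is a minimal sketch in Python; `build_query` and its parameter names are hypothetical helpers invented for illustration, and only the `repo:`, `lang:`, `file:`, `-file:`, and `type:` filters shown above are assumed.

```python
def build_query(pattern, repo=None, lang=None,
                include_files=(), exclude_files=(), result_type=None):
    """Compose a search query string from the filters shown above.

    This helper is hypothetical; it only joins filter terms with spaces
    and does no escaping or validation.
    """
    parts = []
    if repo:
        parts.append(f"repo:{repo}")
    if lang:
        parts.append(f"lang:{lang}")
    parts.extend(f"file:{path}" for path in include_files)
    parts.extend(f"-file:{path}" for path in exclude_files)
    if result_type:
        parts.append(f"type:{result_type}")
    parts.append(pattern)
    return " ".join(parts)


if __name__ == "__main__":
    # Rebuilds the "exclude patterns" example from the list above.
    print(build_query("error", lang="go", exclude_files=["test", "vendor"]))
```

The filter order in the output is a choice made by this sketch, not something the query language requires.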
Structural Search
This is where Sourcegraph genuinely differentiates itself. Structural search understands code syntax -- it matches balanced brackets, respects string boundaries, and ignores whitespace differences.
# Find all try-catch blocks that catch and ignore errors
lang:typescript try { :[body] } catch (:[_]) { }
# Find React useState with specific patterns
lang:typescript const [:[state], :[setter]] = useState(:[init])
# Find Go error handling that wraps errors
lang:go if err != nil { return fmt.Errorf(:[msg], err) }
# Find Python functions with more than 3 parameters
lang:python def :[name](:[p1], :[p2], :[p3], :[rest])
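The :[name] syntax matches any code fragment, and :[_] matches a fragment you do not care to name. This lets you find patterns that regex cannot express without becoming unreadable. Structural patterns are only interpreted in structural mode, which you can select by adding patterntype:structural to the query.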
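To make the idea of holes and balanced brackets concrete, here is a toy matcher in Python. It is a sketch of the concept only, not how Sourcegraph implements structural search: it splits code into crude tokens, drops whitespace, and lets each `:[name]` hole absorb the shortest bracket-balanced run of tokens. Unlike the real feature it does not understand string literals or comments, and all function names are made up for this example.

```python
import re

PATTERN_TOKEN = re.compile(r":\[\w+\]|\w+|\S")  # holes, words, single symbols
SOURCE_TOKEN = re.compile(r"\w+|\S")            # whitespace is dropped entirely
OPENERS, CLOSERS = set("([{"), set(")]}")


def is_balanced(tokens):
    """True if brackets open and close without ever going negative."""
    depth = 0
    for tok in tokens:
        if tok in OPENERS:
            depth += 1
        elif tok in CLOSERS:
            depth -= 1
            if depth < 0:
                return False
    return depth == 0


def match_from(pattern, tokens, start):
    """Match pattern against tokens[start:]; return hole bindings or None."""
    def step(pi, ti, bound):
        if pi == len(pattern):
            return bound
        part = pattern[pi]
        if part.startswith(":["):
            name = part[2:-1]
            # Try the shortest balanced run first, then grow it.
            for end in range(ti, len(tokens) + 1):
                chunk = tokens[ti:end]
                if is_balanced(chunk):
                    result = step(pi + 1, end, {**bound, name: "".join(chunk)})
                    if result is not None:
                        return result
            return None
        if ti < len(tokens) and tokens[ti] == part:
            return step(pi + 1, ti + 1, bound)
        return None
    return step(0, start, {})


def structural_find(template, source):
    """Return one dict of hole bindings per match of template in source."""
    pattern = PATTERN_TOKEN.findall(template)
    tokens = SOURCE_TOKEN.findall(source)
    matches = []
    for start in range(len(tokens)):
        bound = match_from(pattern, tokens, start)
        if bound is not None:
            matches.append(bound)
    return matches
```

Because whitespace never becomes a token, a template written on one line matches code spread over several, which is the property that makes structural patterns more robust than a hand-written regex.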
Code Intelligence
Sourcegraph provides IDE-like navigation in your browser:
- Go to definition: Click a function call, jump to where it is defined -- even if it is in a different repository.
- Find references: See every caller of a function across all indexed repositories.
- Hover documentation: See type signatures and docstrings inline.
Code intelligence works through two mechanisms: search-based (heuristic, works for any language) and precise (based on SCIP indexers that produce compiler-grade accuracy). Precise code intelligence requires running an indexer as part of your CI pipeline:
# For TypeScript
npm install -g @sourcegraph/scip-typescript
scip-typescript index --output index.scip
src code-intel upload -file=index.scip
# For Go
go install github.com/sourcegraph/scip-go/cmd/scip-go@latest
scip-go
src code-intel upload -file=index.scip
Batch Changes
Sourcegraph's batch changes feature lets you make automated code modifications across many repositories. Define a change, preview it, and create pull requests across dozens or hundreds of repos at once:
# batch-change.yaml
name: update-logging-library
description: Replace log.Printf with slog.Info across all Go services
on:
  - repositoriesMatchingQuery: lang:go log.Printf repo:myorg
steps:
  - run: comby 'log.Printf(:[args])' 'slog.Info(:[args])' -matcher .go -in-place
    container: comby/comby
changesetTemplate:
  title: "Migrate from log.Printf to slog.Info"
  body: "Automated migration to structured logging."
  branch: migrate-to-slog
  commit:
    message: "refactor: migrate from log.Printf to slog.Info"

src batch preview -f batch-change.yaml
src batch apply -f batch-change.yaml
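This is incredibly powerful for large-scale refactoring -- updating a deprecated API, fixing a security pattern, or enforcing a new coding standard across an entire organization. Treat the spec above as an illustration of the mechanics: log.Printf takes a format string while slog.Info takes a message followed by key-value pairs, so a real migration needs a more careful rewrite rule than a one-to-one rename.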
Sourcegraph Pricing and Deployment
Sourcegraph's plans have included a free tier, self-hosted deployments, and paid enterprise tiers. The self-hosted option is what most teams evaluate first. Be aware that Sourcegraph's pricing and licensing have shifted significantly over the years -- check the current pricing page carefully and confirm whether the features you need are in the free or paid tier.
OpenGrok
OpenGrok is the old guard of code search. Developed by Oracle (originally Sun Microsystems), it is a Java-based source code search and cross-reference engine. It is open source, mature, and still widely used in large organizations -- especially those with legacy codebases in C, C++, and Java.
Setup
OpenGrok runs as a Java web application:
# Docker is the easiest path
docker run -d \
  --name opengrok \
  -p 8080:8080 \
  -v /path/to/your/source:/opengrok/src \
  -v opengrok-data:/opengrok/data \
  opengrok/docker:latest
OpenGrok indexes the source files in /opengrok/src and serves a web interface on port 8080. It supports dozens of languages through ctags-based analysis.
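If "ctags-based analysis" sounds abstract, the core idea is a table that maps each symbol name to the places where it is defined. The Python sketch below builds such a table with a single regular expression. It is a deliberately crude stand-in for what ctags does with real per-language parsers, and the function name and regex are assumptions made for this illustration, not part of OpenGrok.

```python
import re
from collections import defaultdict

# Very rough heuristic: a keyword that introduces a definition, then a name.
# It misses many forms (for example Go methods with receivers).
DEFINITION = re.compile(r"^\s*(?:def|class|func|function)\s+(\w+)", re.MULTILINE)


def index_definitions(files):
    """Map symbol name -> list of (path, line_number) where it is defined.

    `files` is a dict of path -> file content, standing in for a source tree.
    """
    index = defaultdict(list)
    for path, content in files.items():
        for match in DEFINITION.finditer(content):
            # Count newlines before the symbol name to get a 1-based line.
            line = content.count("\n", 0, match.start(1)) + 1
            index[match.group(1)].append((path, line))
    return dict(index)
```

Clicking a symbol in a cross-reference view then amounts to a lookup in this table, which is why the approach scales to many languages but cannot tell two same-named symbols apart the way a compiler-backed indexer can.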
Strengths and Limitations
Strengths: Free and open source. Handles very large codebases (millions of lines). Supports cross-referencing (click a symbol to see its definition). Mature and battle-tested in enterprise environments. Low resource requirements compared to Sourcegraph.
Limitations: The UI feels dated -- it is functional but not modern. No structural search. No code intelligence beyond ctags-based cross-referencing. No batch changes or automated refactoring. Setup and indexing configuration can be fiddly. Community development has slowed considerably.
OpenGrok is a solid choice if you need free, self-hosted code search with cross-referencing and do not need the advanced features of Sourcegraph. It is not the right choice if you want a modern developer experience.
Hound
Hound is a fast code search tool built by Etsy. It is lightweight, easy to deploy, and focused on one thing: regex search across repositories.
Setup
# Clone and build
git clone https://github.com/hound-search/hound.git
cd hound
go build ./cmd/houndd
# Create a config file
cat > config.json << 'EOF'
{
  "max-concurrent-indexers": 4,
  "repos": {
    "api-server": {
      "url": "https://github.com/myorg/api-server.git"
    },
    "web-client": {
      "url": "https://github.com/myorg/web-client.git"
    }
  }
}
EOF
# Run it
./houndd --conf config.json
Hound clones the repositories, indexes them with its own trigram index, and serves a clean web UI on port 6080. Searches are fast -- typically under 100 milliseconds even across many repositories.
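A trigram index is simple enough to sketch in a few lines. The Python below is an illustrative toy, not Hound's implementation: it handles only literal queries (a real engine also derives trigrams from regular expressions), keeps everything in memory, and uses class and method names invented for this example.

```python
from collections import defaultdict


def trigrams(text):
    """Return the set of all 3-character substrings of text."""
    return {text[i:i + 3] for i in range(len(text) - 2)}


class TrigramIndex:
    def __init__(self):
        self.postings = defaultdict(set)  # trigram -> paths containing it
        self.docs = {}                    # path -> full content

    def add(self, path, content):
        self.docs[path] = content
        for gram in trigrams(content):
            self.postings[gram].add(path)

    def search(self, literal):
        """Return sorted paths whose content contains the literal string."""
        grams = trigrams(literal)
        if grams:
            # Only files containing every trigram of the query can match.
            candidates = set.intersection(
                *(self.postings.get(g, set()) for g in grams)
            )
        else:
            # Queries shorter than three characters cannot use the index.
            candidates = set(self.docs)
        # The index over-approximates, so confirm each candidate.
        return sorted(p for p in candidates if literal in self.docs[p])
```

The final verification step matters: a file can contain every trigram of the query without containing the query itself, so the index narrows the search and a plain scan of the few remaining candidates confirms the hits.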
Strengths and Limitations
Strengths: Extremely simple to deploy and operate. Fast regex search. Clean, responsive UI. Low resource usage. Good enough for many small-to-medium organizations.
Limitations: No code intelligence (no go-to-definition or find-references). No structural search. No search history or saved queries. No integration with code review or CI. The project receives sporadic maintenance.
Hound is the right tool when you want cross-repo search and nothing else. It fills the gap between "grep in one repo" and "full Sourcegraph deployment" with minimal operational overhead.
livegrep
livegrep takes a different approach to code search. Rather than relying on a trigram index, it builds a suffix array over the entire codebase and keeps it in memory, so queries are answered in real time. The result is absurdly fast regex search -- often sub-10-millisecond responses on multi-gigabyte codebases.
How It Works
livegrep was created by Nelson Elhage and is open source. It consists of two components: a backend (codesearch) that holds the index in memory, and a web frontend.
# Build from source
git clone https://github.com/livegrep/livegrep.git
cd livegrep
bazel build //...
# Create index config
cat > index.json << 'EOF'
{
  "name": "myorg",
  "repositories": [
    {
      "name": "api-server",
      "path": "/path/to/api-server",
      "revisions": ["HEAD"]
    }
  ]
}
EOF
# Build index and run
bazel-bin/src/tools/codesearch -index_only -index index.json
bazel-bin/src/tools/codesearch -load_index index.json -listen grpc://localhost:9999
bazel-bin/cmd/livegrep-server/livegrep-server -listen :8910 -connect localhost:9999
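For intuition about the suffix array mentioned above, here is a small Python sketch of substring lookup. It is an assumption-laden toy, not livegrep's code: construction here is the naive sort of every suffix, which is far too slow for real corpora, and only literal substrings are handled rather than regular expressions. The function names are hypothetical.

```python
def build_suffix_array(text):
    """Return the start offsets of all suffixes of text, in sorted order."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_all(text, suffix_array, needle):
    """Return every offset where needle occurs, using two binary searches."""
    n = len(needle)

    def prefix(k):
        # The first n characters of the k-th smallest suffix.
        start = suffix_array[k]
        return text[start:start + n]

    # Lower bound: first suffix whose prefix is >= needle.
    lo, hi = 0, len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) < needle:
            lo = mid + 1
        else:
            hi = mid
    first = lo

    # Upper bound: first suffix whose prefix is > needle.
    hi = len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) <= needle:
            lo = mid + 1
        else:
            hi = mid

    # Every suffix in [first, lo) starts with needle.
    return sorted(suffix_array[first:lo])
```

Because all suffixes that start with the query sit next to each other in sorted order, a lookup costs a logarithmic number of comparisons regardless of corpus size. The price is the memory for one integer per character of source, which is where the RAM appetite comes from.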
Strengths and Limitations
Strengths: Blazingly fast. Regex search feels instant, which changes how people use it: you refine a query as you type instead of waiting on each attempt. Results reflect whatever the last index build captured, so freshness depends on how often you re-index. Good for organizations that prioritize search speed above all else.
Limitations: Memory hungry -- the entire codebase (plus index) must fit in RAM. The build system uses Bazel, which is non-trivial to set up. No code intelligence. The web UI is minimal. Documentation is sparse. Fewer active maintainers than other options.
Choosing the Right Tool
For Small Teams (1-20 developers, < 50 repos)
Start with ripgrep locally and Hound for cross-repo search. Hound takes 10 minutes to set up and covers the most common use case: "where is this function used across our services?"
For Medium Teams (20-100 developers, 50-500 repos)
Sourcegraph's self-hosted deployment is worth the setup cost. Cross-repo code intelligence and structural search become genuinely valuable at this scale. The ability to search commit history and diffs saves hours of archaeology.
For Large Organizations (100+ developers, 500+ repos)
Sourcegraph (self-hosted or cloud) is the standard answer. Batch changes alone justify the cost when you need to make coordinated changes across hundreds of repositories. OpenGrok is an alternative if budget is a hard constraint and you can live without modern features.
For Speed-Obsessed Teams
Choose livegrep if you have the memory budget and the Bazel patience. Nothing else matches its raw search speed.
Search-Driven Development Workflows
Beyond finding code, search tools enable workflows that are impossible with local grep:
Dependency impact analysis: Before changing a library's API, search for all callers across all consuming services. Know exactly what will break before you break it.
Security auditing: Search for known-risky patterns such as eval(, dangerouslySetInnerHTML, and sql.Raw( across your entire codebase. Run these searches on a schedule.
Onboarding: New team members can search for how a pattern is used in practice, not just how the docs say to use it. "Show me every file that configures the database connection" is a faster onboarding tool than any wiki page.
Code review context: When reviewing a PR that changes a shared utility, search for all other usages to understand the full blast radius.
Standards enforcement: Search for deprecated patterns and track their removal over time. "How many files still use the old logging library?" becomes a dashboard metric.
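The scheduled-search and dashboard-metric workflows above boil down to counting matches per repository over time. A hosted search tool would answer this from its index; the Python sketch below shows the same measurement done the slow way over local checkouts, which is enough to prototype a metric. The function name, the default extensions, and the skipped directory names are all assumptions chosen for this example.

```python
import os
import re

SKIPPED_DIRS = {".git", "vendor", "node_modules"}  # assumed, adjust to taste


def count_matching_files(repo_roots, pattern, extensions=(".go", ".ts", ".py")):
    """Return {repo_name: number of files whose content matches pattern}."""
    regex = re.compile(pattern)
    extensions = tuple(extensions)
    counts = {}
    for root in repo_roots:
        hits = 0
        for dirpath, dirnames, filenames in os.walk(root):
            # Prune directories in place so os.walk does not descend into them.
            dirnames[:] = [d for d in dirnames if d not in SKIPPED_DIRS]
            for name in filenames:
                if not name.endswith(extensions):
                    continue
                path = os.path.join(dirpath, name)
                try:
                    with open(path, encoding="utf-8", errors="ignore") as handle:
                        if regex.search(handle.read()):
                            hits += 1
                except OSError:
                    continue  # unreadable file: skip rather than fail the run
        counts[os.path.basename(os.path.normpath(root))] = hits
    return counts
```

Run on a schedule with a pattern such as `log\.Printf\(` and written to a time series, the per-repository counts show whether a deprecated pattern is actually shrinking.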
Local Search Tools Worth Knowing
Even with a centralized code search system, you spend most of your time searching locally. These tools are worth having in your toolkit:
ripgrep (rg): One of the fastest general-purpose search tools. It respects .gitignore by default, handles Unicode correctly, and optionally supports PCRE2 regexes via the -P flag when built with PCRE2 support. This should be your default grep replacement.
# Search for a pattern, respecting .gitignore
rg "parseJSON" --type ts
# Search with context
rg "TODO|FIXME|HACK" -C 2
# Search and replace (preview)
rg "oldFunction" --replace "newFunction" --passthru
ast-grep: Structural search for your terminal. Like Sourcegraph's structural search but local.
# Find all console.log calls
ast-grep --pattern 'console.log($$$)' --lang ts
# Find useState hooks
ast-grep --pattern 'const [$A, $B] = useState($C)' --lang tsx
GitHub code search: GitHub's built-in code search has improved significantly. It supports regex, language filters, and path filters across all repositories you have access to. For quick searches when you do not have Sourcegraph, it is surprisingly capable.
The tools in this space range from "five-minute setup" to "dedicated infrastructure team." Pick the simplest tool that solves your actual problem, and upgrade when the pain of not having better search outweighs the cost of running it.