Vibe coded and written by Gemini and Opus with design and architectural decisions by Oskar Austegard
## The Problem with Code Search
When working with unfamiliar codebases, there's a fundamental tension: you need to find relevant code quickly, but you also need enough context to understand what you've found. Traditional approaches fall into two camps:
- Structure-first tools (like LSP, ctags, or AST indexes) give you perfect context but require expensive pre-processing and struggle with partial queries
- Text-search tools (like grep, ripgrep) are blazingly fast but return fragmented results that break at arbitrary line boundaries
This tension becomes critical for agentic coding, where an AI needs to rapidly explore codebases it's never seen before.
## What is Tree-sitter?
Before diving into the solution, let's clarify the key technology: tree-sitter is a parser generator that builds concrete syntax trees (CSTs) for source code. Unlike traditional parsers, it's designed to be:
- Fast: Parses files in milliseconds, even for large codebases
- Incremental: Updates the tree as code changes instead of reparsing everything
- Error-tolerant: Works on incomplete or syntactically invalid code
- Language-agnostic: Supports 40+ languages with the same API
Most importantly, tree-sitter understands code structure. It knows the difference between:
- A function definition vs. a function call
- A class declaration vs. a string containing "class"
- A docstring vs. a random comment
This structural awareness is what grep fundamentally lacks.
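To make that concrete, here's a minimal sketch using the py-tree-sitter bindings (`pip install tree-sitter tree-sitter-python`); the binding API has shifted slightly across versions, so treat the setup as illustrative:

```python
import tree_sitter_python as tspython
from tree_sitter import Language, Parser

# Each language ships its grammar as a separate package
parser = Parser(Language(tspython.language()))
tree = parser.parse(b'def greet(name):\n    return f"hi {name}"\n')

# The tree knows this is a definition, not a call or a string
node = tree.root_node.children[0]
print(node.type)                              # function_definition
print(node.child_by_field_name("name").text)  # b'greet'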
## The Journey: From Mapping to Exploring
### mapping-codebases: The Structure-First Approach
The original mapping-codebases skill took the structure-first path. It would parse entire directory trees using tree-sitter, extract all exports/imports, and generate comprehensive _MAP.md files showing the skeleton of each module:
```markdown
## classes/User.py
### Classes
- `User` (line 15)
  - `__init__(self, name, email)` (line 16)
  - `validate(self)` (line 23)
  - `save(self)` (line 31)
```
Strengths:
- Complete structural overview
- Accurate function signatures
- Hierarchical organization
Limitations:
- Requires upfront processing of entire codebase
- Map files become stale as code changes
- Can't answer "where is X used?" without reading implementations
- Token-heavy for large projects
### GrepRAG: The Speed Revelation
Then came the GrepRAG paper, which benchmarked various code retrieval methods for LLM context. Their key finding: ripgrep-based retrieval was 17× faster than graph-based methods (0.40s vs 7s) while maintaining comparable accuracy.
However, they also identified two critical failures of pure text search:
- Context fragmentation: grep matches arbitrary lines, often cutting off function signatures or return statements. You might get line 45 of a function without knowing what parameters it accepts.
- Keyword ambiguity: searching for `config` floods results with log messages, comments, and string literals alongside actual `Config` class definitions.
The paper attempted to fix these with statistical heuristics (line-number clustering, identifier weighting), but these are fundamentally band-aids on a text-only approach.
## The Hybrid Solution: Speed + Structure + Efficiency
The new exploring-codebases skill combines three key insights:
- Use ripgrep as the dragnet (Phase 1: speed)
- Use tree-sitter as the scalpel (Phase 2: structure)
- Default to signatures, expand on demand (Phase 3: efficiency)
### Phase 1: The Dragnet (ripgrep)
```bash
rg --json "class Session" /path/to/repo
```
Goal: Cast a wide net quickly. Find every file and line where the search term appears.
Speed: ~0.02s for large repositories (as GrepRAG demonstrated)
Output: List of (file, line_number) tuples
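A sketch of how Phase 1 might consume that output; ripgrep's `--json` mode emits one JSON event per line, and only `match` events carry hits (the helper name is illustrative, not the script's actual internals):

```python
import json
import subprocess

def rg_matches(pattern: str, path: str):
    """Yield (file, line_number) pairs from ripgrep's --json output."""
    proc = subprocess.run(
        ["rg", "--json", pattern, path],
        capture_output=True, text=True,
    )
    for line in proc.stdout.splitlines():
        event = json.loads(line)
        if event.get("type") == "match":
            data = event["data"]
            yield (data["path"]["text"], data["line_number"])
```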
### Phase 2: The Scalpel (tree-sitter)
For each ripgrep match, use tree-sitter to:
- Parse only that specific file (lazy, on-demand)
- Find the smallest enclosing AST node (function/class definition)
- Extract the complete, syntactically valid code block
- Deduplicate if multiple matches fall within the same node
Example: If ripgrep finds Session on line 356, tree-sitter identifies it's inside a class definition spanning lines 356-816.
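A sketch of the Phase 2 lookup and deduplication, assuming the file's tree has already been parsed (the node-type names are from the Python grammar; other grammars differ):

```python
DEFINITION_TYPES = {"function_definition", "class_definition"}

def enclosing_definition(tree, line_number: int):
    """Walk up from the match to the smallest enclosing def/class node."""
    point = (line_number - 1, 0)          # tree-sitter points are 0-based
    node = tree.root_node.descendant_for_point_range(point, point)
    while node is not None and node.type not in DEFINITION_TYPES:
        node = node.parent
    return node  # None if the match sits at module level

def unique_definitions(tree, line_numbers):
    """Multiple hits inside the same node collapse to one result."""
    seen, results = set(), []
    for ln in line_numbers:
        node = enclosing_definition(tree, ln)
        if node is not None and node.id not in seen:
            seen.add(node.id)
            results.append(node)
    return results
```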
### Phase 3: Progressive Disclosure (The Token Multiplier)
Here's the critical enhancement over v0.1: don't dump 460 lines of implementation when 20 lines of signature will do.
Default output (signatures only):
```python
class Session(SessionRedirectMixin):
    """A Requests session.

    Provides cookie persistence, connection-pooling, and configuration.
    """
    ...
```
Token cost: ~50 tokens (vs 4000 for full implementation)
When you need details, expand:
```bash
search.py "class Session" /path/to/repo --expand-full
```
Returns the complete 460-line implementation.
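The expansion switch itself can be this simple; a sketch in which `extract_signature` stands in for the signature extractor sketched under Implementation Notes below:

```python
def render(node, source_bytes: bytes, expand_full: bool) -> str:
    """Signature by default; full node text only on request."""
    if expand_full:
        return source_bytes[node.start_byte:node.end_byte].decode()
    return extract_signature(node, source_bytes)
```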
## Why This Matters
### 1. Context Fragmentation → Fixed
Problem (grep alone): Searching for `get` returns line 595:

```python
return self.request("GET", url, **kwargs)
```
No parameters, no docstring, no context.
Solution (hybrid with signatures):
```python
def get(self, url, **kwargs):
    r"""Sends a GET request. Returns :class:`Response` object.

    :param url: URL for the new :class:`Request` object.
    :param \*\*kwargs: Optional arguments that ``request`` takes.
    :rtype: requests.Response
    """
    ...
```
You get the complete API surface without the implementation noise.
### 2. Keyword Ambiguity → Fixed
Tree-sitter knows the difference between:
- `class Config:` (definition - keep)
- `print("config loaded")` (string literal - skip)
- `# TODO: add config` (comment - skip)
This filtering is deterministic, not statistical. You don't need BM25 re-ranking to boost likely definitions; the AST tells you what's a definition.
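In code, that filter is just an ancestor-type check; a minimal sketch using Python-grammar node names:

```python
NOISE_TYPES = {"string", "comment"}

def is_structural_hit(node) -> bool:
    """Keep a match only if no ancestor is a string or comment node."""
    while node is not None:
        if node.type in NOISE_TYPES:
            return False
        node = node.parent
    return True
```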
### 3. Token Efficiency → 10-20× Improvement
Progressive disclosure exploits a key insight: most code exploration is hierarchical.
Typical workflow:
- "What classes are in this module?" → 10 tokens per class signature
- "What methods does Session have?" → 5 tokens per method signature
- "How is
send()implemented?" → 200 tokens for full method
Old approach (v0.1):
- Query: `"class Session"`
- Result: 4,000 tokens of full implementation
- Mental overhead: scan 460 lines to find method signatures
New approach (v0.2):
- Query: `"class Session"`
- Result: 50 tokens (signature with docstring)
- Follow-up: `"def send" --expand-full` → 200 tokens for just that method
- Total: 250 tokens vs 4,000 tokens (16× more efficient)
### 4. Real-World Example

Testing on the `requests` library:
```
# Find what Session offers
$ search.py "class Session" requests/

Found 2 matches for 'class Session':

### requests/sessions.py

class Session(SessionRedirectMixin):
    """A Requests session.

    Provides cookie persistence, connection-pooling, and configuration.
    """
    ...
```
Tokens used: ~80
Now you know Session exists and what it does. Need to know its methods?
```bash
# Scan for methods (still signature-only)
$ search.py "def " requests/sessions.py | grep "class.*Session" -A 20
```
Or expand a specific method:
$ search.py "def request" requests/sessions.py --expand-full
Progressive approach: ~300 tokens total to understand the class and one method
Old approach: 4,000 tokens dumped upfront, most unused
## Performance Characteristics
The hybrid approach exploits code's sparse structure:
- Index-free: No preprocessing required. Works on any repo immediately.
- Near-instant search: Ripgrep finds matches in ~0.02s even on large repositories
- Selective parsing: Tree-sitter only parses the 5-10 files ripgrep found, not all 10,000 files in the repo
- Selective expansion: Signatures by default, full bodies on demand
- Token-efficient: 10-20× reduction in token usage for typical exploration
Comparison table:
| Approach | Preprocessing | Search Time | Tokens per Match | Updates |
|---|---|---|---|---|
| LSP/ctags | Minutes | Instant | Variable | Manual rebuild |
| mapping-codebases | Minutes | Instant | 50-100 | Stale maps |
| grep/ripgrep | None | 0.02s | 5-10 (fragmented) | Real-time |
| GrepRAG | None | 0.40s | 50-200 (heuristic) | Real-time |
| exploring-codebases v0.1 | None | 0.05s | 500-5000 (full) | Real-time |
| exploring-codebases v0.2 | None | 0.05s | 50 (sig) / 500+ (full) | Real-time |
## Implementation Notes
The enhancement adds ~100 lines to the original 280-line script:
```python
def _extract_signature(self, node, source_bytes, language):
    """Extract just the declaration + docstring, exclude body."""
    if language == 'python':
        return self._extract_python_signature(node, source_bytes)
    # ... other languages
```
For Python:
- Extracts everything before the function/class `block` node
- Includes docstrings (first `string` child of the block)
- Replaces the body with `...`
Example CST traversal:
```
class_definition
├── class (keyword)
├── identifier (name)
├── argument_list (bases)
├── : (colon)
└── block (body) → extract first string child (docstring), then stop
    ├── string (docstring) ← include this
    ├── function_definition ← skip
    └── ... ← skip
```
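Putting it together, a simplified sketch of the Python branch (illustrative, not the script's exact internals; decorators are ignored here):

```python
def extract_signature(node, source_bytes: bytes) -> str:
    """Return the def/class header plus docstring, with the body elided."""
    block = node.child_by_field_name("body")
    if block is None:                      # no body: return node verbatim
        return source_bytes[node.start_byte:node.end_byte].decode()
    # Header: everything up to the block (name, params/bases, colon, indent)
    header = source_bytes[node.start_byte:block.start_byte].decode()
    indent = " " * block.start_point[1]    # column of the body's first line
    first = block.children[0] if block.children else None
    if (first is not None and first.type == "expression_statement"
            and first.children and first.children[0].type == "string"):
        # First statement is a docstring: include it, then stop
        doc = source_bytes[first.start_byte:first.end_byte].decode()
        return header + doc + "\n" + indent + "..."
    return header + "..."
```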
## Usage Patterns
Quick exploration (default):
```bash
# What classes exist?
search.py "class " src/

# What methods does User have?
search.py "def " src/models/user.py

# Find all validation methods
search.py "def validate" src/
```
Deep dive (selective expansion):
```bash
# Get full implementation of a specific method
search.py "def validate_email" src/ --expand-full

# Get full class implementation
search.py "class User" src/ --expand-full
```
Language-specific:
```bash
# Only Python files
search.py "class Config" . --glob "*.py"

# Only TypeScript
search.py "interface User" . --glob "*.ts"
```
## When to Use What
Use mapping-codebases when:
- You need a comprehensive structural overview
- Working on a single project long-term
- Want persistent documentation of module organization
- Need import/export graph
Use exploring-codebases when:
- Exploring unfamiliar codebases
- Looking for specific implementations or patterns
- Need immediate answers without preprocessing
- Working across multiple repositories
- Want minimal token overhead
Use grep/ripgrep directly when:
- Searching for literal strings or patterns
- Need regex capabilities
- Don't care about structural boundaries
## Addressing GrepRAG's Concerns
The paper identified these pain points with pure grep:
| GrepRAG Concern | exploring-codebases Solution |
|---|---|
| "Disrupts logical flow" (fragments code) | Tree-sitter ensures complete function/class nodes |
| "Usage before definition" (line ordering) | AST traversal maintains semantic relationships |
| "Keyword noise" (comments, strings) | Deterministic AST filtering, not statistical |
| "Token waste" (returning everything) | Progressive disclosure: signatures first |
| "Expensive re-ranking" (BM25, TF-IDF) | Not needed - structural filtering is deterministic |
GrepRAG tried to fix grep's problems with more grep (line clustering, statistical weighting). We fix them by acknowledging grep's limits and using the right tool for structural awareness.
## Conclusion
The GrepRAG paper proved text search is fast enough for real-time code retrieval. But speed without structure wastes tokens, and structure without progressive disclosure wastes more.
By combining:
- Ripgrep as the dragnet (0.02s search)
- Tree-sitter as the scalpel (complete, valid nodes)
- Progressive disclosure as the efficiency multiplier (signatures first)
We get a new paradigm for code exploration: lazy, selective, precise, and token-conscious.
This is particularly valuable for AI agents that need to rapidly orient themselves in unfamiliar codebases. Instead of "here's line 45" or "here's 10,000 lines," they get "here's the signature; ask if you need the body" - exactly the context needed to reason efficiently.
The future of code search is hybrid, structural, and progressive.
Code: https://github.com/oaustegard/claude-skills/tree/main/exploring-codebases
Release zip: https://github.com/oaustegard/claude-skills/releases?q=exploring-codebases&expanded=true