When an AI agent navigates a codebase, the real cost is not just “can it find the answer?” but how many calls, how many tokens, and how much noise it has to process before it gets there.
That is the lens we used for these benchmarks.
We tested CodeSift across real TypeScript codebases — a CLI tool (382 files), an i18n platform (1,200+ files), and a full-stack app (4,127 files) — and compared each tool against the closest practical workflow an agent would use without CodeSift. In some cases that baseline is a direct shell equivalent such as rg, find, or reading a file. In other cases, especially for AST-aware, graph-aware, or LSP-backed tools, the baseline is not a single command but a multi-step agent flow built from grep, file reads, and git commands.
That distinction matters.
Some CodeSift tools are straightforward optimizations over raw shell output. Others provide capabilities that native shell tooling simply does not expose directly. In those cases, the benchmark should be read as a comparison of practical agent workflows, not as a claim that raw grep is “wrong.”
## What we measured
For each tool, we defined the closest realistic native agent workflow. A few representative mappings:
- `search_text` → native: `rg`
- `get_file_outline` → native: read the full file
- `search_symbols` → native: regex-based grep for likely definitions with context
- `assemble_context` → native: grep for relevant files, then read several of them
- `find_dead_code` → native: export names, then grep per symbol (21 calls)
- `trace_route` → native: grep route strings, inspect handlers, follow service calls
- `scan_secrets` → native: multiple grep passes with secret-like patterns
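To make one baseline concrete, here is a minimal sketch of the `find_dead_code` native workflow: one ripgrep call to harvest export names, then one grep per symbol. It shells out to `rg` from TypeScript; the paths, patterns, and the single-file heuristic are illustrative assumptions, not the exact benchmark harness.

```ts
// Hypothetical reconstruction of the "native" dead-code baseline: one ripgrep
// call to harvest export names, then one grep per symbol.
import { execSync } from "node:child_process";

function rg(args: string): string {
  // ripgrep exits non-zero when nothing matches; treat that as empty output.
  try {
    return execSync(`rg ${args}`, { encoding: "utf8" });
  } catch {
    return "";
  }
}

// Call 1: collect exported symbol names via a capture-group replacement.
const exportPattern = String.raw`export (const|function|class|interface|type) (\w+)`;
const symbols = [
  ...new Set(
    rg(`--no-filename -o -r '$2' '${exportPattern}' src/`)
      .split("\n")
      .filter(Boolean),
  ),
];

// Calls 2..N: one word-boundary grep per symbol. A symbol that appears in
// only one file (its definition site) is a dead-code candidate. This is a
// rough heuristic, and every call's full output lands in the agent's context.
const deadCandidates = symbols.filter(
  (sym) => rg(`-w -c '${sym}' src/`).trim().split("\n").length <= 1,
);
console.log(deadCandidates);
```

That per-symbol loop is where the baseline's 21 calls come from; `find_dead_code` collapses the whole thing into a single call.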
## Single-tool results
| Tool | Native Tokens | CodeSift Tokens | Reduction |
|---|---|---|---|
| `search_text` | ~16,000 | ~5,700 | −65% |
| `search_symbols` | ~57,000 | ~5,700 | −90% |
| `get_file_outline` | ~2,300 | ~420 | −82% |
| `search_patterns` | ~21,000 | ~2,500 | −88% |
| `codebase_retrieval` | ~40,000 | ~9,200 | −77% |
| `get_symbol` | ~40,000 | ~3,600 | −91% |
| `assemble_context` | ~93,000 | ~12,600 | −86% |
| `find_dead_code` | ~29,600 | ~5,400 | −82% |
| `get_knowledge_map` | ~43,700 | ~4,400 | −90% |
| `analyze_complexity` | ~25,100 | ~660 | −97% |
| `find_clones` | ~39,300 | ~3,600 | −91% |
| `trace_route` | ~35,000 | ~61 | −99% |
| `scan_secrets` | ~1,637,000 | ~11,500 | −99% |
| `cross_repo_search` | ~16,900 | ~800 | −95% |
## Combo flow results
We analyzed 188 real agent sessions, extracted the 13 most common tool sequences via n-gram analysis, and benchmarked each. Across 603 runs:
- Native total tokens: 4,584,153
- CodeSift total tokens: 1,860,130
- Aggregate reduction: −59%
- Win rate: 447/603 runs (74%)
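The n-gram step above is simple enough to sketch. This version assumes each session has already been reduced to an ordered list of tool names; that input shape is an assumption about the logs, not their actual format.

```ts
// Count every length-n window of tool calls across sessions and return the
// most frequent sequences. A session here is just an ordered list of tool names.
type Session = string[]; // e.g. ["search_text", "get_symbol", "assemble_context"]

function topNGrams(sessions: Session[], n: number, top: number): [string, number][] {
  const counts = new Map<string, number>();
  for (const tools of sessions) {
    for (let i = 0; i + n <= tools.length; i++) {
      const gram = tools.slice(i, i + n).join(" -> ");
      counts.set(gram, (counts.get(gram) ?? 0) + 1);
    }
  }
  return [...counts.entries()]
    .sort((a, b) => b[1] - a[1])
    .slice(0, top);
}

// e.g. the 13 most common 3-step sequences: topNGrams(sessions, 3, 13)
```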
## Tools with no native equivalent
These CodeSift tools solve problems that grep fundamentally cannot:
| Tool | What it does |
|---|---|
| `detect_communities` | Louvain graph clustering on import topology |
| `frequency_analysis` | AST shape clustering (structural repetition) |
| `search_conversations` | Semantic search over past AI sessions |
| `check_boundaries` | Architecture rule enforcement |
| `classify_roles` | Hub/Bridge/Leaf/Sink classification |
| `go_to_definition` | LSP-precise definition lookup |
| `get_type_info` | Hover-based type information |
| `rename_symbol` | Cross-file, type-safe rename |
| `ast_query` | Tree-sitter structural search |
| `semantic_search` | Embedding-based code search by meaning |
| `get_call_hierarchy` | LSP call hierarchy: incoming + outgoing |
| `find_circular_deps` | Import cycle detection via DFS |
| `find_unused_imports` | Dead import detection per file |
| `review_diff` | 9 parallel static checks on a git diff |
These aren’t optimizations. They’re new capabilities.
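As one illustration of the work behind these rows, here is a minimal sketch of the technique `find_circular_deps` names: import cycle detection via depth-first search. The `ImportGraph` shape is an assumed input, not CodeSift's internal representation.

```ts
// Three-color DFS over an import graph. A "gray" node is on the current DFS
// path, so reaching one again means we found a back edge, i.e. a cycle.
type ImportGraph = Map<string, string[]>; // file -> files it imports

function findCycles(graph: ImportGraph): string[][] {
  const WHITE = 0, GRAY = 1, BLACK = 2;
  const color = new Map<string, number>();
  const stack: string[] = [];
  const cycles: string[][] = [];

  function dfs(file: string): void {
    color.set(file, GRAY);
    stack.push(file);
    for (const dep of graph.get(file) ?? []) {
      const c = color.get(dep) ?? WHITE;
      if (c === GRAY) {
        // Back edge: the path from dep to the top of the stack is a cycle.
        // (Reported once per back edge, so overlapping cycles can repeat.)
        cycles.push(stack.slice(stack.indexOf(dep)));
      } else if (c === WHITE) {
        dfs(dep);
      }
    }
    stack.pop();
    color.set(file, BLACK);
  }

  for (const file of graph.keys()) {
    if ((color.get(file) ?? WHITE) === WHITE) dfs(file);
  }
  return cycles;
}
```

There is no single shell command that does this; the native equivalent would be an agent reading import statements file by file and reconstructing the graph in its context window.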
## What the numbers mean
These benchmarks primarily measure agent-facing output tokens and the number of calls required per workflow. They do not, on their own, show that the two workflows return semantically identical results. For graph, AST, and LSP-backed tools, CodeSift is doing richer work than the native baseline. In those cases, the benchmark should be read as asking “how expensive is the nearest native workflow?”, not “are these two tools identical under every edge case?”
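The measurement itself needs very little machinery. The sketch below is a hypothetical harness, not the actual benchmark code, and it uses a crude chars/4 estimate in place of a real tokenizer; swap in your model's tokenizer for accurate counts.

```ts
// Hypothetical measurement harness: run each step of a workflow and sum the
// agent-facing output. chars/4 is a rough stand-in for a real tokenizer.
interface Step {
  name: string;
  run: () => string; // returns the text the agent would see
}

function scoreWorkflow(steps: Step[]): { calls: number; tokens: number } {
  let tokens = 0;
  for (const step of steps) {
    const output = step.run();
    tokens += Math.ceil(output.length / 4); // crude token estimate
  }
  return { calls: steps.length, tokens };
}
```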
All benchmark data was collected on 2026-03-30. See the full 64-tool reference.