Benchmarking individual tools is useful. But AI agents don’t use tools individually — they use them in sequences. The real question isn’t “how efficient is `search_text`?” It’s “how efficient is the workflow?”
## Methodology
We extracted tool call sequences from 188 real Claude Code agent sessions using n-gram analysis on usage logs. From hundreds of unique 2-step, 3-step, and 4-step sequences, we selected the 13 most frequent — each occurring at least 12 times across multiple sessions.
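As a sketch of that extraction step, counting n-grams amounts to sliding a window over each session's tool-call list. The `Session` shape and field names here are assumptions for illustration, not the actual log schema:

```typescript
// Sketch: count tool-call n-grams across sessions.
// The Session shape is an assumption, not the real log schema.
type Session = { toolCalls: string[] }; // e.g. ["get_file_tree", "search_text"]

function countNgrams(sessions: Session[], n: number): Map<string, number> {
  const counts = new Map<string, number>();
  for (const { toolCalls } of sessions) {
    for (let i = 0; i + n <= toolCalls.length; i++) {
      const key = toolCalls.slice(i, i + n).join("→");
      counts.set(key, (counts.get(key) ?? 0) + 1);
    }
  }
  return counts;
}

// Run for n = 2, 3, and 4; keep the sequences seen at least 12 times.
```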
Each sequence was then benchmarked:
- 772 total runs across 33 real-world TypeScript/React/NestJS codebases (50 to 4,100+ files each)
- Both approaches used the same queries and were evaluated on token consumption
- Native baseline: the equivalent Bash workflow (grep, find, read) that an agent without CodeSift would use
- Win = CodeSift consumed fewer tokens than native for the same task (scoring sketched below)
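A minimal sketch of that scoring, assuming each run logged its two token counts; the `Run` shape is hypothetical, not the benchmark harness's actual type:

```typescript
// Hypothetical Run shape; only the two token counts matter for scoring.
type Run = { nativeTokens: number; codesiftTokens: number };

function summarize(runs: Run[]) {
  const native = runs.reduce((sum, r) => sum + r.nativeTokens, 0);
  const codesift = runs.reduce((sum, r) => sum + r.codesiftTokens, 0);
  const wins = runs.filter((r) => r.codesiftTokens < r.nativeTokens).length;
  return {
    reduction: (codesift - native) / native, // e.g. -0.61 → −61%
    winRate: wins / runs.length,             // e.g. 542 / 772 → 70%
  };
}
```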
## Results summary
| Metric | Value |
|---|---|
| Total native tokens | 5,130,240 |
| Total CodeSift tokens | 1,994,825 |
| Aggregate reduction | −61% |
| Win rate | 542 / 772 (70%) |
| Sequences tested | 13 |
| Codebases used | 33 |
## All 13 sequences
### Tier 1: High reduction (−80% to −86%)
These sequences combine structured tools (symbol search, pattern matching) with text search. The structured tool identifies what’s relevant, text search finds how it’s used — without reading full files.
| Sequence | Description | Runs | Reduction | Win rate |
|---|---|---|---|---|
| ss→st | Symbol discovery → usage search | 65 | −86% | 63% |
| pat→st→pat→st | Extended pattern investigation | 37 | −86% | 68% |
| pat→st→pat | Pattern-first investigation loop | 39 | −85% | 77% |
| st→ss | Text orient → symbol narrow | 58 | −84% | 67% |
| st→pat→st→pat | Text-first pattern investigation | 35 | −84% | 66% |
| st→ss→st | Text → symbol → text refinement | 27 | −81% | 67% |
| st→pat→st | Pattern sandwich — text bookended by patterns | 40 | −80% | 63% |
Why these win big: Symbol search (`search_symbols`) with `detail_level="compact"` returns ~15 tokens per result. Pattern search (`search_patterns`) uses AST matching, returning only structural matches. Both avoid dumping raw file content into the context window.
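A back-of-envelope comparison makes the gap concrete. The hit count and per-line grep cost below are assumptions; only the ~15 tokens per compact result comes from the measurements above:

```typescript
// Assumed: 20 hits; grep emits each full matching source line (~60 tokens,
// varies by codebase). The 15-token compact figure is from the post.
const hits = 20;
const compactTokens = hits * 15; // 300 tokens of compact symbol results
const grepTokens = hits * 60;    // ~1,200 tokens of raw matching lines
console.log(compactTokens, grepTokens); // roughly 4x apart, before any file reads
```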
### Tier 2: Good reduction (−68% to −76%)
File tree + search combinations. The agent maps the codebase structure first, then searches within it.
| Sequence | Description | Runs | Reduction | Win rate |
|---|---|---|---|---|
| st→tree→st | Search → check structure → search again | 27 | −76% | 89% |
| tree→st | Map the codebase → search within it | 50 | −68% | 86% |
Why these are the most reliable: st→tree→st has the highest win rate of any sequence at 89%. The file tree tool (`get_file_tree` with `compact=true`) returns a flat path list with symbol counts at roughly one-tenth the output of `find -type f`. The agent learns the project layout cheaply, then targets its search.
### Tier 3: Moderate reduction (−26% to −39%)
Retrieval-heavy sequences. `codebase_retrieval` batches multiple query types (text, semantic, symbols) into one call — powerful, but its batched output is denser than that of individual tools.
| Sequence | Description | Runs | Reduction | Win rate |
|---|---|---|---|---|
| st→cr | Text search → batch follow-up queries | 91 | −39% | 81% |
| st→cr→st | Investigative loop: search, batch, refine | 41 | −39% | 83% |
| cr→st | Batch query first → targeted follow-up | 81 | −32% | 79% |
| cr→st→cr→st | Exploratory investigation (both expensive) | 12 | −26% | 58% |
Why these save less: `codebase_retrieval` returns structured results from multiple query types in one response. That’s already more efficient than 3–5 separate calls, but the response size is larger than a single targeted tool’s. The −26% outlier (cr→st→cr→st) represents truly exploratory investigation where the agent doesn’t know what it’s looking for — both approaches are expensive.
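For reference, the batched-call shape, following Example 4 below; the third query entry is an assumption based on this tier's description, not a call taken from the logs:

```typescript
// Batched retrieval call, shaped after Example 4. The "text" entry is an
// assumed third query type, added to illustrate batching across types.
const queries = [
  { type: "references", symbol_name: "CAN_MANAGE_ORGS" },
  { type: "semantic", query: "organization management permissions" },
  { type: "text", query: "CAN_MANAGE_ORGS" },
];
// One round-trip instead of 3-5 separate calls, at the cost of one denser response.
```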
## Real-world examples
### Example 1: Symbol discovery workflow (ss→st)
An agent investigating a function in an i18n platform:
search_symbols("getLanguageName", detail_level="compact")— finds the function definition in 3 tokenssearch_text("getLanguageName", file_pattern="*.service.ts")— finds all usage sites
Native approach: `grep -rn "function getLanguageName"` + `grep -rn "getLanguageName" --include="*.service.ts"` — returns full matching lines with surrounding context.
| Approach | Tokens |
|---|---|
| Native (grep) | 6,386 |
| CodeSift | 3 |
| Reduction | −99.9% |
### Example 2: File exploration workflow (tree→st)
An agent onboarding to a content management system:
get_file_tree("src/", compact=true)— gets the full directory structure in a flat listsearch_text("article-generat")— finds the article generation logic
Native approach: `find src/ -type f` + `grep -rn "article-generat" src/`
| Approach | Tokens |
|---|---|
| Native (find + grep) | 5,852 |
| CodeSift | 2 |
| Reduction | −99.9% |
### Example 3: Anti-pattern hunt (pat→st→pat)
An agent running a code quality check on a survey platform:
search_patterns("toBeDefined")— finds all weak assertions in testssearch_text(".toBeDefined(")— gets surrounding context for each matchsearch_patterns("toBeDefined")— refined pattern search after context
Native approach: `grep -rn "toBeDefined" --include="*.test.ts"` × 3 with manual filtering
| Approach | Tokens |
|---|---|
| Native (grep) | 928 |
| CodeSift | 0 |
| Reduction | −100% |
(Zero tokens because the pattern search returned empty — no matches in this codebase. CodeSift correctly returns nothing. Native grep still outputs headers, paths, and line numbers even with zero matches.)
### Example 4: Investigative workflow (st→cr)
An agent searching for permission logic in an enterprise API:
search_text("CAN_MANAGE_ORGS")— finds the permission constant usagecodebase_retrieval(queries=[{type:"references", symbol_name:"CAN_MANAGE_ORGS"}, {type:"semantic", query:"organization management permissions"}])— batches follow-up
| Approach | Tokens |
|---|---|
| Native (grep + multiple reads) | 8,911 |
| CodeSift | 63 |
| Reduction | −99.3% |
## The pattern that emerges
The sequences with the highest reduction share a structure: structured tools bookend text searches. The agent uses symbol search or pattern matching to identify what’s relevant, then text search to find how it’s used (sketched after the table below). This avoids the biggest token waste in native workflows: reading full file contents to extract a few lines of information.
| Pattern | Avg reduction | Why it works |
|---|---|---|
| Symbol + text | −83% | Compact symbol lookup (~15 tok/result) replaces grep’s full-line output |
| Pattern + text | −83% | AST matching returns structural hits, not string matches |
| Tree + text | −72% | Flat path list replaces find + wc -l combos |
| Retrieval + text | −34% | Batch queries save round-trips but responses are denser |
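As a sketch of the bookend shape from Example 3, with hypothetical TypeScript stubs standing in for the real tool calls (the actual tools are agent tool calls, not local functions):

```typescript
// Hypothetical stubs standing in for the real agent tool calls.
async function search_patterns(pattern: string): Promise<string[]> {
  return []; // stub: the real tool returns AST-level structural matches
}
async function search_text(query: string): Promise<string[]> {
  return []; // stub: the real tool returns matching lines
}

// The bookend shape: structured call narrows, text search adds context,
// structured call verifies. No full-file reads in between.
async function bookendHunt(): Promise<string[]> {
  await search_patterns("toBeDefined");  // what exists (AST match)
  await search_text(".toBeDefined(");    // how it's used (cheap lines)
  return search_patterns("toBeDefined"); // verify with context in hand
}
```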
## The honest outlier
cr→st→cr→st only achieves −26% with a 58% win rate. This pattern represents exploratory investigation — the agent doesn’t yet know what it’s looking for. CodeSift’s advantage is largest when agents have structured intent. When truly exploring, both approaches are expensive.
In the 26% of runs where native won, the margin was small and the task was exploratory. In the 74% where CodeSift won, margins of 80–99% were common.
All benchmark data collected 2026-03-30 from 188 real agent sessions. Scripts available in the CodeSift repository.