Name: CodeSift
Author: CodeSift

Keyword search answers one question: “does this string appear in the code?” That’s powerful. But it misses an entire class of query.

“How does authentication work?” has no single keyword. “What’s the caching strategy?” could match a dozen implementations. “Why did we add rate limiting here?” can’t be answered by any text search at all.

CodeSift’s semantic search answers these queries using embeddings.

Three embedding providers

Env Variable	Provider	Model	Notes
`CODESIFT_VOYAGE_API_KEY`	Voyage AI	`voyage-code-3`	Best quality for code
`CODESIFT_OPENAI_API_KEY`	OpenAI	`text-embedding-3-small`	~$0.02/1M tokens
`CODESIFT_OLLAMA_URL`	Ollama (local)	`nomic-embed-text`	Free, runs locally
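
As a rough illustration of how the table above maps to configuration, the sketch below picks a provider from whichever variable is set. The `pick_provider` helper and its table-order priority are assumptions made for this example; the article does not say how CodeSift resolves the case where several variables are set at once.

```python
import os

# Env variable -> (provider, model), copied from the table above.
PROVIDERS = {
    "CODESIFT_VOYAGE_API_KEY": ("Voyage AI", "voyage-code-3"),
    "CODESIFT_OPENAI_API_KEY": ("OpenAI", "text-embedding-3-small"),
    "CODESIFT_OLLAMA_URL": ("Ollama (local)", "nomic-embed-text"),
}


def pick_provider(env=None):
    """Return (provider, model) for the first configured variable, or None.

    Assumption: table order doubles as priority order. This is a
    hypothetical helper, not CodeSift's actual resolution logic.
    """
    env = os.environ if env is None else env
    for var, choice in PROVIDERS.items():
        if env.get(var):  # skips unset and empty values
            return choice
    return None
```

Passing a plain dict instead of `os.environ` makes the helper easy to try out, for example `pick_provider({"CODESIFT_OLLAMA_URL": "http://localhost:11434"})`.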

Three search modes

Semantic — pure embedding similarity. Best for concept queries.

{ "type": "semantic", "query": "error handling and retry logic", "top_k": 10 }
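
To make "embedding similarity" concrete, here is a minimal sketch of ranking code chunks against a query vector. It assumes cosine similarity, a common choice that the article does not confirm for CodeSift, and uses hand-written three-dimensional vectors in place of real embeddings. The file names and the `semantic_search` function are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def semantic_search(query_vec, chunks, top_k=10):
    """Return the ids of the top_k chunks most similar to query_vec.

    chunks maps a chunk id to its embedding vector.
    """
    scored = [(cosine(query_vec, vec), cid) for cid, vec in chunks.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [cid for _, cid in scored[:top_k]]


# Toy vectors standing in for real embeddings (illustrative only).
chunks = {
    "retry.ts": [0.9, 0.1, 0.0],
    "logger.ts": [0.1, 0.9, 0.0],
    "http_client.ts": [0.7, 0.3, 0.1],
}
top_two = semantic_search([1.0, 0.0, 0.0], chunks, top_k=2)
```

With these toy vectors, `top_two` is `["retry.ts", "http_client.ts"]`: the chunks pointing in roughly the same direction as the query rank first, whether or not they share any keywords with it.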

Hybrid — semantic + BM25 merged via Reciprocal Rank Fusion (RRF, k=60). Best for most real queries.

{ "type": "hybrid", "query": "caching strategy", "top_k": 10 }
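
Reciprocal Rank Fusion can be sketched in a few lines. In the standard formulation, each document scores the sum of 1 / (k + rank) over every ranked list it appears in, with ranks starting at 1. The sketch below uses k=60 as stated above; the `rrf_fuse` name, the alphabetical tie-break, and the sample file lists are assumptions for illustration, not CodeSift's implementation.

```python
def rrf_fuse(rankings, k=60, top_k=10):
    """Merge several ranked lists of ids using Reciprocal Rank Fusion.

    rankings: list of lists, each ordered best-first.
    Ties are broken alphabetically by id (an arbitrary choice for this sketch).
    """
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda d: (-scores[d], d))[:top_k]


# Hypothetical result lists from the two retrievers.
semantic = ["cache.ts", "redis_client.ts", "config.ts"]
bm25 = ["redis_client.ts", "lru.ts", "cache.ts"]
fused = rrf_fuse([semantic, bm25])
```

Here `fused` is `["redis_client.ts", "cache.ts", "lru.ts", "config.ts"]`. Files that both retrievers return rise to the top, and because only ranks are used, the raw BM25 and embedding scores never need to be put on a common scale.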

Conversation — search past AI session history by concept.

{ "type": "conversation", "query": "why we chose Redis over Postgres cache" }

Benchmark results

On a 4,127-file TypeScript codebase, 10 conceptual questions rated on a 1-10 scale:

CodeSift: 7.8/10 average quality
Native (grep-based): 6.5/10 average quality
Improvement: +20% relative ((7.8 − 6.5) / 6.5)

When to use semantic vs keyword

Query Type	Best Mode
“Find function named X”	`search_symbols` (keyword)
“Find all TODO comments”	`search_text` (keyword)
“How does authentication work?”	`assemble_context` + semantic
“What’s our caching strategy?”	`codebase_retrieval` hybrid
“Why did we add this middleware?”	`search_conversations`
“Find code similar to this pattern”	`codebase_retrieval` semantic
