Keyword search answers one question: “does this string appear in the code?” That’s powerful. But it misses an entire class of query.
“How does authentication work?” has no single keyword. “What’s the caching strategy?” could match a dozen implementations. “Why did we add rate limiting here?” can’t be answered by any text search at all.
CodeSift’s semantic search answers these queries using embeddings.
## Three embedding providers
| Env Variable | Provider | Model | Notes |
|---|---|---|---|
| `CODESIFT_VOYAGE_API_KEY` | Voyage AI | `voyage-code-3` | Best quality for code |
| `CODESIFT_OPENAI_API_KEY` | OpenAI | `text-embedding-3-small` | ~$0.02/1M tokens |
| `CODESIFT_OLLAMA_URL` | Ollama (local) | `nomic-embed-text` | Free, runs locally |
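Provider selection can be driven entirely by which environment variable is set. A minimal sketch in Python, assuming the fallback order follows the table's quality ranking (Voyage, then OpenAI, then Ollama) — the precedence order and return shape here are illustrative assumptions, not CodeSift's documented behavior:

```python
import os

# Preference order is an assumption: best code quality first, per the
# table above. Env variable names are the ones CodeSift documents.
PROVIDERS = [
    ("CODESIFT_VOYAGE_API_KEY", "voyage", "voyage-code-3"),
    ("CODESIFT_OPENAI_API_KEY", "openai", "text-embedding-3-small"),
    ("CODESIFT_OLLAMA_URL", "ollama", "nomic-embed-text"),
]

def pick_provider(env=os.environ):
    """Return (provider, model) for the first configured provider."""
    for var, provider, model in PROVIDERS:
        if env.get(var):
            return provider, model
    raise RuntimeError("no embedding provider configured")
```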
## Three search modes
**Semantic** — pure embedding similarity. Best for concept queries.

```json
{ "type": "semantic", "query": "error handling and retry logic", "top_k": 10 }
```
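At its core, "pure embedding similarity" means ranking chunks by cosine similarity between the query vector and each chunk vector. A minimal Python sketch (the toy vectors stand in for real embedding output; this is the standard technique, not CodeSift's exact implementation):

```python
import math

def cosine(a, b):
    # Cosine similarity: dot product normalized by vector magnitudes.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def semantic_search(query_vec, chunks, top_k=10):
    # chunks: list of (name, embedding) pairs; return top_k names by similarity.
    scored = [(cosine(query_vec, vec), name) for name, vec in chunks]
    return [name for _, name in sorted(scored, reverse=True)[:top_k]]
```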
**Hybrid** — semantic + BM25 merged via Reciprocal Rank Fusion (RRF, k=60). Best for most real queries.

```json
{ "type": "hybrid", "query": "caching strategy", "top_k": 10 }
```
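RRF merges the two ranked lists by summing 1/(k + rank) for each document across the lists it appears in, so a document ranked well by either signal surfaces without any score normalization. A minimal sketch of standard RRF with the k=60 noted above (not necessarily CodeSift's exact implementation):

```python
def rrf(ranked_lists, k=60):
    # Each ranked list contributes 1/(k + rank) per document, 1-based ranks.
    scores = {}
    for ranking in ranked_lists:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=scores.get, reverse=True)
```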
**Conversation** — search past AI session history by concept.

```json
{ "type": "conversation", "query": "why we chose Redis over Postgres cache" }
```
## Benchmark results

On a 4,127-file TypeScript codebase, we asked 10 conceptual questions and rated answer quality on a 1-10 scale:
- CodeSift: 7.8/10 average quality
- Native (grep-based): 6.5/10 average quality
- Improvement: +20%
## When to use semantic vs. keyword
| Query Type | Best Mode |
|---|---|
| “Find function named X” | `search_symbols` (keyword) |
| “Find all TODO comments” | `search_text` (keyword) |
| “How does authentication work?” | `assemble_context` + semantic |
| “What’s our caching strategy?” | `codebase_retrieval` hybrid |
| “Why did we add this middleware?” | `search_conversations` |
| “Find code similar to this pattern” | `codebase_retrieval` semantic |