Use this when your RAG system is technically working but still misses the right evidence too often.
The short answer: treat retrieval as a funnel, not one magic embedding query. A strong default is optional query rewriting, then lexical plus dense candidate generation, then reranking before the final prompt.
What you will get in 11 minutes
- A practical retrieval funnel for most production RAG systems
- When BM25, dense retrieval, and rerankers each help
- Which metrics to watch before touching generation prompts
- A simple worksheet for tuning candidate recall vs token cost
Use this when
- Users say “the answer was plausible but ignored the right document”
- Dense retrieval alone is not finding exact names, codes, or rare terms
- Your model sees too many irrelevant chunks
- You want better answer quality without immediately changing the generation model
The 60-second answer
| Retrieval step | Job |
| --- | --- |
| Query rewriting | clarify the user's intent or decompose complex asks |
| BM25 / lexical search | catch exact terms, identifiers, and rare phrases |
| Dense retrieval | catch semantic similarity and paraphrases |
| Reranking | reorder candidates so the best evidence reaches the prompt |
| Generation | answer only after the evidence set is stronger |
If your retrieval stage is weak, changing the generation model usually just makes the wrong answer sound better.
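The table above can be sketched end to end as a small pipeline. Everything below is a toy, self-contained illustration: the scoring functions are trivial stand-ins for BM25, an embedding model, and a cross-encoder reranker, and the corpus is three made-up sentences. Only the shape of the funnel is the point.

```python
DOCS = [
    "Error code E-1042 means the nightly billing export failed.",
    "Billing exports can also fail when storage quotas are exceeded.",
    "Vendor onboarding requires two separate approvals.",
]

def rewrite_query(q):
    return q.strip().lower()          # step 1: light normalization only

def lexical_search(q, k):
    # stand-in for BM25: shared-word count
    terms = set(q.split())
    scores = [len(terms & set(d.lower().split())) for d in DOCS]
    return sorted(range(len(DOCS)), key=lambda i: -scores[i])[:k]

def dense_search(q, k):
    # stand-in for embedding similarity: character-trigram overlap
    grams = lambda s: {s[i:i + 3] for i in range(len(s) - 2)}
    qg = grams(q)
    scores = [len(qg & grams(d.lower())) for d in DOCS]
    return sorted(range(len(DOCS)), key=lambda i: -scores[i])[:k]

def rerank(q, candidates):
    # stand-in for a cross-encoder: re-score the merged candidates
    terms = set(q.split())
    return sorted(candidates, key=lambda i: -len(terms & set(DOCS[i].lower().split())))

def retrieve(query, k_candidates=3, k_final=1):
    q = rewrite_query(query)                                   # step 1
    merged = list(dict.fromkeys(                               # step 2: two channels, deduped
        lexical_search(q, k_candidates) + dense_search(q, k_candidates)))
    return [DOCS[i] for i in rerank(q, merged)[:k_final]]      # step 3: keep the best few

print(retrieve("why did billing export E-1042 fail?"))
```

Only the chunks returned by `retrieve` would reach generation (step 4); the generator itself is out of scope here.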
Why dense retrieval alone is not enough
Dense retrieval is good at semantic similarity, but many real production queries also depend on:
- exact product names
- error codes
- contract terms
- part numbers
- date qualifiers
BM25 or another lexical layer helps recover those cases. Dense retrieval helps with paraphrases and concept matching. Together they are usually stronger than either alone.
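To make the exact-term point concrete, here is a minimal Okapi BM25 scorer (standard `k1`/`b` defaults, naive whitespace tokenizer, no punctuation handling). A rare identifier like a part number gets a high IDF weight, so the document that actually contains it wins decisively:

```python
import math

def bm25_scores(query, docs, k1=1.5, b=0.75):
    tokenized = [d.lower().split() for d in docs]
    N = len(docs)
    avgdl = sum(len(t) for t in tokenized) / N
    # document frequency per term
    df = {}
    for toks in tokenized:
        for term in set(toks):
            df[term] = df.get(term, 0) + 1
    scores = []
    for toks in tokenized:
        s = 0.0
        for term in query.lower().split():
            if term not in df:
                continue
            idf = math.log(1 + (N - df[term] + 0.5) / (df[term] + 0.5))
            tf = toks.count(term)
            s += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(toks) / avgdl))
        scores.append(s)
    return scores

docs = [
    "Reset instructions for part number PN-88231",
    "General reset instructions for all devices",
]
print(bm25_scores("PN-88231 reset", docs))
```

Both documents match "reset", but only the first matches the rare token `PN-88231`, so it scores far higher. An embedding model may or may not preserve that distinction; a lexical layer always does.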
A strong default retrieval funnel
1. Normalize or rewrite the query
Use a light rewrite step when the original query is vague, noisy, or multi-part.
Good uses:
- turning “what changed in enterprise billing last quarter?” into a narrower retrieval query
- separating a compound question into two sub-queries
- expanding synonyms for internal naming
Bad uses:
- rewriting so aggressively that you erase the user's real wording
- using a rewrite when simple filters would work
2. Generate candidates from two channels
Use both:
- lexical search for exact matches
- dense retrieval for semantic matches
Then merge the candidates before reranking.
This is the highest-leverage change for many systems because it improves recall before you pay for a larger prompt.
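The article does not prescribe a merge strategy, but one common choice for combining two ranked lists is Reciprocal Rank Fusion (RRF), which needs no score normalization across channels:

```python
def rrf_merge(ranked_lists, k=60):
    # k=60 is the conventional RRF smoothing constant
    scores = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

lexical = ["d3", "d1", "d7"]   # ranked ids from the lexical channel
dense   = ["d1", "d5", "d3"]   # ranked ids from the dense channel
print(rrf_merge([lexical, dense]))
```

Documents that appear high in both channels (here `d1` and `d3`) float to the top of the merged list, which is exactly what you want going into the reranker.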
3. Rerank before generation
Reranking is the “last filter” between candidate recall and prompt cost.
Instead of sending the top 10 or 20 raw candidates to the model, rerank them and keep only the top few chunks that are actually relevant.
Benefits:
- better prompt precision
- lower token spend
- fewer contradictory chunks
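A sketch of that last filter: score every candidate against the query, then keep only the top few that clear a relevance floor. The `score` callable stands in for a real cross-encoder model call; the toy scorer here is just shared-word count.

```python
def rerank_and_trim(query, candidates, score, k_final=4, min_score=0.0):
    scored = sorted(((score(query, c), c) for c in candidates), reverse=True)
    # keep at most k_final chunks, and drop anything below the floor
    return [c for s, c in scored[:k_final] if s > min_score]

# toy scorer standing in for a cross-encoder
toy = lambda q, c: len(set(q.lower().split()) & set(c.lower().split()))

chunks = ["billing export limits", "holiday schedule", "export error E-1042"]
print(rerank_and_trim("billing export failed", chunks, toy, k_final=2))
```

Note that `min_score` matters as much as `k_final`: a hard top-k with no floor will happily forward irrelevant chunks on queries where nothing good was retrieved.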
What Anthropic's contextual retrieval work changed
Anthropic's Contextual Retrieval research is useful because it frames retrieval as more than chunking and embeddings. Their published results showed that adding context to chunks before embedding and combining that with lexical search and reranking materially reduced retrieval failures.
The practical lesson is not “copy one exact pipeline.” It is:
- enrich chunks when raw chunks lose meaning
- combine lexical and semantic retrieval
- rerank before generation whenever candidate sets are noisy
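"Enrich chunks when raw chunks lose meaning" can be as simple as prepending a short document-level context line before indexing. Anthropic's published approach generates that context with an LLM; the static template below is only a stand-in to show where the enrichment happens.

```python
def contextualize(chunk, doc_title, section):
    # prepend document- and section-level context so the chunk still
    # makes sense (and embeds well) when seen in isolation
    context = f"From '{doc_title}', section '{section}'. "
    return context + chunk

raw = "The limit was raised to 50,000 per transfer."
enriched = contextualize(raw, "Enterprise Billing Policy 2024", "Payment limits")
print(enriched)
```

On its own, the raw chunk could belong to any document about any limit; the enriched version carries enough context for both the embedding model and the lexical index to place it correctly.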
When query rewriting helps
Query rewriting helps most when the user query is:
- underspecified
- conversational
- multi-step
- domain-specific but inconsistent in phrasing
Examples:
- “what are the payment limits?” becomes a policy- and geography-aware query
- “how do we handle vendor onboarding and approvals?” becomes two retrieval intents
Do not use rewriting as a bandage for a broken index. Fix metadata, chunking, and candidate generation first.
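The vendor-onboarding example above can be sketched as a decomposition step. Real systems usually do this with an LLM rewrite call; splitting on "and" is only a stand-in to show the output shape (one compound question in, two retrieval intents out):

```python
def decompose(query):
    # toy splitter: a real implementation would use an LLM rewrite step
    parts = [p.strip(" ?") for p in query.split(" and ")]
    return [p + "?" for p in parts if p]

print(decompose("how do we handle vendor onboarding and approvals?"))
```

Each sub-query then runs through the same candidate-generation and reranking funnel, and the results are merged before generation.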
Which metrics matter most
Measure retrieval separately from generation.
Retrieval metrics
- Recall@k
- Hit@k
- MRR or nDCG
- Reranker lift over raw candidates
Answer metrics
- Citation correctness
- Groundedness
- Human-rated usefulness
- Unsupported-claim rate
If you only measure the final answer, you cannot tell whether the failure came from retrieval, reranking, or generation.
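Two of the retrieval metrics above are simple enough to implement inline, given a labeled set of relevant document ids per query:

```python
def recall_at_k(retrieved, relevant, k):
    # fraction of the relevant docs that appear in the top k results
    hits = len(set(retrieved[:k]) & set(relevant))
    return hits / len(relevant)

def mrr(retrieved_lists, relevant_sets):
    # mean reciprocal rank of the first relevant doc per query
    total = 0.0
    for retrieved, relevant in zip(retrieved_lists, relevant_sets):
        for rank, doc in enumerate(retrieved, start=1):
            if doc in relevant:
                total += 1.0 / rank
                break
    return total / len(retrieved_lists)

print(recall_at_k(["d2", "d9", "d4"], {"d4", "d7"}, k=3))  # 0.5
print(mrr([["d2", "d9", "d4"]], [{"d4", "d7"}]))           # 1/3: first hit at rank 3
```

Reranker lift is then just these same metrics computed twice per query, once on the raw merged candidates and once on the reranked list, with the difference reported.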
Retrieval cost shape
A stronger retrieval funnel often lowers end-to-end cost even if it adds one more step.
Why:
- better recall reduces useless retries
- reranking reduces prompt bloat
- fewer irrelevant chunks reach the generator
That means the cheapest RAG architecture is often not the simplest one. It is the one that keeps token waste low while preserving evidence quality.
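A back-of-the-envelope comparison makes the shape concrete. Every number here is a made-up assumption (chunk size, input-token price, per-query reranker cost); substitute your own:

```python
TOKENS_PER_CHUNK = 500
PRICE_PER_1K_INPUT_TOKENS = 0.003   # hypothetical model price, USD
RERANK_COST_PER_QUERY = 0.001       # hypothetical reranker price, USD

def prompt_cost(chunks_sent):
    # input-token cost of the evidence portion of the prompt
    return chunks_sent * TOKENS_PER_CHUNK / 1000 * PRICE_PER_1K_INPUT_TOKENS

raw = prompt_cost(20)                                # send 20 raw candidates
funneled = prompt_cost(4) + RERANK_COST_PER_QUERY    # rerank, keep 4
print(f"raw: ${raw:.4f}  funneled: ${funneled:.4f}")
```

Under these assumptions the reranked funnel is cheaper per query despite the extra step, before even counting the retries it avoids.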
Copyable retrieval funnel worksheet
For one production workflow, fill in:
- Query types: exact-match, semantic, or mixed?
- Corpus size and update cadence?
- Metadata filters available?
- Candidate sources: lexical, dense, or both?
- How many candidates are merged before reranking?
- How many chunks reach generation?
- Which retrieval metric will decide whether the change worked?
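One way to keep the worksheet from going stale is to capture the answers as a config object that lives next to the retrieval code. Field names below mirror the worksheet questions; the example values are illustrative, not recommendations:

```python
from dataclasses import dataclass

@dataclass
class RetrievalFunnelConfig:
    query_types: str            # "exact-match", "semantic", or "mixed"
    corpus_size: int
    update_cadence: str
    metadata_filters: list      # filters available at query time
    candidate_sources: list     # e.g. ["lexical", "dense"]
    candidates_merged: int      # candidates merged before reranking
    chunks_to_generation: int   # chunks that reach the prompt
    success_metric: str         # the metric that decides if a change worked

support_docs = RetrievalFunnelConfig(
    query_types="mixed",
    corpus_size=120_000,
    update_cadence="daily",
    metadata_filters=["product", "region"],
    candidate_sources=["lexical", "dense"],
    candidates_merged=50,
    chunks_to_generation=5,
    success_metric="Recall@50 on the labeled eval set",
)
print(support_docs.chunks_to_generation)
```
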
Common failure modes
- sending raw candidates directly to generation
- optimizing chunk size without measuring recall
- over-rewriting queries
- evaluating only final answer quality
- using a reranker but never checking whether it actually improves ranking
How StackSpend helps
Retrieval work changes both inference cost and infrastructure cost. Tracking spend by workflow and category helps you see whether better reranking reduced token volume, whether embedding or retrieval APIs are growing, and whether a retrieval upgrade actually improved economics after deployment.