Guides
March 11, 2026
By Andrew Day

Hybrid search and reranking patterns for RAG

Improve RAG quality by treating retrieval as a funnel: lexical search, dense retrieval, reranking, and only then generation.


Use this when your RAG system is technically working but still misses the right evidence too often.

The short answer: treat retrieval as a funnel, not one magic embedding query. A strong default is lexical plus dense candidate generation, optional query rewriting, then reranking before the final prompt.

What you will get in 11 minutes

  • A practical retrieval funnel for most production RAG systems
  • When BM25, dense retrieval, and rerankers each help
  • Which metrics to watch before touching generation prompts
  • A simple worksheet for tuning candidate recall vs token cost

Use this when

  • Users say “the answer was plausible but ignored the right document”
  • Dense retrieval alone is not finding exact names, codes, or rare terms
  • Your model sees too many irrelevant chunks
  • You want better answer quality without immediately changing the generation model

The 60-second answer

| Retrieval step | Job |
| --- | --- |
| Query rewriting | clarify the user's intent or decompose complex asks |
| BM25 / lexical search | catch exact terms, identifiers, and rare phrases |
| Dense retrieval | catch semantic similarity and paraphrases |
| Reranking | reorder candidates so the best evidence reaches the prompt |
| Generation | answer only after the evidence set is stronger |

If your retrieval stage is weak, changing the generation model usually just makes the wrong answer sound better.

Why dense retrieval alone is not enough

Dense retrieval is good at semantic similarity, but many real production queries also depend on:

  • exact product names
  • error codes
  • contract terms
  • part numbers
  • date qualifiers

BM25 or another lexical layer helps recover those cases. Dense retrieval helps with paraphrases and concept matching. Together they are usually stronger than either alone.
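To make the lexical side concrete, here is a minimal, self-contained BM25 scorer (standard k1/b parameterization, stdlib only). It is a sketch for illustration, not a production index: real systems use an engine like Elasticsearch or Lucene. The toy corpus below shows the point of the lexical layer: only the document containing the exact error code scores above zero.

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.5, b=0.75):
    """Score each tokenized doc against query_terms with BM25.

    docs: list of token lists. Returns one float score per doc.
    """
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n
    # document frequency: how many docs contain each term
    df = Counter()
    for d in docs:
        for t in set(d):
            df[t] += 1
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for t in query_terms:
            if t not in tf:
                continue
            idf = math.log(1 + (n - df[t] + 0.5) / (df[t] + 0.5))
            norm = tf[t] * (k1 + 1) / (tf[t] + k1 * (1 - b + b * len(d) / avgdl))
            score += idf * norm
        scores.append(score)
    return scores

docs = [
    "error code E1042 raised during billing export".split(),
    "billing exports sometimes fail with unspecified errors".split(),
]
# Only the first doc contains the exact identifier, so only it scores > 0.
print(bm25_scores(["E1042"], docs))
```

A dense retriever can easily rank the second document higher because it is semantically "about" billing failures; the lexical channel is what guarantees the exact identifier match survives into the candidate set.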

A strong default retrieval funnel

1. Normalize or rewrite the query

Use a light rewrite step when the original query is vague, noisy, or multi-part.

Good uses:

  • turning “what changed in enterprise billing last quarter?” into a narrower retrieval query
  • separating a compound question into two sub-queries
  • expanding synonyms for internal naming

Bad uses:

  • rewriting so aggressively that you erase the user's real wording
  • using a rewrite when simple filters would work
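The shape of the rewrite step can be sketched without a model. The toy decomposer below splits a compound question at a coordinating "and"; in production you would use an LLM prompt for this, but the interface is the same: one user query in, one or more retrieval queries out. The function name and splitting rule are illustrative assumptions, not a recommended heuristic.

```python
import re

def decompose(query):
    """Naively split a compound question into sub-queries at a standalone 'and'.

    A production rewriter would use an LLM; this toy rule only shows the
    step's shape: one query in, one or more retrieval queries out.
    """
    parts = re.split(r"\band\b", query)
    subs = [p.strip(" ?") for p in parts if p.strip(" ?")]
    return subs if len(subs) > 1 else [query]

print(decompose("how do we handle vendor onboarding and approvals?"))
print(decompose("what are the payment limits?"))  # no rewrite needed
```

Note the failure mode this toy version exhibits: the second sub-query ("approvals") loses the shared context of the first. This is exactly the over-rewriting risk called out above, and why rewrites should stay light.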

2. Generate candidates from two channels

Use both:

  • lexical search for exact matches
  • dense retrieval for semantic matches

Then merge the candidates before reranking.

This is the highest-leverage change for many systems because it improves recall before you pay for a larger prompt.
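One common way to merge the two channels is reciprocal rank fusion (RRF), which needs only each channel's ranking, not comparable scores. A minimal sketch, assuming best-first candidate lists of document ids from each channel:

```python
def rrf_merge(ranked_lists, k=60, top_n=20):
    """Merge ranked candidate lists with reciprocal rank fusion.

    Each input list is ordered best-first. A doc's fused score is the sum
    of 1 / (k + rank) across the lists it appears in, so ids ranked high
    in either channel float to the top. k=60 is the commonly used constant.
    """
    scores = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    merged = sorted(scores, key=scores.get, reverse=True)
    return merged[:top_n]

lexical = ["d3", "d1", "d7"]   # e.g. BM25 top hits
dense   = ["d1", "d9", "d3"]   # e.g. embedding top hits
print(rrf_merge([lexical, dense]))
```

Because "d1" appears near the top of both channels, it ranks first in the fused list even though neither channel put it first. RRF is a good default because it sidesteps the problem of normalizing BM25 scores against cosine similarities.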

3. Rerank before generation

Reranking is the “last filter” between candidate recall and prompt cost.

Instead of sending the top 10 or 20 raw candidates to the model, rerank them and keep only the top few chunks that are actually relevant.

Benefits:

  • better prompt precision
  • lower token spend
  • fewer contradictory chunks
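The reranking step itself is a small piece of glue code. In the sketch below, `score_fn` stands in for a cross-encoder reranker model (which scores each query-candidate pair jointly); the token-overlap scorer is only a placeholder so the example runs self-contained.

```python
def rerank(query, candidates, score_fn, keep=4):
    """Reorder merged candidates by relevance and keep only the top few.

    score_fn(query, text) -> float. In production this would call a
    cross-encoder reranker; the stand-in below just shows the shape.
    """
    return sorted(candidates, key=lambda c: score_fn(query, c), reverse=True)[:keep]

def overlap_score(query, text):
    """Placeholder scorer: fraction of query tokens present in the text."""
    q, t = set(query.lower().split()), set(text.lower().split())
    return len(q & t) / max(len(q), 1)

merged = [
    "enterprise billing changed tiers in Q3",
    "vendor onboarding checklist",
    "billing export error codes",
    "office relocation notice",
]
top = rerank("what changed in enterprise billing", merged, overlap_score, keep=2)
print(top)
```

The `keep` parameter is the lever that trades recall against prompt cost: dropping from 20 raw candidates to the best 2-4 reranked chunks is where most of the token savings come from.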

What Anthropic's contextual retrieval work changed

Anthropic's Contextual Retrieval research is useful because it frames retrieval as more than chunking and embeddings. Their published results showed that adding context to chunks before embedding and combining that with lexical search and reranking materially reduced retrieval failures.

The practical lesson is not “copy one exact pipeline.” It is:

  • enrich chunks when raw chunks lose meaning
  • combine lexical and semantic retrieval
  • rerank before generation whenever candidate sets are noisy
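Chunk enrichment can be as simple as prepending document- and section-level context before embedding, so an isolated chunk keeps its meaning. The template below is a hand-written sketch of the idea; Anthropic's published approach generates the contextual preamble with an LLM rather than from fixed metadata.

```python
def contextualize(doc_title, section, chunk):
    """Prepend document and section context so a chunk stays meaningful
    when embedded and retrieved in isolation.

    A fixed template sketch; Anthropic's Contextual Retrieval generates
    this preamble with an LLM instead of from metadata fields.
    """
    return f"Document: {doc_title}\nSection: {section}\n\n{chunk}"

enriched = contextualize(
    "2025 Master Services Agreement",
    "Termination",
    "Either party may terminate with 30 days written notice.",
)
print(enriched)
```

Without the preamble, a chunk like "Either party may terminate with 30 days written notice" embeds ambiguously: nothing in it says which contract, which parties, or which section it governs.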

When query rewriting helps

Query rewriting helps most when the user query is:

  • underspecified
  • conversational
  • multi-step
  • domain-specific but inconsistent in phrasing

Examples:

  • “what are the payment limits?” becomes a policy- and geography-aware query

  • “how do we handle vendor onboarding and approvals?” becomes two retrieval intents

Do not use rewriting as a bandage for a broken index. Fix metadata, chunking, and candidate generation first.

Which metrics matter most

Measure retrieval separately from generation.

Retrieval metrics

  • Recall@k
  • Hit@k
  • MRR or nDCG
  • Reranker lift over raw candidates

Answer metrics

  • Citation correctness
  • Groundedness
  • Human-rated usefulness
  • Unsupported-claim rate

If you only measure the final answer, you cannot tell whether the failure came from retrieval, reranking, or generation.
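The retrieval-side metrics are cheap to compute once you have a labeled set of (query, relevant doc ids) pairs. A minimal sketch of Recall@k and MRR, stdlib only:

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of relevant ids that appear in the top-k retrieved list."""
    hits = len(set(retrieved[:k]) & set(relevant))
    return hits / len(relevant)

def mrr(retrieved_lists, relevant_sets):
    """Mean reciprocal rank of the first relevant hit, averaged over queries.

    Queries with no relevant hit contribute 0.
    """
    total = 0.0
    for retrieved, relevant in zip(retrieved_lists, relevant_sets):
        for rank, doc_id in enumerate(retrieved, start=1):
            if doc_id in relevant:
                total += 1.0 / rank
                break
    return total / len(retrieved_lists)

# One query: 4 retrieved ids, 2 known-relevant ids, one found in the top 3.
print(recall_at_k(["a", "b", "c", "d"], ["b", "x"], k=3))  # 0.5
```

Computing "reranker lift" is then just these same metrics evaluated twice: once on the raw merged candidates and once on the reranked order, on the same labeled queries.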

Retrieval cost shape

A stronger retrieval funnel often lowers end-to-end cost even if it adds one more step.

Why:

  • better recall reduces useless retries
  • reranking reduces prompt bloat
  • fewer irrelevant chunks reach the generator

That means the cheapest RAG architecture is often not the simplest one. It is the one that keeps token waste low while preserving evidence quality.

Copyable retrieval funnel worksheet

For one production workflow, fill in:

  1. Query types: exact-match, semantic, or mixed?
  2. Corpus size and update cadence?
  3. Metadata filters available?
  4. Candidate sources: lexical, dense, or both?
  5. How many candidates are merged before reranking?
  6. How many chunks reach generation?
  7. Which retrieval metric will decide whether the change worked?

Common failure modes

  • sending raw candidates directly to generation
  • optimizing chunk size without measuring recall
  • over-rewriting queries
  • evaluating only final answer quality
  • using a reranker but never checking whether it actually improves ranking

How StackSpend helps

Retrieval work changes both inference cost and infrastructure cost. Tracking spend by workflow and category helps you see whether better reranking reduced token volume, whether embedding or retrieval APIs are growing, and whether a retrieval upgrade actually improved economics after deployment.

What to do next

Continue in Academy

Build production LLM applications

Choose the right LLM pattern for structured data, retrieval, agents, chat, multimodal workflows, and ML-adjacent systems.

