Use this when you're deciding whether to retrieve with embeddings or stuff more content into the prompt.
The fast answer: for small knowledge bases (<200K tokens), full-context with prompt caching often wins. For larger corpora or frequently changing data, retrieval with embeddings scales better.
What you will get in 10 minutes
- Cost shape of embeddings + retrieval
- Cost shape of long-context prompting
- Decision rules by corpus size and update cadence
- A checklist to pick the right approach for your use case
Use this when
- You're building RAG or document-Q&A and unsure whether to retrieve or prompt with more context
- Long-context models are available and you're tempted to skip retrieval
- Retrieval or inference costs are growing and you want to understand the tradeoff
- You're choosing between embedding providers and wondering if retrieval is worth the infra
The 60-second answer
| Your situation | Prefer |
| --- | --- |
| Corpus under ~200K tokens (~500 pages) | Full-context + prompt caching |
| Corpus over 200K tokens | Embeddings + retrieval (RAG) |
| Same documents reused across many requests | Full-context (caching pays off fast) |
| Documents change weekly or daily | Retrieval (re-embed is cheaper than re-cache) |
| Queries need only a small slice of the corpus | Retrieval (targeted context beats full dump) |
| Budget tight, need to ship in days | Full-context first if corpus fits |
Anthropic recommends full-context with caching for knowledge bases under 200K tokens before building retrieval. For provider thresholds above 200K, see long-context AI pricing.
Cost shape: Embeddings + retrieval
Costs come from embedding, retrieval infra, and inference with retrieved chunks.
Embedding cost
- One-time per document change (or per ingestion run)
- Voyage AI: ~$0.02–$0.06 per million tokens (pricing); 200M free tokens/month
- OpenAI text-embedding-3-small/large: ~$0.02 and ~$0.13 per million tokens, respectively
- Contextual retrieval (add context before embedding): ~$1.02 per million document tokens one-time with Claude Haiku + caching (Anthropic)
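At these rates, the one-time embedding bill is usually small relative to ongoing inference. A minimal sketch of the arithmetic, assuming a flat per-million-token rate (the function and example rate are illustrative, not a provider API):

```python
def embedding_cost_usd(corpus_tokens: int, rate_per_mtok: float) -> float:
    """One-time cost to embed a corpus at a flat USD-per-million-token rate."""
    return corpus_tokens / 1_000_000 * rate_per_mtok

# A 5M-token corpus at the upper end of the Voyage-style range (~$0.06/MTok)
# costs roughly $0.30 to embed once.
one_time = embedding_cost_usd(5_000_000, 0.06)
```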
Retrieval infra
- Vector DB (Pinecone, Weaviate, pgvector, etc.): roughly $350–$2,850/month managed, or engineering cost if self-hosted
- Reranking (Cohere, Voyage): adds ~$0.02–$0.05 per million tokens
Inference
- Per request: system prompt + retrieved chunks (typically 2–5K tokens) + user query + output
- Retrieval keeps input smaller than full-context because you only send the top-K chunks
- Example: 3K retrieved tokens at Claude Haiku 4.5 ($1/MTok input) ≈ $0.003 per request in context cost
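The per-request math can be made explicit. A sketch with illustrative token counts; the $5/MTok output price is an assumption, not a figure from above:

```python
def retrieval_request_cost(system_tokens: int, chunk_tokens: int,
                           query_tokens: int, output_tokens: int,
                           in_price: float, out_price: float) -> float:
    """Per-request USD cost: input (system + retrieved chunks + query) plus output.
    Prices are USD per million tokens."""
    input_tokens = system_tokens + chunk_tokens + query_tokens
    return input_tokens / 1e6 * in_price + output_tokens / 1e6 * out_price

# 3K retrieved tokens at $1/MTok input dominate a small system prompt and query:
# input: 3,600 tokens ~ $0.0036; output: 300 tokens at $5/MTok ~ $0.0015
cost = retrieval_request_cost(500, 3_000, 100, 300, in_price=1.0, out_price=5.0)
```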
When retrieval wins: Large corpus, selective queries (only a slice needed), documents change often, or you need to scale beyond what fits in context.
Cost shape: Full-context prompting
Costs come from input tokens (and output). Prompt caching changes the math.
Without caching
- You pay full input price for the entire corpus on every request
- 200K tokens at Claude Sonnet 4.6 ($3/MTok) ≈ $0.60 per request
- At scale this dominates; retrieval sends only 2–5K tokens per request
With prompt caching
- Cache write: 1.25x–2x base input price (one-time per cache window)
- Cache read: ~10% of base input price (Anthropic pricing)
- After one or two cache hits, cached content is much cheaper than re-processing
- Same corpus reused across many requests → caching pays off quickly
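To see how quickly caching pays off, compare cumulative context cost with and without it. A sketch using the multipliers above (1.25x write, ~10% read); it simplifies by charging one cache write up front and one cache read per request:

```python
def full_context_cost(corpus_tokens: int, n_requests: int, base_price: float,
                      cached: bool, write_mult: float = 1.25,
                      read_mult: float = 0.10) -> float:
    """Cumulative USD context cost of sending the full corpus with every request."""
    mtok = corpus_tokens / 1e6
    if not cached:
        return n_requests * mtok * base_price
    write = mtok * base_price * write_mult              # one-time cache write
    reads = n_requests * mtok * base_price * read_mult  # ~10% of base price per hit
    return write + reads

# 200K-token corpus at $3/MTok over 10 requests:
uncached = full_context_cost(200_000, 10, 3.0, cached=False)  # 10 x $0.60 = $6.00
cached = full_context_cost(200_000, 10, 3.0, cached=True)     # $0.75 + 10 x $0.06 = $1.35
```

Under these assumptions the cached path is already cheaper by the second request, which is why high reuse favors full-context.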
When full-context wins: Corpus fits in context (<200K tokens), same content reused often, and you can use a provider with caching (Anthropic, others).
Decision rules
Rule 1: Corpus size
| Corpus size | Recommendation |
| --- | --- |
| < 100K tokens | Full-context + caching. Skip retrieval. |
| 100K–200K tokens | Try full-context first. If latency or cost is high, move to retrieval. |
| > 200K tokens | Retrieval. Full-context hits long-context pricing tiers or exceeds the context window.
Rule 2: Cache reuse
- High reuse (same docs, many similar queries) → full-context + caching
- Low reuse (each query touches different docs) → retrieval
- Mixed → consider hybrid: cache a shared prefix, retrieve the rest
Rule 3: Update cadence
- Documents change rarely → full-context is fine; re-cache on change
- Documents change weekly/daily → retrieval; re-embed is cheaper than re-cache at scale
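The per-update asymmetry is easy to see in numbers (rates taken from the sections above; the 1M-token change set is illustrative):

```python
# Re-embedding only the changed documents vs. rewriting the whole prompt cache.
changed_tokens = 1_000_000
re_embed = changed_tokens / 1e6 * 0.06   # ~$0.06 at a $0.06/MTok embedding rate
re_cache = 200_000 / 1e6 * 3.0 * 1.25    # ~$0.75 cache write at $3/MTok x 1.25
```

Embeddings also persist until a document changes, while cache writes recur whenever the cache expires, widening the gap under frequent updates.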
Rule 4: Query selectivity
- Queries need a small slice of the corpus → retrieval wins (targeted 2–5K vs full 200K)
- Queries often need most of the corpus → full-context can work if it fits
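The four rules combine into a simple heuristic. A sketch only: the thresholds come from the tables above, and the function itself is hypothetical, not a library API:

```python
def choose_approach(corpus_tokens: int, high_reuse: bool,
                    changes_often: bool, selective_queries: bool) -> str:
    """Pick a starting approach from Rules 1-4."""
    if corpus_tokens < 100_000:
        return "full-context"   # Rule 1: small corpus, skip retrieval
    if corpus_tokens > 200_000:
        return "retrieval"      # Rule 1: hits pricing tiers or won't fit
    # 100K-200K tokens: let the other signals decide
    if changes_often or selective_queries:
        return "retrieval"      # Rules 3-4: targeted, fresh context
    if high_reuse:
        return "full-context"   # Rule 2: caching pays off
    return "full-context"       # try it first; move to retrieval if cost grows

print(choose_approach(80_000, True, False, False))   # → full-context
print(choose_approach(150_000, False, True, True))   # → retrieval
```

Corpus size is checked first because it is the hard constraint; reuse, cadence, and selectivity only matter once the corpus could plausibly fit in context.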
Checklist by use case
| Use case | Full-context | Retrieval |
| --- | --- | --- |
| Internal wiki Q&A, <100 docs | ✓ | |
| Customer support KB, 500+ articles | | ✓ |
| Codebase search, 10K+ files | | ✓ |
| Legal/contract search, 1000s of docs | | ✓ |
| Chat over a few PDFs, same each session | ✓ | |
| Product docs, updated weekly | | ✓ |
How to see which is costing you
Category-level cost analysis shows which side dominates. If retrieval is a big cost driver, you'll see spend on embedding APIs and possibly a separate vector-DB infra line. If inference dominates, check whether long context or retrieved chunks are driving the input tokens. AI cost monitoring surfaces spend by provider, model, and category, so you can see whether embeddings, inference, or infra is moving the needle.