Use this when you're choosing how to add knowledge or behavior to an LLM—RAG, fine-tuning, or both.
The fast answer: it's no longer binary. Three options matter—RAG, fine-tuning, and full-context for small knowledge bases. Hybrids are common: put volatile knowledge in retrieval, put stable behavior in fine-tuning.
What you will get in 12 minutes
- A decision framework (data volatility, size, update cadence)
- Cost models for RAG vs fine-tuning vs full-context
- When each approach wins
- Hybrid patterns that teams actually ship
- A simple worksheet to pick an approach
Use this when
- You're building an AI feature that needs proprietary knowledge or specific behavior
- The team is debating RAG vs fine-tuning
- You want to understand cost before committing to an architecture
- You're scaling an existing RAG or fine-tuned system and wondering if you chose right
The 60-second answer
| Your situation | Best first option |
| --- | --- |
| Data changes weekly or daily | RAG |
| You need the model to learn a new skill or behavior | Fine-tuning |
| You need factual accuracy from documents | RAG |
| You need consistent tone, style, or format | Fine-tuning |
| Budget tight, need production in < 6 weeks | RAG (almost always) |
| Knowledge base under ~200K tokens (~500 pages) | Full-context + prompt caching first |
If your knowledge base is smaller than 200,000 tokens, Anthropic recommends including the entire corpus in the prompt with prompt caching before building RAG. Caching can cut repeated-prompt costs by up to 90%.
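The caching math above can be sketched in a few lines. Prices here are illustrative, taken from this article's own figures (Sonnet-class input at ~$3/MTok, cache reads at ~10% of base input price), and the sketch ignores any cache-write surcharge your provider may add:

```python
# Back-of-envelope input cost for the full-context + caching approach.
INPUT_PRICE_PER_MTOK = 3.00   # illustrative Sonnet-class input price
CACHE_READ_FRACTION = 0.10    # cache reads ~10% of base input price

def full_context_input_cost(kb_tokens: int, requests: int) -> float:
    """Input cost when the whole KB is cached after the first request."""
    first = kb_tokens / 1e6 * INPUT_PRICE_PER_MTOK  # initial cache write
    rest = (requests - 1) * kb_tokens / 1e6 * INPUT_PRICE_PER_MTOK * CACHE_READ_FRACTION
    return first + rest

# 200K-token KB reused across 1,000 requests:
cost = full_context_input_cost(200_000, 1_000)   # ~$60 vs ~$600 uncached
```

Without caching the same workload costs requests × kb_tokens × base price, so the ~90% saving falls straight out of the cache-read discount.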
Cost model: RAG
RAG costs come from three places: embedding, retrieval infra, and inference.
Embedding cost (one-time per doc change)
- Voyage AI (Anthropic-recommended): ~$0.02–$0.06 per million tokens (voyage-4-lite to voyage-4)
- Contextual retrieval (improves accuracy): ~$1.02 per million document tokens one-time, using Claude Haiku + prompt caching (Anthropic)
- 200M free tokens/month on Voyage for main models
Retrieval infra
- Managed vector DB + retrieval: roughly $350–$2,850/month depending on scale
- Or self-host an open-source option (Weaviate, Qdrant, pgvector), trading subscription fees for engineering and ops cost; note that Pinecone is managed-only
Inference
- Higher than fine-tuning: every request includes retrieved chunks (typically 2–5K tokens) in context
- Claude Haiku 4.5: ~$1/MTok input, $5/MTok output
- Claude Sonnet 4.6: ~$3/MTok input, $15/MTok output
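Per-query RAG inference cost is easy to estimate from these prices. A minimal sketch, assuming ~3K retrieved tokens plus ~500 prompt tokens in and ~500 tokens out (the token counts are assumptions, not measurements):

```python
# Rough per-query inference cost for RAG, using this article's prices.
def rag_query_cost(in_price_per_mtok: float, out_price_per_mtok: float,
                   context_tokens: int = 3_500, output_tokens: int = 500) -> float:
    # Retrieved chunks + prompt all count as input tokens on every request.
    return (context_tokens / 1e6 * in_price_per_mtok
            + output_tokens / 1e6 * out_price_per_mtok)

haiku_cost = rag_query_cost(1.0, 5.0)    # Haiku-class pricing
sonnet_cost = rag_query_cost(3.0, 15.0)  # Sonnet-class pricing
```

At these assumptions a Haiku-class query lands around half a cent and a Sonnet-class query around two cents; adjust the token counts to your own chunking.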
When RAG wins: Frequently updated knowledge, factual accuracy from documents, budget-conscious and need to ship quickly.
Cost model: Fine-tuning
Fine-tuning costs come from training and inference.
Training (OpenAI, verified March 2026)
| Model | Training per 1M tokens | Inference input | Inference output |
| --- | --- | --- | --- |
| GPT-4.1 | $25 | $3 | $12 |
| GPT-4.1-mini | $5 | $0.80 | $3.20 |
| GPT-4.1-nano | $1.50 | $0.20 | $0.80 |
- Fine-tuning uses GPT-4.1 family. GPT-5.4 and GPT-5 Mini are inference-only as of early 2026.
- Reinforcement fine-tuning (o4-mini): $100 per training hour; inference $4 input / $16 output.
- OpenAI recommends starting with ~50 well-crafted examples and scaling from there.
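Training cost follows directly from the table above: dataset tokens × epochs × per-token price. A minimal sketch (epoch count is a typical default, not an OpenAI requirement):

```python
# Training cost = dataset tokens x epochs x per-1M-token price (table above).
PRICE_PER_MTOK = {"gpt-4.1": 25.0, "gpt-4.1-mini": 5.0, "gpt-4.1-nano": 1.5}

def training_cost(model: str, dataset_tokens: int, epochs: int = 3) -> float:
    return dataset_tokens / 1e6 * epochs * PRICE_PER_MTOK[model]

# 2M-token dataset, 3 epochs on GPT-4.1-mini:
run_cost = training_cost("gpt-4.1-mini", 2_000_000)  # $30.0
```

Note how small the raw compute is at these prices; the bulk of real fine-tuning spend tends to be data preparation and iteration, not the training run itself.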
Maintenance
- RAG: re-embed changed docs (minutes–hours)
- Fine-tuning: maintain data pipelines and re-train when the training data or base model changes (hours–days per run)
When fine-tuning wins: Stable knowledge, consistent tone/style/format, high query volume (100K+/month) where per-query savings compound, or tasks requiring new skills or domain-specific logic.
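The "per-query savings compound" claim can be made concrete with a break-even calculation. The figures below are illustrative, not quoted from any vendor:

```python
# How many queries until fine-tuning's cheaper inference repays training?
def breakeven_queries(training_cost: float,
                      rag_cost_per_query: float,
                      ft_cost_per_query: float) -> float:
    # Each query saves (rag - ft); divide the up-front cost by that saving.
    return training_cost / (rag_cost_per_query - ft_cost_per_query)

# Example: $50 training run; RAG query ~$0.006 (extra retrieved context)
# vs fine-tuned query ~$0.002 (no retrieval overhead):
n = breakeven_queries(50.0, 0.006, 0.002)  # 12,500 queries
```

At 100K+ queries/month, even modest per-query savings clear a small training bill within days; the calculus changes if retraining recurs often.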
Third option: Full-context for small KBs
For knowledge bases under ~200K tokens:
- Stuff the whole corpus into the prompt
- Use prompt caching (Anthropic) or equivalent
- Skip retrieval entirely
Caching makes repeated long prompts much cheaper—cache reads cost ~10% of base input price. No vector DB, no embedding pipeline.
When full-context wins: Fewer than ~100 documents, total under ~200K tokens, same knowledge base reused across many requests.
RAG quality improvements: Contextual Retrieval
Traditional chunking loses context—a chunk like "revenue grew 3%" without company or quarter is hard to retrieve. Anthropic's Contextual Retrieval adds chunk-specific context before embedding:
- Contextual Embeddings alone: 35% reduction in top-20 retrieval failures
- With Contextual BM25: 49% reduction
- With reranking: 67% reduction
Implementation: Anthropic cookbook. Voyage and Cohere rerankers work well with it.
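The preprocessing step can be sketched as follows. `generate_context` is a placeholder for the real LLM call (e.g., Haiku with the full document in a cached prompt, as in Anthropic's cookbook); here it is a trivial stub so the shape of the pipeline is visible:

```python
# Sketch of contextual-retrieval preprocessing: prepend chunk-specific
# context before embedding / BM25 indexing.
def generate_context(document: str, chunk: str) -> str:
    # Real version: prompt a cheap LLM with the document + chunk and ask
    # for a 1-2 sentence situating blurb. Stub: use the document's title line.
    return f"From: {document.splitlines()[0]}."

def contextualize_chunks(document: str, chunks: list[str]) -> list[str]:
    return [f"{generate_context(document, c)} {c}" for c in chunks]

doc = "ACME Corp Q2 2025 10-Q\n...full filing text..."
chunks = ["Revenue grew 3% over the previous quarter."]
contextualized = contextualize_chunks(doc, chunks)
```

The point is that "Revenue grew 3%..." becomes retrievable for queries mentioning ACME or Q2 2025, which the bare chunk was not.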
Hybrid patterns
| Pattern | Use case |
| --- | --- |
| RAG + fine-tuned generator | RAG retrieves facts; fine-tuned model generates in brand voice. Customer support with citations + consistent tone. |
| Router + specialists | Router classifies: general → base LLM, creative/style → fine-tuned, factual → RAG. Mixed query types, single entry point. |
| Fine-tuned embeddings + RAG | Fine-tune embedding model on domain; standard LLM for generation. Legal, medical, or other domain-specific retrieval. |
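The router + specialists pattern reduces to a small dispatch function. The keyword classifier below is a deliberately naive placeholder; production routers typically use a small LLM or a trained classifier:

```python
# Minimal sketch of the router + specialists pattern.
def route(query: str) -> str:
    q = query.lower()
    if any(w in q for w in ("write", "draft", "rewrite", "tone")):
        return "fine-tuned"   # creative/style work -> fine-tuned model
    if any(w in q for w in ("what", "when", "how much", "policy")):
        return "rag"          # factual lookups -> retrieval pipeline
    return "base"             # everything else -> base LLM
```

Example: `route("Draft a reply in our brand tone")` dispatches to the fine-tuned model, while `route("What is the refund policy?")` goes through RAG.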
Decision worksheet
- Data volatility: Does your knowledge change weekly or daily? → RAG
- Behavior vs knowledge: Do you need new skills/behavior, or access to documents? → Fine-tuning for behavior, RAG for knowledge
- Corpus size: Under 200K tokens? → Try full-context + caching first
- Budget and timeline: Tight budget, need production in < 6 weeks? → RAG
- Query volume: 100K+ requests/month with stable prompts? → Fine-tuning may pay off
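The worksheet above can be encoded as a first-cut recommender. Thresholds mirror the article's; treat the output as a starting point for discussion, not a verdict:

```python
# The decision worksheet as code. Order matters: full-context is checked
# first because it is the cheapest option to try and discard.
def recommend(volatile: bool, needs_behavior: bool,
              corpus_tokens: int, high_volume: bool) -> str:
    if corpus_tokens < 200_000 and not needs_behavior:
        return "full-context + caching"
    if volatile and needs_behavior:
        return "hybrid (RAG + fine-tuning)"
    if volatile:
        return "rag"
    if needs_behavior or high_volume:
        return "fine-tuning"
    return "rag"
```

Example: a 5M-token corpus that changes weekly, with no behavior requirement, recommends RAG; a 100K-token static corpus recommends full-context + caching.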
Cost comparison snapshot
| Factor | RAG | Fine-tuning |
| --- | --- | --- |
| Initial setup | 2–6 weeks | 4–8 weeks |
| RAG infra | $350–$2,850/mo (managed) | — |
| Fine-tuning first run | — | $2,400–$18,000 all-in (data prep, engineering, iteration; raw training compute for 1M tokens × 2 epochs on GPT-4.1-mini is only ~$10) |
| Monthly maintenance | Re-embed changed docs (minutes–hours) | Data pipelines + retrains (hours–days) |
| Inference cost | Higher (retrieval context per request) | Lower (no retrieval overhead) |
Figures are approximate; validate against your workload. Source: PE Collective.