You're building a feature that needs an LLM. You have a dozen options, and every provider page claims high quality, low cost, and low latency. In practice, the choice is simpler than it looks.
Choose an LLM by ranking three levers for your workload: quality, latency, and cost. Then test two or three candidates on your own prompts. The right model is not the most capable one overall. It is the cheapest model that reliably clears your quality bar at acceptable latency.
Quick answer: which kind of model should you start with?
If you want the short version:
- User-facing chat: start with a fast mid-tier model.
- Batch summarization: start with a cheaper model.
- Code generation or hard reasoning: start with a stronger model, then test cheaper fallbacks.
- RAG or long-context retrieval: watch input pricing first, because those workloads are prompt-heavy.
That gets you 80% of the way there. The rest is evaluation discipline.
What are the three levers?
Cost means what you pay per token or per request. Usage-based pricing compounds quickly at scale, so a model that looks cheap in testing can become expensive in production.
Latency means time to first token and time to completion. In a chat UI, users feel latency immediately. In a nightly batch job, they do not.
Quality means task success: accuracy, formatting, reasoning, code correctness, or instruction-following. It is not the same thing as benchmark rank.
Most teams should choose one primary lever and one secondary lever. If you try to maximize all three at once, you usually overpay and still end up uncertain.
Which lever matters most for your workload?
If latency matters most, use a faster model and accept some quality trade-off. If cost matters most, use the cheapest model that still passes your evals. If quality matters most, start with a stronger model and only step down if your tests say you safely can.
How should you define “good enough”?
Before comparing models, write down what success actually means for your use case.
- Structured output: Does the model need to return valid JSON or follow a schema?
- Domain fit: Is the workload code, support, analysis, legal text, or creative writing?
- Error tolerance: Can you tolerate occasional mistakes, or is a single bad answer expensive?
- Review model: Will a human review outputs, or does the model act directly in the product?
This matters because many teams skip the definition step and end up paying for model quality they do not need.
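If structured output is part of your definition, the check can be a few lines of code. Here is a minimal Python sketch, assuming the task requires valid JSON with certain keys (the keys below are placeholders, not a real schema):

```python
import json

def meets_bar(output: str, required_keys: set[str]) -> bool:
    """One hypothetical success criterion: output is valid JSON
    containing every required key."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and required_keys <= data.keys()

print(meets_bar('{"summary": "ok", "score": 3}', {"summary", "score"}))  # True
print(meets_bar('Sure! Here is the JSON you asked for', {"summary"}))    # False
```

Once "good enough" is a function, comparing models becomes a measurement rather than a debate.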
A practical recommendation
Run an eval set of 50 to 100 representative examples across 2 or 3 candidate models. Score them on:
- Task success
- Format compliance
- Latency
- Cost per task
Pick the cheapest model that clears your minimum acceptable score. That recommendation is falsifiable, repeatable, and easier to defend than "this model felt better."
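The selection loop can be sketched in a few lines. Everything below is a stand-in: `fake_call` replaces a real API call and the per-task costs are invented, but the deciding logic, the cheapest model above the bar, is the part that matters:

```python
# Hypothetical candidates with made-up per-task costs; swap in real models.
CANDIDATES = {"budget": 0.0004, "mid": 0.002, "flagship": 0.02}  # $ per task
MIN_SCORE = 0.90  # your minimum acceptable task-success rate

def fake_call(model: str, prompt: str) -> str:
    """Stand-in for a real model call; the budget model 'fails' on purpose."""
    return prompt if model == "budget" else prompt.upper()

def run_eval(model: str, examples: list[tuple[str, str]]) -> float:
    """Fraction of examples where the output matches the expected answer."""
    hits = sum(fake_call(model, prompt) == expected for prompt, expected in examples)
    return hits / len(examples)

examples = [("a", "A"), ("b", "B"), ("c", "C")]
passing = {m: cost for m, cost in CANDIDATES.items()
           if run_eval(m, examples) >= MIN_SCORE}
best = min(passing, key=passing.get)  # cheapest model that clears the bar
print(best)  # mid
```

In a real harness you would also record latency and format compliance per example, but the shape of the decision stays the same.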
How do you test latency?
User-facing chat usually wants sub-second time to first token. Batch jobs can tolerate slower responses if quality or cost is better.
- Streaming workloads: prioritize time to first token.
- Non-streaming workflows: prioritize full completion time.
- Long outputs: total runtime matters more than the initial response.
- Cross-region traffic: a distant region can add noticeable delay.
Use model latency benchmarks as a point-in-time reference, then test your own prompts because prompt length and output length materially affect speed.
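Measuring both numbers takes only a timer around a streaming response. A minimal Python sketch, with a fake generator standing in for a real streaming API:

```python
import time

def fake_stream(tokens, delay=0.02):
    """Stand-in for a streaming API response: yields tokens with a delay."""
    for token in tokens:
        time.sleep(delay)
        yield token

start = time.perf_counter()
ttft = None
for i, token in enumerate(fake_stream(["Hello", " ", "world", "!"])):
    if i == 0:
        ttft = time.perf_counter() - start  # time to first token
total = time.perf_counter() - start        # full completion time
print(f"TTFT {ttft:.3f}s, total {total:.3f}s")
```

Run the same measurement against your real prompts, because a long prompt shifts TTFT and a long output shifts total runtime.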
How do you compare model cost?
Token-based pricing varies widely. As of March 2026:
- Flagship models (GPT-5.2, Claude Opus): ~$5–25 per 1M input tokens, ~$15–170 per 1M output
- Mid-tier (GPT-5 Mini, Claude Sonnet, Gemini Flash): ~$0.25–3 per 1M input, ~$2–15 per 1M output
- Budget (Gemini Flash Lite, smaller open models): ~$0.10–0.50 per 1M tokens
For input-heavy workloads such as RAG, long-context search, or large prompt assembly, input pricing usually dominates. For output-heavy tasks such as summarization or content generation, output pricing matters more. See our AI API pricing guide for point-in-time pricing context.
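A quick way to see which side dominates is to compute cost per task for your actual token shape. The prices below are illustrative placeholders, not quotes:

```python
def cost_per_task(input_tokens: int, output_tokens: int,
                  in_price: float, out_price: float) -> float:
    """Cost in dollars for one request; prices are per 1M tokens."""
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000

# Assumed mid-tier-ish prices: $1/M input, $5/M output.
rag = cost_per_task(40_000, 500, 1.0, 5.0)      # input-heavy: big prompt, short answer
summary = cost_per_task(2_000, 8_000, 1.0, 5.0)  # output-heavy: short prompt, long answer
print(f"RAG task: ${rag:.4f}, summarization: ${summary:.4f}")
```

In the RAG example the prompt accounts for almost all of the cost, so a cheaper input rate matters more than a cheaper output rate; the summarization example is the reverse.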
When should you use more than one model?
You do not need one model forever, and you do not need one model for every task.
Use a router or gateway when:
- chat needs a faster model than batch processing,
- you want automatic fallbacks,
- you want to A/B test cheaper alternatives,
- or you want to route high-value requests to a stronger model.
That extra flexibility adds operational complexity, so it only pays off once model usage is already meaningful. See LLM tooling in 2026 for gateway options.
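A task-based router can start as a lookup table. A toy sketch, where the model names and the routing policy are assumptions, not recommendations:

```python
# Toy task-based router. Model names and policy are placeholders.
ROUTES = {
    "chat": "fast-mid-tier",   # latency-sensitive
    "batch": "budget",         # cost-sensitive
    "code": "flagship",        # quality-sensitive
}

def pick_model(task: str, high_value: bool = False) -> str:
    if high_value:
        return "flagship"                     # route high-value requests up
    return ROUTES.get(task, "fast-mid-tier")  # safe default for unknown tasks

print(pick_model("chat"))                    # fast-mid-tier
print(pick_model("batch", high_value=True))  # flagship
```

Fallbacks and A/B tests bolt onto the same function later, which is why even this trivial indirection is worth having once you run more than one model.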
What does a good model choice look like in practice?
Here are four simple starting points:
- Support chatbot: prioritize latency first, then quality, then cost.
- Internal summarization workflow: prioritize cost first, then quality.
- Developer copilot or code review: prioritize quality first, then latency.
- RAG over long documents: prioritize input pricing and structured-output reliability.
The goal is not to guess correctly on day one. The goal is to make a defensible first choice and improve it with production data.
How should you monitor the decision after launch?
Once you choose a model, track spend by model and workload. Without that, you cannot tell whether a cheaper model would have worked or whether one feature is consuming a disproportionate share of budget.
Track at least:
- Spend by provider
- Spend by model
- Request volume
- Error or fallback rate
- Output quality complaints or review failures
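If you log per-request cost, the aggregation itself is trivial. A sketch over a hypothetical request log; in practice the rows come from your gateway, proxy, or provider billing exports:

```python
from collections import defaultdict

# Hypothetical request log with invented costs.
requests = [
    {"provider": "openai", "model": "mid", "cost": 0.002, "error": False},
    {"provider": "anthropic", "model": "flagship", "cost": 0.02, "error": False},
    {"provider": "openai", "model": "mid", "cost": 0.002, "error": True},
]

spend = defaultdict(float)
errors = 0
for r in requests:
    spend[(r["provider"], r["model"])] += r["cost"]  # spend by provider and model
    errors += r["error"]

print(dict(spend))
print(f"error rate: {errors / len(requests):.0%}")
```

The hard part is not the arithmetic; it is making sure every request actually gets logged with its model and cost attached.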
If you use multiple providers, a unified monitoring layer helps you compare the real cost of your decisions. StackSpend tracks OpenAI, Anthropic, Cursor, Hugging Face, and Grok alongside cloud spend in one place.
When should you revisit the model choice?
- Volume growth — usage pricing scales linearly, so 10x more requests means roughly 10x more cost. A model that was fine at low volume may be too expensive at scale.
- New models — Vendors release cheaper or faster models regularly. Re-evaluate quarterly.
- Quality drift — If users complain or evals degrade, consider upgrading.
- Latency complaints — If TTFT is too high, try a smaller model or a different provider.
A simple selection checklist
Before you commit to a model, make sure you can answer yes to these:
- Do we know what “good enough” means for this task?
- Have we tested at least two realistic alternatives?
- Have we measured both latency and cost on real prompts?
- Do we know whether this workload is input-heavy or output-heavy?
- Do we have a way to monitor spend after launch?
If not, you are still choosing on intuition.
Bottom line
- Define the quality bar.
- Test 2 or 3 candidates on your own workload.
- Measure latency on real prompts.
- Compare cost by workload shape, not just by list price.
- Start with the cheapest model that passes.
- Revisit the choice when usage, quality needs, or pricing changes.
That is the most reliable framework for choosing an LLM without overpaying.
FAQ
How do I know if a cheaper model is "good enough"?
Run a representative eval set. If accuracy and format compliance meet your bar, it's good enough. Don't over-invest in quality you don't need.
Should I use one model for everything?
Only if your workloads are similar. Chat, batch, and RAG have different cost/latency/quality profiles. Routing by task usually saves money.
Should I choose the highest benchmark model?
Not by default. Benchmarks are useful, but production fit depends on your prompts, output format, latency tolerance, and budget.
How often should I re-evaluate?
Quarterly for pricing and new models. Immediately if you see quality or latency issues.
What if I use multiple providers (OpenAI + Anthropic)?
Track total spend across providers. Unified visibility prevents one provider's growth from hiding behind another's stability.
What is the most common mistake teams make?
They skip evaluation and choose based on provider brand or social proof. The second-most common mistake is using a strong expensive model everywhere, even for low-stakes tasks.
Do I need a router on day one?
Usually no. Start with one model unless you already know you need fallbacks or task-based routing.