Use this when you know there is waste but are unsure what to change first.
The fast answer: rank tactics by savings potential and effort, then pick two for the next sprint. Start with prompt compression and model tier before touching caching or batching.
What you will get in 12 minutes
- A prioritization framework (savings vs effort)
- A ranked list of six tactics with tradeoffs
- A "do this sprint" block
- A simple way to measure whether a change worked
Use this when
- Your AI bill is higher than you want
- You have visibility into cost drivers but no optimization plan
- The team asks "what should we do first?"
- You are about to scale usage and want to avoid waste from the start
Prioritization framework
Not all tactics are equal. Use two dimensions:
| Dimension | What it means |
| --- | --- |
| Savings potential | How much cost reduction is realistic for your workload? |
| Effort | Engineering time, testing, rollout risk |
High savings and low effort wins first. Low savings and high effort goes to the backlog.
Tactic 1: Prompt compression
Savings potential: High for context-heavy workflows
Effort: Low to medium
Long prompts cost more. Trim system prompts, reduce retrieved context, and remove redundant instructions.
| Action | Example |
| --- | --- |
| Shorten system prompts | Cut from 500 to 150 tokens where clarity allows |
| Limit RAG context | Retrieve top 3 instead of top 10 when quality holds |
| Remove duplicate instructions | Consolidate repeated rules |
| Use structured prompts | Templates reduce token bloat |
Tradeoff: Too aggressive compression can hurt quality. Test on a sample before rollout.
Do this sprint: Identify your top 3 workflows by token cost. Shorten the system prompt for the largest one and A/B test.
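To size the prize before you touch any prompts, a back-of-the-envelope estimate is enough. This sketch uses a rough 4-characters-per-token heuristic and a hypothetical input price of $0.003 per 1K tokens; substitute a real tokenizer and your model's actual rate.

```python
# Rough cost impact of shortening a system prompt.
# Assumes ~4 characters per token and a hypothetical input price.

CHARS_PER_TOKEN = 4          # heuristic, not a real tokenizer
PRICE_PER_1K_INPUT = 0.003   # hypothetical $/1K input tokens

def monthly_savings(old_prompt: str, new_prompt: str,
                    requests_per_month: int) -> float:
    """Estimate dollars saved per month from a shorter system prompt."""
    old_tokens = len(old_prompt) / CHARS_PER_TOKEN
    new_tokens = len(new_prompt) / CHARS_PER_TOKEN
    saved_per_request = (old_tokens - new_tokens) / 1000 * PRICE_PER_1K_INPUT
    return saved_per_request * requests_per_month

old = "x" * 2000   # ~500 tokens
new = "x" * 600    # ~150 tokens
print(round(monthly_savings(old, new, 1_000_000), 2))  # 1050.0
```

At a million requests a month, the 500-to-150-token trim from the table above is worth about $1,000 per month at that rate, which tells you whether the A/B test is worth the sprint.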
Tactic 2: Smaller model fallback
Savings potential: Very high for the right tasks
Effort: Medium
Many tasks do not need flagship models. Switch to cheaper models where quality holds instead of paying flagship rates for the full process.
| Action | Example |
| --- | --- |
| Route by task type | Classification → GPT-5 Mini, complex reasoning → GPT-5.4 |
| Use fallback chains | Try cheaper model first, escalate on failure |
| Evaluate by workflow | One workflow may save 10x, another may need premium |
Tradeoff: Quality risk if task taxonomy is wrong. Evaluate per workflow.
Do this sprint: List workflows that are classification, extraction, or simple summarization. Run an evaluation on one with GPT-5 Mini or Haiku.
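A fallback chain is just "cheapest first, escalate on failure." The sketch below stubs out the provider call (`call_model` is a hypothetical stand-in, and the quality gate is a trivial non-empty check); in production you would wire in your SDK and a real evaluation.

```python
# Sketch of a fallback chain: try the cheap model first and
# escalate only when the output fails a quality check.
# `call_model` is a hypothetical stand-in for your provider SDK.

def call_model(model: str, prompt: str) -> str:
    # Placeholder: pretend the small model can't handle long prompts.
    if model == "small" and len(prompt) > 100:
        return ""          # empty answer = failure
    return f"{model}:ok"

def answer(prompt: str, chain=("small", "large")) -> str:
    """Walk the chain cheapest-first; escalate on empty output."""
    for model in chain:
        result = call_model(model, prompt)
        if result:         # passed the (trivial) quality gate
            return result
    raise RuntimeError("all models failed")

print(answer("short task"))   # small model suffices -> small:ok
print(answer("x" * 200))      # escalates to the large model -> large:ok
```

The design point: the escalation rule lives in one place, so you can tighten or loosen the quality gate per workflow without touching call sites.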
Tactic 3: Caching
Savings potential: High for repeated inputs
Effort: Medium to high
Cache embeddings and completion responses for identical or near-identical inputs.
| Action | Example |
| --- | --- |
| Embedding cache | Same document, same embedding — cache it |
| Semantic cache | Similar queries → reuse similar responses when safe |
| Prompt cache (providers that support it) | Reuse long prefix across requests |
Tradeoff: Cache invalidation and hit-rate tuning add complexity.
Do this sprint: Measure how many requests have identical or near-identical inputs. If more than 10 percent, caching is worth exploring.
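The simplest win here is an embedding cache keyed by a content hash: identical documents never pay for a second embedding call. The sketch below uses a fake in-memory embedding function to keep it self-contained; swap in your provider's endpoint and a persistent store in practice.

```python
import hashlib

# Minimal embedding cache keyed by a SHA-256 content hash.
# `embed` is a hypothetical stand-in for a real embedding endpoint.

_cache: dict[str, list[float]] = {}
calls = 0

def embed(text: str) -> list[float]:
    global calls
    calls += 1                      # count "real" API calls
    return [float(len(text))]       # fake embedding for the sketch

def cached_embed(text: str) -> list[float]:
    key = hashlib.sha256(text.encode()).hexdigest()
    if key not in _cache:
        _cache[key] = embed(text)   # miss: pay for the call once
    return _cache[key]

cached_embed("same doc")
cached_embed("same doc")            # cache hit, no second call
print(calls)                        # 1
```

Hashing the exact content sidesteps invalidation for immutable documents; semantic caching of completions is harder and is where the hit-rate tuning complexity from the tradeoff above comes in.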
Tactic 4: Batching
Savings potential: Medium for bulk workloads
Effort: Medium
Batch non-real-time requests to reduce overhead and sometimes unlock cheaper pricing.
| Action | Example |
| --- | --- |
| Batch embeddings | Process 100 documents in one API call instead of 100 |
| Batch classification | Run overnight batch instead of real-time where possible |
| Use batch APIs where available | Some providers offer batch endpoints with lower effective cost |
Tradeoff: Latency. Only for workloads that can wait.
Do this sprint: Identify background or batch workloads. Check whether they are already batched. If not, add batching for the largest one.
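Batching is often just a chunking helper between your document list and the API. The sketch below groups items into fixed-size batches; the size of 100 is illustrative, so check your provider's per-request limits.

```python
# Batch helper: group documents so one API call embeds many
# documents. Batch size 100 is an example, not a provider limit.

def batches(items, size=100):
    """Yield consecutive slices of at most `size` items."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

docs = [f"doc-{i}" for i in range(250)]
sizes = [len(b) for b in batches(docs)]
print(sizes)   # [100, 100, 50] -> 3 calls instead of 250
```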
Tactic 5: Retrieval optimization
Savings potential: Medium to high for RAG systems
Effort: Medium
Over-retrieval burns tokens. Retrieve less, rank better, and filter earlier. Before building RAG at all, check whether the RAG vs fine-tuning cost tradeoffs suggest full-context or fine-tuning instead.
| Action | Example |
| --- | --- |
| Reduce retrieved chunks | Top 3 instead of top 10 when quality holds |
| Improve chunking | Smaller, more focused chunks reduce noise |
| Pre-filter before retrieval | Exclude irrelevant documents earlier |
| Use cheaper embeddings for pre-filter | Two-stage: cheap embedding filter, then expensive retrieval |
Tradeoff: Recall can drop if you over-optimize. Measure retrieval quality.
Do this sprint: For your main RAG workflow, try reducing from top-K to top-(K-3) and measure quality. If it holds, you save tokens. For small corpora, compare embeddings against full-context cost efficiency; you may be able to skip retrieval entirely.
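The two-stage pattern from the table can be sketched in a few lines: a cheap pre-filter cuts the candidate set before the expensive similarity ranking, and a smaller top-k caps the tokens sent to the model. Relevance scores are stubbed here; in practice they come from your embedding model.

```python
# Two-stage retrieval sketch: cheap keyword pre-filter, then
# top-k ranking on stubbed relevance scores.

def prefilter(docs, query_terms):
    """Keep only docs sharing at least one term with the query."""
    return [d for d in docs if any(t in d["text"] for t in query_terms)]

def top_k(docs, k=3):
    """Rank by relevance score and keep the k best."""
    return sorted(docs, key=lambda d: d["score"], reverse=True)[:k]

docs = [
    {"text": "billing invoice overdue", "score": 0.9},
    {"text": "invoice payment terms",   "score": 0.7},
    {"text": "holiday party planning",  "score": 0.8},
    {"text": "invoice dispute process", "score": 0.6},
]
hits = top_k(prefilter(docs, ["invoice"]), k=3)
print([d["text"] for d in hits])
```

Note the off-topic "holiday party" doc never reaches the ranking stage despite its high score; that is the pre-filter doing the cheap work first.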
Tactic 6: Response truncation and structure
Savings potential: Medium for long outputs
Effort: Low
Shorter outputs cost less. Use max_tokens, structured output, and truncation where possible.
| Action | Example |
| --- | --- |
| Set max_tokens | Cap completions when you know the expected length |
| Use JSON mode | Structured output reduces verbose prose |
| Truncate summaries | "Summarize in 2 sentences" instead of open-ended |
Tradeoff: User experience suffers if outputs feel cut off. Test before rollout.
Do this sprint: Check your top 3 workflows for max_tokens. Set or tighten where safe.
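Per-workflow caps are easy to centralize in a small lookup so every call site picks up the right limit. The cap values below are illustrative assumptions, not recommendations; tune them against your actual output lengths.

```python
# Per-workflow output caps: set max_tokens from the expected
# answer length. Values are illustrative, not recommendations.

CAPS = {
    "classification": 16,    # a label needs very few tokens
    "summary": 150,          # "summarize in 2 sentences"
    "default": 1024,
}

def request_params(workflow: str, prompt: str) -> dict:
    """Build provider-agnostic request params with a token cap."""
    return {
        "prompt": prompt,
        "max_tokens": CAPS.get(workflow, CAPS["default"]),
    }

print(request_params("classification", "Is this spam?")["max_tokens"])  # 16
```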
How to measure whether a change worked
Before and after:
- Pick one dimension: cost per request, cost per feature, or cost per user.
- Establish a baseline for the week before the change.
- Roll out the change to a subset if possible.
- Compare the same metric for the week after.
- If cost drops and quality holds, roll out fully.
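The before/after comparison above reduces to one division per week. The numbers below are illustrative; plug in your own weekly totals.

```python
# Before/after check: compare average cost per request across the
# baseline week and the post-change week. Numbers are illustrative.

def cost_per_request(total_cost: float, requests: int) -> float:
    return total_cost / requests

before = cost_per_request(420.0, 100_000)   # $0.0042/request
after = cost_per_request(290.0, 100_000)    # $0.0029/request
change = (after - before) / before
print(f"{change:.0%}")   # -31%
```

Normalizing by requests (or by user, or by feature) matters: raw spend can rise simply because usage grew, while the per-unit metric still shows the optimization worked.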
StackSpend helps by showing spend by provider, model, feature, and category. You can compare before and after optimization work in one view. See AI cost monitoring.
Tactic ranking matrix
| Tactic | Savings | Effort | Do first if |
| --- | --- | --- | --- |
| Prompt compression | High | Low–med | Context-heavy workflows |
| Smaller model fallback | Very high | Medium | Simple tasks (classification, extraction) dominate |
| Caching | High | Med–high | Repeated inputs common |
| Batching | Medium | Medium | Bulk or background workloads |
| Retrieval optimization | Med–high | Medium | RAG is a big cost driver |
| Response truncation | Medium | Low | Long outputs are common |