AI spend spikes rarely arrive with one clean root cause. They usually come from some combination of increased traffic, heavier prompts, retry loops, routing mistakes, and changes in model choice. The fastest way to debug them is to treat them like an incident.
This runbook is for developers and product managers who need to answer three questions quickly:
- What changed?
- Is it still happening?
- What should we stop, fix, or limit first?
Quick answer: what causes most AI spend spikes?
In practice, most spikes come from one of these:
- traffic increased more than expected,
- a fallback or routing change moved requests onto a more expensive model,
- prompt size increased,
- retries or loops multiplied request volume,
- or one feature suddenly became much more active.
That is why the first job is not "optimize costs." The first job is "find the delta."
Step 1: confirm the spike is real
Before you change anything, confirm the increase using the provider's billing or usage source of truth.
- OpenAI: check the usage dashboard or organization usage and costs APIs.
- Anthropic: check the Admin API cost report or usage reports.
- AWS Bedrock: check AWS Cost Explorer and related service usage.
- Vertex AI: check Cloud Billing and any relevant Vertex usage views.
- Azure OpenAI: check Azure Cost Analysis and the Foundry/Azure cost views.
Do not start by looking only at application logs. Start with billing or usage data so you know whether the problem is provider-side, app-side, or both.
Step 2: isolate the time window
Your goal is to answer: when did it start?
Use the narrowest window you can find:
- hourly if the spike is same-day,
- daily if the spike spans several days,
- weekly only if the change is gradual.
This matters because once you know the start window, you can compare it against deploys, feature flags, traffic changes, and routing changes.
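Finding the start window can be mechanical. Here is a minimal sketch, assuming you can export hourly cost totals as `(timestamp, cost)` pairs from your provider's usage data; the baseline window and threshold are illustrative and should be tuned to your own variance.

```python
from datetime import datetime, timedelta

def find_spike_start(hourly_costs, baseline_hours=24, threshold=2.0):
    """Return the first hour whose cost exceeds `threshold` times the
    trailing-baseline average, or None if no spike is found.

    hourly_costs: list of (datetime, cost) tuples, sorted by time.
    """
    for i in range(baseline_hours, len(hourly_costs)):
        window = [c for _, c in hourly_costs[i - baseline_hours:i]]
        baseline = sum(window) / len(window)
        ts, cost = hourly_costs[i]
        if baseline > 0 and cost > threshold * baseline:
            return ts
    return None

# Flat $1/hour for a day, then a jump to $5/hour.
start = datetime(2024, 6, 1)
costs = [(start + timedelta(hours=h), 1.0) for h in range(24)]
costs += [(start + timedelta(hours=24 + h), 5.0) for h in range(6)]
print(find_spike_start(costs))  # first hour of the jump: 2024-06-02 00:00:00
```

Once the function returns a timestamp, diff it against your deploy and feature-flag history for that hour.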
Step 3: identify whether the spike is volume, price, or prompt size
Most AI cost incidents reduce to one of three patterns:
- Volume: you are serving more requests than before.
- Price: the same requests are now routed to a more expensive model.
- Prompt size: each request now consumes more tokens.
This is the most useful debugging split. If you skip it, you waste time optimizing the wrong thing.
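Because cost is roughly requests × tokens per request × price per token, you can attribute the spike by comparing how much each factor grew. A minimal sketch, assuming you can compute these three aggregates for a before and an after window (the field names are illustrative):

```python
def explain_cost_delta(before, after):
    """Attribute a cost change to volume, prompt size, or price.

    before/after: dicts with 'requests', 'tokens_per_request',
    and 'price_per_million_tokens'. Returns each factor's growth
    ratio so the biggest mover stands out.
    """
    return {
        key: after[key] / before[key]
        for key in ("requests", "tokens_per_request", "price_per_million_tokens")
    }

before = {"requests": 100_000, "tokens_per_request": 800, "price_per_million_tokens": 3.0}
after  = {"requests": 105_000, "tokens_per_request": 2_400, "price_per_million_tokens": 3.0}
ratios = explain_cost_delta(before, after)
print(ratios)
# tokens_per_request tripled while volume and price barely moved:
# this spike is a prompt-size problem, not a traffic problem.
```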
Step 4: compare model mix before and after
One of the fastest ways to explain a spike is to look at model distribution:
- Did a feature move from a mid-tier model to a flagship model?
- Did a fallback send a larger share of traffic to an expensive provider?
- Did a "safe default" silently become a costly default?
Even small routing changes matter. A feature using a stronger model for 10% of traffic can be fine. The same model at 100% of traffic can rewrite your monthly forecast in a day.
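The model-mix comparison is a simple share calculation. A sketch, assuming you can export request counts per model for the two windows (the model names here are hypothetical):

```python
from collections import Counter

def model_mix(rows):
    """Share of traffic per model from (model, request_count) rows."""
    totals = Counter()
    for model, count in rows:
        totals[model] += count
    grand = sum(totals.values())
    return {model: count / grand for model, count in totals.items()}

# Before vs. after a routing change (hypothetical model tiers).
before = model_mix([("mid-tier", 90_000), ("flagship", 10_000)])
after  = model_mix([("mid-tier", 20_000), ("flagship", 80_000)])
print(before)  # {'mid-tier': 0.9, 'flagship': 0.1}
print(after)   # {'mid-tier': 0.2, 'flagship': 0.8}
```

A flagship share jumping from 10% to 80% with flat request volume explains the bill on its own.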
Step 5: check prompt and output size
If request counts stayed flat, look at token volume per request.
For example:
- a new system prompt may have doubled every request,
- a retrieval layer may now be injecting too many chunks,
- a code generation flow may be producing far longer outputs than before,
- or a cached prompt may have stopped hitting cache.
For OpenAI and Anthropic in particular, token-level usage is often the fastest way to explain why the bill moved before traffic did.
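A per-request token profile makes these cases visible. A sketch, assuming your usage export gives per-request token counts; the field names (`input_tokens`, `output_tokens`, `cached_tokens`) are illustrative and should be mapped to whatever your provider's export actually calls them:

```python
def token_profile(records):
    """Average input/output tokens per request, plus the share of
    input tokens served from cache (a sudden drop suggests a cache
    miss regression)."""
    n = len(records)
    total_input = sum(r["input_tokens"] for r in records)
    return {
        "avg_input": total_input / n,
        "avg_output": sum(r["output_tokens"] for r in records) / n,
        "cache_hit_ratio": sum(r["cached_tokens"] for r in records)
                           / max(1, total_input),
    }

before = token_profile([
    {"input_tokens": 1000, "output_tokens": 200, "cached_tokens": 800},
    {"input_tokens": 1000, "output_tokens": 200, "cached_tokens": 800},
])
after = token_profile([
    {"input_tokens": 1000, "output_tokens": 200, "cached_tokens": 0},
    {"input_tokens": 1000, "output_tokens": 200, "cached_tokens": 0},
])
print(before["cache_hit_ratio"], after["cache_hit_ratio"])  # 0.8 0.0
```

Here traffic and prompt length are unchanged, but the cache hit ratio collapsed, so every request is now billed at full input price.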
Step 6: check for retries, loops, and background jobs
This is the most common engineering failure mode.
Look for:
- repeated requests from one workflow,
- workers retrying non-idempotent calls,
- queue jobs replaying after partial failure,
- agents calling tools or models recursively,
- or rate-limit handling that multiplies requests instead of backing off cleanly.
If your spike happened suddenly and the request shape looks repetitive, suspect automation before you suspect user growth.
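One quick way to test the "automation" hypothesis is to count identical request payloads in your own application logs. A sketch, assuming you can export raw request payloads as strings:

```python
import hashlib
from collections import Counter

def find_repetitive_requests(payloads, top_n=3):
    """Count identical request payloads by hash; heavy repeats
    suggest a retry loop or replaying job, not organic traffic."""
    counts = Counter(
        hashlib.sha256(p.encode()).hexdigest()[:12] for p in payloads
    )
    return counts.most_common(top_n)

log = ["summarize doc 42"] * 500 + ["translate doc 7", "summarize doc 9"]
top = find_repetitive_requests(log)
print(top[0][1])  # 500 copies of one payload: suspect a loop
```

Five hundred byte-identical requests is almost never user growth.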
Step 7: identify the top contributor by feature, team, or customer
Your investigation should end with a ranked list, not a vague answer.
Try to isolate:
- top provider
- top model
- top feature or endpoint
- top workspace, team, or customer
If you cannot do that today, the incident is also telling you that your attribution is too weak. See how to attribute AI costs by feature, team, and customer.
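If your request logs carry tags, the ranked list is a group-by and sort. A sketch, assuming each cost record has a `cost` field plus whatever tags your logging layer attaches (the tag names here are illustrative):

```python
from collections import defaultdict

def top_contributors(records, key):
    """Rank total cost by a tag such as 'model', 'feature', or
    'customer'. Untagged records are bucketed so gaps in
    attribution stay visible."""
    totals = defaultdict(float)
    for r in records:
        totals[r.get(key, "untagged")] += r["cost"]
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)

records = [
    {"feature": "chat", "cost": 120.0},
    {"feature": "batch-summaries", "cost": 900.0},
    {"feature": "chat", "cost": 80.0},
]
print(top_contributors(records, "feature"))
# [('batch-summaries', 900.0), ('chat', 200.0)]
```

Run it once per dimension (provider, model, feature, customer) and the incident report largely writes itself.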
Step 8: decide whether to stop, limit, or fix
Once you know the driver, choose the fastest correct response:
- Stop the workload if it is clearly erroneous or runaway.
- Limit it if the workload is legitimate but temporarily too expensive.
- Fix it if the cost increase is caused by prompt bloat, model routing, or bad retry logic.
The common mistake is trying to optimize before containing the problem.
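For the "limit" case, a deliberately blunt stopgap is often enough while you fix the root cause. A sketch of a daily budget guard wrapped around model calls; this is a hypothetical containment tool, not a long-term cost control:

```python
import threading

class DailyBudgetGuard:
    """Refuse new model calls once an estimated daily spend cap is
    hit. Reset `spent` on your own daily schedule."""

    def __init__(self, daily_cap_usd):
        self.cap = daily_cap_usd
        self.spent = 0.0
        self._lock = threading.Lock()

    def allow(self, estimated_cost_usd):
        with self._lock:
            if self.spent + estimated_cost_usd > self.cap:
                return False  # caller should queue, degrade, or drop
            self.spent += estimated_cost_usd
            return True

guard = DailyBudgetGuard(daily_cap_usd=100.0)
print(guard.allow(60.0))  # True
print(guard.allow(60.0))  # False: would exceed the $100 cap
```

Blocked calls should fail loudly (or fall back to a degraded mode) so containment does not turn into a silent outage.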
What should you do in the first 30 minutes?
Use this checklist:
- Confirm the spike in provider billing or usage data.
- Find the exact start window.
- Compare request volume, token volume, and model mix.
- Check recent deploys, routing changes, and feature flag changes.
- Identify the biggest single contributor.
- Apply a stopgap if spend is still climbing.
That process is faster than opening dashboards at random.
How should product managers help?
PMs usually have context that engineering does not:

- which launches happened,
- which customers were onboarded,
- which experiments were enabled,
- and which usage pattern changes were expected.
PM input is especially useful when the spend spike is real but not actually a bug. Sometimes the answer is "the feature worked." In that case, the next question is whether pricing, routing, or limits need to change.
How do you avoid the next spike?
After the incident, add:
- provider-level daily alerts,
- model-level visibility,
- feature-level attribution,
- and monthly forecast monitoring.
If you only fix the bug and do not improve the instrumentation, the next spike will take the same amount of time to debug.
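A daily alert can be as simple as comparing today's spend to a trailing average. A sketch, assuming daily cost totals are available; the 1.5x multiplier is illustrative and should be tuned to your normal variance:

```python
def spend_alert(daily_costs, multiplier=1.5):
    """Flag the most recent day if it exceeds `multiplier` times
    the trailing seven-day average.

    daily_costs: list of daily totals, oldest first, today last.
    """
    *history, today = daily_costs
    baseline = sum(history[-7:]) / min(7, len(history))
    return today > multiplier * baseline

print(spend_alert([100, 110, 95, 105, 100, 98, 102, 300]))  # True
print(spend_alert([100, 110, 95, 105, 100, 98, 102, 120]))  # False
```

Wire the `True` case into whatever paging or chat alerting you already use; the point is that the next spike is noticed in a day, not at invoice time.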
Bottom line
Investigating an AI spend spike is mostly a delta-analysis exercise:
- confirm the spike,
- find when it started,
- separate volume, model mix, and token size,
- identify the top contributor,
- contain the issue before optimizing it.
The fastest teams treat AI cost problems like production incidents. That mindset usually cuts debugging time in half.
FAQ
What is the first metric I should check?
Check billed or metered usage in the provider source of truth first, then compare request volume, token volume, and model mix.
How do I know whether the spike is a bug or real product growth?
Look for a matching change in traffic, launch activity, or customer usage. If cost grew faster than usage, the problem is often routing, prompt size, or retries.
What if the spike is in Bedrock, Vertex AI, or Azure OpenAI rather than direct API usage?
Use the cloud billing tools first. Those services often show up inside wider cloud spend rather than as a standalone vendor bill.
How fast should I act?
If spend is still rising, contain first and optimize second. Treat active cost spikes like incidents.
What if I cannot identify the top feature or customer?
That means your attribution layer needs work. Add provider, model, feature, team, and customer tagging or grouping before the next incident.
Should I switch to a cheaper model immediately?
Only if you know it will preserve the required quality. In incidents, safer first moves are rate-limiting, pausing a workflow, or reverting a bad routing change.