Use this when your team is changing prompts, models, tools, or retrieval settings and wants a safer way to ship.
The short answer: evaluate by task type, not with one generic “quality score.” Build a small gold dataset, define one or two release metrics that matter, and check every meaningful model or prompt change against them.
What you will get in 11 minutes
- A lightweight evaluation loop you can actually run
- Metrics by task type instead of one blended score
- A release-gate pattern for model and prompt changes
- A worksheet for turning production failures into eval cases
Use this when
- You are switching models or changing prompts
- Retrieval quality is under review
- A workflow feeds customers, ops teams, or downstream systems
- Team debates are currently settled by anecdotes
The 60-second answer
An evaluation system needs four parts:
- a clearly defined task
- a representative dataset
- task-specific metrics
- a release threshold
OpenAI's eval guidance frames this well: define the task, run the eval with test inputs, then analyze the results and iterate. That is much closer to behavior testing than to “let's glance at a few outputs.”
Start with the task, not the model
Bad eval design starts with:
- “Which model is best?”
Good eval design starts with:
- “What output must be correct for this workflow to be safe and useful?”
Examples:
- extraction workflow -> field accuracy
- classifier -> precision and recall by class
- support chat -> escalation correctness and resolution quality
- RAG -> retrieval recall and citation correctness
Metrics by task type
Extraction
- field-level precision
- field-level recall
- F1 by field
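The extraction metrics above can be sketched as a small scoring function. This is a minimal illustration, not a reference implementation: the function name is made up, and it assumes exact string match per field (a wrong predicted value counts as a false positive; swap in fuzzy matching or count it as a false negative too if that fits your workflow).

```python
from collections import defaultdict

def field_f1(gold: list[dict], predicted: list[dict]) -> dict[str, dict[str, float]]:
    """Per-field precision, recall, and F1 over paired gold/predicted records."""
    tp, fp, fn = defaultdict(int), defaultdict(int), defaultdict(int)
    for g, p in zip(gold, predicted):
        for field in g:
            if field in p and p[field] == g[field]:
                tp[field] += 1          # correct value for a gold field
            elif field in p:
                fp[field] += 1          # field present but value wrong
            else:
                fn[field] += 1          # gold field missing from prediction
        for field in p:
            if field not in g:
                fp[field] += 1          # hallucinated extra field
    scores = {}
    for field in set(tp) | set(fp) | set(fn):
        prec = tp[field] / (tp[field] + fp[field]) if tp[field] + fp[field] else 0.0
        rec = tp[field] / (tp[field] + fn[field]) if tp[field] + fn[field] else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        scores[field] = {"precision": prec, "recall": rec, "f1": f1}
    return scores
```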
Classification or routing
- accuracy
- precision and recall by label
- false-positive and false-negative cost
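Per-label precision and recall for a classifier or router can be computed the same way, one counter per label. Again a sketch with an invented function name, assuming one predicted label per example:

```python
from collections import Counter

def per_label_metrics(gold: list[str], pred: list[str]) -> dict[str, dict[str, float]]:
    """Precision and recall for each label over paired gold/predicted labels."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1   # predicted this label when it was wrong
            fn[g] += 1   # missed the true label
    out = {}
    for label in set(gold) | set(pred):
        prec = tp[label] / (tp[label] + fp[label]) if tp[label] + fp[label] else 0.0
        rec = tp[label] / (tp[label] + fn[label]) if tp[label] + fn[label] else 0.0
        out[label] = {"precision": prec, "recall": rec}
    return out
```

Weight the per-label numbers by the false-positive and false-negative cost of each label before deciding which model wins.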
Retrieval or RAG
- Recall@k
- Hit@k
- citation correctness
- unsupported-claim rate
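Recall@k and Hit@k are a few lines each. This sketch assumes you can identify documents by a stable ID and already know the relevant set per query:

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of relevant document IDs that appear in the top-k results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)

def hit_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """1.0 if any relevant document appears in the top-k, else 0.0."""
    return 1.0 if set(retrieved[:k]) & relevant else 0.0
```

Citation correctness and unsupported-claim rate usually need a rubric or an LLM judge rather than set arithmetic, so keep them as separate, clearly labeled metrics.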
Chat or support workflows
- resolution rate
- escalation accuracy
- containment rate
- human review score
Tool-use workflows
- task completion rate
- tool-call success rate
- recovery success rate
- average step count
If you cannot name the right metric, you probably have not defined the workflow tightly enough.
Build the smallest useful eval set
Do not wait for a massive benchmark.
Start with:
- 25 to 50 representative examples for one workflow
- known-good labels, fields, or rubric scores
- a few hard edge cases
- a few recent real failures
Then grow the dataset as production teaches you where the workflow breaks.
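A minimal case format makes that growth cheap. One possible shape (the field names here are suggestions, not a standard): store cases as records with an input, a known-good expected output, and tags so edge cases and production failures stay identifiable as the set grows.

```python
from dataclasses import dataclass, field

@dataclass
class EvalCase:
    """One gold example for one workflow."""
    case_id: str
    input: str                 # the raw input the workflow receives
    expected: dict             # known-good labels, fields, or rubric targets
    tags: list[str] = field(default_factory=list)  # e.g. ["edge-case", "prod-failure"]
```

Serialized as JSONL, a file of these records is easy to diff, review, and append to when production teaches you something new.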
Offline evals vs online signals
You need both.
Offline evals
Good for:
- comparing prompts
- testing model swaps
- checking regression before release
Online signals
Good for:
- monitoring drift after deployment
- tracking review rate
- seeing whether latency or cost changed
- catching new failure modes from real traffic
Offline evals tell you whether to ship. Online signals tell you whether the shipped system is staying healthy.
A practical release-gate pattern
For one workflow:
- define one primary metric and one guardrail metric
- set a minimum threshold
- compare candidate vs baseline
- reject the change if the primary metric drops or the guardrail moves beyond its allowed delta
Example:
- primary metric: extraction F1 must not drop below baseline
- guardrail: review rate must not increase more than 5 points
This prevents “cheaper but worse” rollouts from slipping through because someone liked two sample outputs.
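The gate itself can be a few lines of code run in CI. A sketch, assuming metric dicts keyed by name and a "lower is better" guardrail such as review rate (the function name and signature are illustrative):

```python
def release_gate(baseline: dict, candidate: dict,
                 primary: str, guardrail: str,
                 guardrail_max_delta: float) -> bool:
    """Return True only if the candidate may ship.

    The primary metric must not drop below baseline, and the guardrail
    (lower is better, e.g. review rate in points) must not worsen by
    more than guardrail_max_delta.
    """
    if candidate[primary] < baseline[primary]:
        return False
    if candidate[guardrail] - baseline[guardrail] > guardrail_max_delta:
        return False
    return True
```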
Turn failures into eval cases
Every repeated production failure should become one of:
- a permanent regression case
- a new edge-case bucket
- a new routing or escalation rule
That is how the eval set becomes a living asset instead of a one-time spreadsheet.
Observability still matters
Evals do not replace runtime measurement.
Track:
- latency
- cost per successful task
- retry rate
- review rate
- fallback or escalation rate
A model change that keeps quality flat but doubles latency or cost is still a product decision, not a free improvement.
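The runtime signals above can be rolled into one health summary per workflow window. A sketch with invented field names; the key detail is dividing cost by successes, not by calls, so retries and failures show up as waste:

```python
from dataclasses import dataclass

@dataclass
class RunStats:
    calls: int
    successes: int
    retries: int
    reviews: int
    escalations: int
    total_cost_usd: float
    total_latency_s: float

def health_report(s: RunStats) -> dict[str, float]:
    """Summarize runtime signals for one workflow over one window."""
    denom = max(s.calls, 1)
    return {
        "cost_per_successful_task": s.total_cost_usd / s.successes if s.successes else float("inf"),
        "avg_latency_s": s.total_latency_s / denom,
        "retry_rate": s.retries / denom,
        "review_rate": s.reviews / denom,
        "escalation_rate": s.escalations / denom,
    }
```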
Copyable eval worksheet
For one workflow, write down:
- Task definition
- Desired output contract
- Primary metric
- Guardrail metric
- Gold examples count
- Hard edge cases count
- Release threshold
If any of those is blank, do not call the workflow “evaluated” yet.
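That completeness check is easy to automate. A sketch, with the field names mirroring the worksheet above (the names themselves are illustrative):

```python
WORKSHEET_FIELDS = [
    "task_definition", "output_contract", "primary_metric",
    "guardrail_metric", "gold_examples", "edge_cases", "release_threshold",
]

def missing_fields(worksheet: dict) -> list[str]:
    """Return the worksheet fields that are absent or blank."""
    return [f for f in WORKSHEET_FIELDS if not worksheet.get(f)]
```

If `missing_fields` returns anything, the workflow is not "evaluated" yet.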
Common failure modes
- using one generic rubric for every task
- evaluating only the final answer when retrieval is the real problem
- excluding edge cases because they lower the score
- never updating the eval set after production failures
- ignoring cost and latency after a “quality win”
How StackSpend helps
Model evaluations are easier to act on when they are paired with workflow-level cost data. That lets teams answer the real question: did the quality improvement justify the added spend, or did the cheaper model stay within acceptable thresholds?