Use this when you need model output to feed code, queues, dashboards, or downstream decision logic.
The short answer: if a workflow ends in a parser, database write, route decision, or score, treat the LLM like a typed interface instead of a chatbot. Define the schema first, then prompt the model to fill it.
## What you will get in 10 minutes
- A practical rule for when to use structured outputs
- The difference between extraction, classification, and scoring contracts
- A simple input/output worksheet you can copy into a spec
- Evaluation metrics that fit each task type
## Use this when
- You need consistent keys, enums, or score ranges
- The response will be used by code, not read by a human first
- You are retrying often because free-form answers break parsing
- You want a workflow that is more likely to survive a model swap intact
## The 60-second answer
| If your task is... | Your output contract should look like... |
| --- | --- |
| Extraction | typed fields with nullable values and evidence text |
| Classification | one enum plus optional rationale |
| Scoring | numeric range, rubric version, and explanation |
| Routing | constrained enum plus confidence and escalation reason |
The core idea is simple: write the output schema before you write the prompt.
OpenAI's Structured Outputs guide recommends schema-constrained outputs over older JSON mode when supported, because schema adherence is stronger than “valid JSON” alone. That matters most when your workflow depends on required fields, valid enums, or programmatic refusals.
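As a concrete sketch, a classification contract can be written as a JSON Schema before any prompt exists. The label set below is illustrative, not prescriptive; `additionalProperties: false` and the `required` list are what make malformed output detectable.

```json
{
  "type": "object",
  "properties": {
    "label": {
      "type": "string",
      "enum": ["billing", "bug", "feature_request", "other"]
    },
    "needsReview": { "type": "boolean" }
  },
  "required": ["label", "needsReview"],
  "additionalProperties": false
}
```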
## Why free-form prompting breaks in production
Free-form prompting usually fails in one of four ways:
- A required field is omitted.
- A label falls outside the allowed options.
- The model answers the question but not in the expected shape.
- Safety refusal or ambiguity is mixed into the payload in a way your code cannot interpret.
Those failures create hidden engineering cost:
- retry loops
- brittle regex parsing
- silent misroutes
- human cleanup work
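A contract check at the boundary turns these failures into explicit errors instead of hidden cost. This is a minimal sketch, assuming a hypothetical classification contract with a required `label` and `needsReview` field; the label set is illustrative.

```python
from typing import Any

# Hypothetical contract: required keys and an allowed label set.
ALLOWED_LABELS = {"billing", "bug", "feature_request", "other"}
REQUIRED_FIELDS = {"label", "needsReview"}

def validate_payload(payload: dict[str, Any]) -> list[str]:
    """Return a list of contract violations instead of crashing mid-pipeline."""
    errors: list[str] = []
    missing = REQUIRED_FIELDS - payload.keys()
    if missing:
        errors.append(f"missing required fields: {sorted(missing)}")
    label = payload.get("label")
    if label is not None and label not in ALLOWED_LABELS:
        errors.append(f"label {label!r} outside allowed set")
    return errors
```

Collecting violations rather than raising on the first one makes retry and review decisions easier downstream.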
## Pattern 1: Extraction
Use extraction when you want the model to normalize messy text into fields.
Examples:
- invoice fields from PDFs
- lead qualification from sales notes
- issue metadata from support conversations
Good extraction schema shape:
```json
{
  "customerName": "string | null",
  "contractValue": "number | null",
  "currency": "enum | null",
  "evidence": [
    {
      "field": "string",
      "quote": "string"
    }
  ],
  "needsReview": "boolean"
}
```
Design rules:
- Allow `null` when evidence is missing.
- Capture evidence for high-value fields.
- Include `needsReview` for uncertain cases.
- Do not force the model to invent missing values.
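These rules can be enforced in a few lines of validation. This is a sketch against an assumed contract mirroring the schema above (field names and the currency enum are illustrative); it reports violations rather than filling in missing values.

```python
import json

# Assumed extraction contract: field -> (expected type, nullable?)
EXTRACTION_CONTRACT = {
    "customerName": (str, True),
    "contractValue": ((int, float), True),
    "currency": (str, True),
    "needsReview": (bool, False),
}

ALLOWED_CURRENCIES = {"USD", "EUR", "GBP"}  # illustrative enum

def check_extraction(raw: str) -> tuple[dict, list[str]]:
    """Parse model output and report contract violations; never invent values."""
    record = json.loads(raw)
    errors: list[str] = []
    for field, (typ, nullable) in EXTRACTION_CONTRACT.items():
        if field not in record:
            errors.append(f"missing field: {field}")
            continue
        value = record[field]
        if value is None:
            if not nullable:
                errors.append(f"{field} may not be null")
        elif not isinstance(value, typ):
            errors.append(f"{field} has wrong type: {type(value).__name__}")
    currency = record.get("currency")
    if currency is not None and currency not in ALLOWED_CURRENCIES:
        errors.append(f"currency {currency!r} outside enum")
    return record, errors
```

A record with any errors should route to review, not to the database.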
### How to evaluate extraction
- Field-level precision
- Field-level recall
- F1 by field
- Review rate for uncertain records
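Field-level precision and recall fall out of a simple set comparison per record. This sketch treats a predicted non-null value as correct only when it matches the gold value exactly; real pipelines often add per-field normalization first.

```python
def field_metrics(predicted: dict, gold: dict) -> dict[str, float]:
    """Per-record field-level precision, recall, and F1.

    A field counts as predicted (or present in gold) only when non-null.
    """
    pred_fields = {k for k, v in predicted.items() if v is not None}
    gold_fields = {k for k, v in gold.items() if v is not None}
    correct = {k for k in pred_fields & gold_fields if predicted[k] == gold[k]}
    precision = len(correct) / len(pred_fields) if pred_fields else 0.0
    recall = len(correct) / len(gold_fields) if gold_fields else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}
```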
## Pattern 2: Classification
Use classification when the answer must come from a short, known label set.
Examples:
- support ticket triage
- policy category assignment
- urgency or sentiment bucketing
Good classification schema shape:
```json
{
  "label": "billing | bug | feature_request | other",
  "confidenceBand": "high | medium | low",
  "reason": "string",
  "needsReview": "boolean"
}
```
Design rules:
- Keep the label set small and explicit.
- Make “other” available when appropriate.
- Separate label from explanation.
- Use confidence bands for routing, not as a substitute for evaluation.
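One way to apply these rules in code is to coerce any out-of-contract classification into a reviewable record instead of letting a bad label propagate. A sketch, using the same illustrative label and band sets as the schema above:

```python
LABELS = {"billing", "bug", "feature_request", "other"}
BANDS = {"high", "medium", "low"}

def normalize_classification(payload: dict) -> dict:
    """Fall back to "other" / "low" and flag review on any contract violation."""
    label = payload.get("label")
    band = payload.get("confidenceBand")
    needs_review = bool(payload.get("needsReview", False))
    if label not in LABELS:
        label, needs_review = "other", True
    if band not in BANDS:
        band, needs_review = "low", True
    return {
        "label": label,
        "confidenceBand": band,
        "reason": str(payload.get("reason", "")),
        "needsReview": needs_review,
    }
```

Keeping the fallback in code, not in the prompt, means a model swap cannot silently change the escalation policy.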
### How to evaluate classification
- Accuracy
- Precision and recall by label
- Confusion matrix
- False-positive cost for sensitive routes
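A confusion matrix and per-label precision/recall need nothing beyond the standard library. A sketch over (gold, predicted) pairs:

```python
from collections import Counter

def confusion_counts(pairs: list[tuple[str, str]]) -> Counter:
    """Count (gold, predicted) label pairs as a sparse confusion matrix."""
    return Counter(pairs)

def label_precision_recall(pairs: list[tuple[str, str]], label: str) -> tuple[float, float]:
    """Precision and recall for one label from (gold, predicted) pairs."""
    matrix = Counter(pairs)
    tp = matrix[(label, label)]
    fp = sum(n for (g, p), n in matrix.items() if p == label and g != label)
    fn = sum(n for (g, p), n in matrix.items() if g == label and p != label)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall
```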
## Pattern 3: Scoring
Use scoring when the model is grading against a rubric rather than picking a label.
Examples:
- lead quality score
- support response quality score
- document relevance score before reranking
Good scoring schema shape:
```json
{
  "score": 4,
  "scaleMin": 1,
  "scaleMax": 5,
  "rubricVersion": "v1",
  "reason": "string",
  "needsReview": false
}
```
Design rules:
- Version the rubric.
- Keep the scale fixed.
- Explain what each score means outside the prompt as well.
- Never use a model score operationally without sampling and calibration.
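A scoring payload can be gated the same way: reject scores outside the fixed scale or from a stale rubric, and route both cases to review rather than trusting the number. A minimal sketch against the schema shape above:

```python
def validate_score(payload: dict, rubric_version: str = "v1") -> dict:
    """Flag for review any score outside the scale or from the wrong rubric."""
    lo, hi = payload.get("scaleMin", 1), payload.get("scaleMax", 5)
    score = payload.get("score")
    ok = (
        isinstance(score, int)
        and lo <= score <= hi
        and payload.get("rubricVersion") == rubric_version
    )
    return {**payload, "needsReview": payload.get("needsReview", False) or not ok}
```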
### How to evaluate scoring
- Correlation with human raters
- Agreement by bucket
- Threshold precision for “send” vs “review”
- Drift over time after prompt or model changes
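Agreement by bucket is easy to compute without any statistics dependency. This sketch reports exact and within-one agreement between model and human scores; it is a cruder signal than a full correlation check, but useful for weekly drift monitoring.

```python
def rater_agreement(model_scores: list[int], human_scores: list[int]) -> dict[str, float]:
    """Fraction of items where model and human agree exactly, or within one point."""
    pairs = list(zip(model_scores, human_scores))
    exact = sum(m == h for m, h in pairs) / len(pairs)
    within_one = sum(abs(m - h) <= 1 for m, h in pairs) / len(pairs)
    return {"exact": exact, "within_one": within_one}
```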
## Function calling vs structured outputs
Use structured outputs when the model should return a shaped answer.
Use function or tool calling when the model should trigger an action.
A good pattern is:
- first classify or extract with a schema
- then call a tool only if the output passes validation
That keeps business logic outside the model while still using the model for interpretation.
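The classify-validate-dispatch pattern can be sketched as below. The tool registry and its handlers are hypothetical; the point is that the model only picks a label, while escalation and actions stay in code.

```python
LABELS = {"billing", "bug", "feature_request", "other"}

# Hypothetical tool registry: business logic lives here, not in the model.
TOOLS = {
    "billing": lambda ticket: f"routed to billing queue: {ticket}",
    "bug": lambda ticket: f"opened engineering issue: {ticket}",
}

def dispatch(classification: dict, ticket: str) -> str:
    """Call a tool only when the model output passes validation."""
    label = classification.get("label")
    if label not in LABELS or classification.get("needsReview"):
        return "escalated to human review"
    handler = TOOLS.get(label)
    return handler(ticket) if handler else "escalated to human review"
```

Labels with no registered tool fall through to review by default, which is usually the safer failure mode.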
## Failure modes to watch
- Forcing required values when the input is incomplete
- Treating explanations as evidence
- Using confidence as truth instead of as a review signal
- Letting the schema grow until it becomes a fragile mini-database
If a task needs exact calculation, deterministic validation, or stable lookups, do that in code after the model step.
## Copyable I/O contract worksheet
Before shipping, answer these seven questions:
- What exact decision or record does this output power?
- Which fields are required, optional, or nullable?
- Which values are constrained enums?
- What counts as insufficient evidence?
- When should the workflow escalate to review?
- Which metrics will you check weekly?
- Which downstream system will reject malformed output?
## How StackSpend helps
Schema-based workflows are easier to measure because the task boundary is clearer. That makes it easier to track cost by workflow, compare model tiers, and see whether retries or review queues are increasing spend after launch.