Use this when your team is building a support assistant, product copilot, or internal chat workflow that needs to do more than answer one isolated question.
The short answer: production chat works best when you separate session memory, retrieval, tool use, and escalation. A single giant prompt is usually the wrong architecture after the prototype stage.
What you will get in 11 minutes
- A practical chat architecture for production systems
- Which kinds of memory actually help
- When to escalate instead of pushing the assistant harder
- The key metrics for containment and handoff quality
Use this when
- Users return to the same conversation over time
- The assistant needs retrieval or tool access
- Wrong answers create support, compliance, or trust problems
- Human takeover is part of the service design
The 60-second answer
Think about production chat as four layers:
- session state
- memory and retrieval
- action layer
- escalation layer
Anthropic's support-chat guidance is useful here because it treats chat as a collection of tasks: greeting, information retrieval, staying on topic, taking action, and escalating when needed. That is a better mental model than “one chatbot prompt.”
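The four layers can be sketched as a minimal turn handler. This is an illustrative skeleton, not any particular framework's API; the class and method names are assumptions made for the example:

```python
from dataclasses import dataclass


@dataclass
class TurnResult:
    reply: str
    escalated: bool = False


class ChatPipeline:
    """Illustrative four-layer pipeline: session state, memory/retrieval, action, escalation."""

    def __init__(self):
        self.session_state = {"turns": []}              # layer 1: session state

    def handle_turn(self, user_message: str) -> TurnResult:
        self.session_state["turns"].append(user_message)
        context = self.retrieve(user_message)           # layer 2: memory and retrieval
        if self.needs_escalation(user_message):         # layer 4: escalation
            return TurnResult("Connecting you to a human agent.", escalated=True)
        return TurnResult(self.act(user_message, context))  # layer 3: action

    # Stand-ins: a real system would plug in a retriever, a classifier, and tools.
    def retrieve(self, msg):
        return []

    def needs_escalation(self, msg):
        return "human" in msg.lower()

    def act(self, msg, ctx):
        return f"Answer grounded in {len(ctx)} retrieved snippets."
```

The point of the skeleton is that each layer is a replaceable seam, not that these toy implementations are adequate.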
Memory is not one thing
When teams say “we need memory,” they often mix together several separate jobs.
Session memory
Use for:
- current conversation state
- recent clarifications
- the last few actions
This is short-lived and should stay tightly scoped.
User or account memory
Use for:
- preferences
- known account attributes
- durable profile context
This should not be rebuilt from every session transcript.
Retrieval memory
Use for:
- policy docs
- help center content
- product knowledge
This is not “memory” in the usual sense. It is grounded context pulled in when needed.
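Keeping the three jobs in separate structures makes the distinction concrete. A minimal sketch, with naive stand-ins for the real stores and retriever:

```python
from dataclasses import dataclass, field


@dataclass
class SessionMemory:
    """Short-lived and tightly scoped: lives only as long as the conversation."""
    summary: str = ""
    recent_actions: list = field(default_factory=list)


@dataclass
class AccountMemory:
    """Durable profile context loaded from a store, not rebuilt from transcripts."""
    preferences: dict = field(default_factory=dict)
    attributes: dict = field(default_factory=dict)


class RetrievalMemory:
    """Not memory at all: grounded documents fetched when a turn needs them."""

    def __init__(self, docs):
        self.docs = docs

    def search(self, query):
        # Naive keyword match as a stand-in for a real retriever.
        words = query.lower().split()
        return [d for d in self.docs if any(w in d.lower() for w in words)]
```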
A good production chat architecture
1. Route the incoming turn
First, decide what kind of turn it is:
- informational question
- action request
- account-specific issue
- off-topic or unsupported
- escalation candidate
That small routing step creates cleaner downstream behavior than trying to solve everything in one prompt.
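A toy router makes the shape of this step clear. Production systems would use a classifier or a small model call rather than keyword matching; the keywords here are purely illustrative:

```python
def route_turn(message: str) -> str:
    """Classify a turn into one of the five types above (toy keyword version)."""
    msg = message.lower()
    if any(w in msg for w in ("agent", "human", "speak to someone")):
        return "escalation_candidate"
    if any(w in msg for w in ("refund", "cancel", "change my")):
        return "action_request"
    if any(w in msg for w in ("my account", "my order", "my invoice")):
        return "account_issue"
    if any(w in msg for w in ("weather", "joke")):
        return "off_topic"
    return "informational"
```

Even this crude version shows why routing helps: each branch can get its own context, prompt, and tools downstream.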
2. Add only the right context
Do not keep appending every previous turn forever.
Prefer:
- short session summary
- relevant account facts
- retrieved policy or product snippets
- tool results in structured form
That keeps the prompt smaller and easier to control.
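A context builder that assembles only these pieces might look like the following sketch. The field names and cap on snippets are assumptions for illustration:

```python
def build_context(session_summary, account_facts, snippets, tool_results,
                  max_snippets=3):
    """Assemble a compact prompt context instead of carrying the full transcript."""
    parts = []
    if session_summary:
        parts.append(f"Session so far: {session_summary}")
    for key, value in account_facts.items():
        parts.append(f"Account {key}: {value}")
    for snip in snippets[:max_snippets]:          # cap retrieved content
        parts.append(f"Reference: {snip}")
    for name, result in tool_results:             # structured tool output
        parts.append(f"Tool {name} returned: {result}")
    return "\n".join(parts)
```

Capping each category explicitly is what keeps context growth bounded as sessions get longer.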
3. Separate answer generation from action execution
If the user wants an answer, generate it from retrieved or account context.
If the user wants an action, validate the request and call the right tool or human workflow.
Mixing explanation and action logic in one step makes failures harder to debug.
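The separation can be enforced in code by dispatching to distinct paths, with validation gating the action path. The order-number check below is a hypothetical stand-in for real request validation:

```python
import re


def validate_action(message: str) -> bool:
    # Stand-in validation: require an explicit order reference like "#1234".
    return re.search(r"#\d+", message) is not None


def dispatch(turn_type: str, message: str, answer_fn, execute_fn):
    """Keep explanation and action logic in separate code paths."""
    if turn_type == "action_request":
        if not validate_action(message):
            return "I need an order number (like #1234) before I can do that."
        return execute_fn(message)      # tool call or human workflow
    return answer_fn(message)           # answer from retrieved/account context
```

When these paths are separate, a bad answer and a failed action show up as different failures, which is exactly what makes debugging tractable.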
4. Escalate on purpose
Escalation is not failure. It is a product feature.
Escalate when:
- policy risk is high
- the model lacks enough evidence
- the user is frustrated or stuck
- the requested action has business or compliance impact
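The criteria above can be encoded as an explicit decision function, which also gives you a logged reason for every handoff. The threshold value is illustrative, not a recommendation:

```python
def should_escalate(policy_risk: float, evidence_count: int,
                    frustrated: bool, compliance_impact: bool):
    """Return (escalate, reason) from the four criteria. Thresholds are illustrative."""
    if compliance_impact:
        return True, "compliance_impact"
    if policy_risk > 0.7:               # assumed cutoff for "high" risk
        return True, "policy_risk"
    if evidence_count == 0:             # model lacks grounding
        return True, "insufficient_evidence"
    if frustrated:
        return True, "user_frustration"
    return False, ""
```

Returning the reason alongside the decision is what lets you measure escalation accuracy later.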
What a strong handoff includes
A handoff should not dump the full transcript on a human and hope for the best.
Include:
- reason for escalation
- customer intent
- actions already attempted
- relevant retrieved evidence
- confidence or uncertainty summary
That turns escalation into a clean transfer, not a restart.
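The handoff payload is worth defining as an explicit schema rather than free text. A minimal sketch with the five fields above:

```python
from dataclasses import dataclass, field, asdict


@dataclass
class HandoffPayload:
    """Everything a human needs to continue, not restart, the conversation."""
    reason: str
    customer_intent: str
    actions_attempted: list = field(default_factory=list)
    evidence: list = field(default_factory=list)
    uncertainty_note: str = ""

    def to_dict(self) -> dict:
        return asdict(self)
```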
Metrics that matter
For production chat, track:
- containment or deflection rate
- escalation accuracy
- resolution quality
- repeat-contact rate
- average cost per resolved conversation
High containment is not automatically good. If containment rises while repeat contacts or negative reviews rise too, the assistant is probably over-answering.
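Computing these metrics side by side is what exposes the over-answering pattern. A sketch assuming each conversation is logged as a dict with `escalated`, `resolved`, `repeat_contact`, and `cost` fields (hypothetical field names):

```python
def chat_metrics(conversations):
    """Aggregate containment, repeat-contact, and cost metrics from logged sessions."""
    n = len(conversations)
    contained = sum(1 for c in conversations if not c["escalated"])
    resolved = [c for c in conversations if c["resolved"]]
    return {
        "containment_rate": contained / n,
        "repeat_contact_rate": sum(1 for c in conversations if c["repeat_contact"]) / n,
        "cost_per_resolution": sum(c["cost"] for c in conversations) / max(len(resolved), 1),
    }
```

Reading containment and repeat-contact rate together, per workflow, is the check against celebrating a rising containment number on its own.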
Copyable chat architecture checklist
For one workflow, answer:
- What turn types exist?
- What context belongs in session memory?
- What knowledge should come from retrieval instead?
- Which actions require tools?
- Which cases must escalate?
- What should be included in the handoff payload?
Common failure modes
- storing too much transcript instead of summarizing
- treating retrieval as memory
- no explicit escalation path
- using the same prompt for information and actions
- measuring only containment instead of resolution quality
How StackSpend helps
Chat systems often hide cost growth in longer sessions, excessive context carry-forward, and avoidable escalations. Tracking cost by workflow and feature helps teams see whether chat is getting more efficient or just more expensive.