Conversation design & agentic·June 30, 2026·8 min read

Design the conversation, not just the prompt

A guidance product is a designed path through the moments that decide it, not one clever prompt. Map the states, the moves, and the transitions.

A team writes a good system prompt for their coaching agent and ships it. The first session goes fine. The user says hello, the agent asks what is on their mind, the answers are warm and useful. Then a real user arrives who is guarded on turn two, drops something heavy on turn nine, and by turn twenty is ready to commit to a plan and just wants the agent to hold them to it. The same prompt handles all three of those moments the same way. It reflects when the user wanted a push. It pushes when the user needed room. Nothing in the prompt knew which moment it was in, because a prompt is one instruction and a session is a path.

A conversation has a shape, and the prompt is not it

The idea is small and it changes how you build. A guidance product is not a single answer engine. It is a path a user walks, through a handful of recurring situations, and the product's job is to make the right move at each one. Call that path the conversation architecture: the named set of states a user can be in, the move that fits each state, and the allowed transitions between them. It is the map. A prompt describes a voice. The architecture describes where that voice is standing.

Two terms to pin down before we use them. A state is a situation a user arrives in that calls for a different response: guarded and testing you, disclosing something hard, stuck and circling, or ready to act. Routing is the step that reads the current turn, decides which state you are in, and sends the turn down the branch built for that state. A general model does this implicitly and inconsistently, re-improvising the shape of the conversation on every turn. Designing it means writing the states, the moves, and the transitions down so they are the same on turn two and turn two hundred, and so you can test each one.

This is the architecture side of a conversational product. The character side, holding a warm, consistent voice turn to turn, lives in the personality harness. The two are different assets. The harness governs how the agent sounds when it makes a move. The architecture governs which move it makes and when. A product needs both, and confusing them is why teams keep editing the prompt to fix a routing problem.

Why one prompt cannot hold state and flow

The mechanism is worth walking, because the failure looks like a wording problem and is not. A system prompt is a fixed instruction the model reads on every turn. It has no memory of the path, so it re-derives the whole situation from scratch every single turn, and it does the average thing. It tries to be a little supportive, a little curious, and a little directive at once, which is exactly the stacked, hedge-everything reply that feels off. Asking one call to classify the moment, choose the move, and write the reply is asking it to hold three jobs in one instruction, consistently, across a long session. It will not.

Anthropic's Building Effective Agents draws the line that dissolves this. A workflow is a system where the code orchestrates the model through predefined paths. An agent is a system where the model directs its own path. Guidance conversations are mostly the first kind. You know the handful of situations a session moves through, so you can lay track for them instead of hoping the model lays its own each time. Two of Anthropic's named patterns do the work. Prompt chaining decomposes a task into a sequence of steps where each call handles one easier piece and you can add a programmatic check between them. Its stated purpose is to trade latency for higher accuracy by making each call an easier task. Routing classifies an input and directs it to a specialized follow-up, which lets you write a focused prompt per branch instead of one prompt that tries to cover every case at once.

Apply that to a session. One small, fast call reads the turn and labels the state. That is routing. A second call, holding the prompt written for that one state, writes the reply. That is a two-link chain. Now the guarded-user branch and the ready-to-act branch are separate, testable prompts, and the classification is a step you can inspect and correct rather than a decision buried inside a paragraph of reply text. DeepMind's Sparrow is the older proof that decomposition beats one global instruction. Its team broke good dialogue into 23 specific rules and graded each one separately, and under adversarial probing Sparrow broke a rule 8 percent of the time, roughly three times less often than the baseline dialogue model its raters tried to trick. Behavior gets reliable when it is governed in small, checkable units.
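The two-link chain described above can be sketched in a few lines. This is a minimal illustration, not a reference implementation. The state labels, the per-state prompts, the `FALLBACK_STATE` choice, and the `router` and `writer` callables are all assumptions made for the example. Each callable stands in for whatever model client you use, taking a prompt string and returning text.

```python
from typing import Callable

# Assumption: any function that takes a prompt string and returns model text.
LLM = Callable[[str], str]

# Hypothetical per-state prompts. Each branch is a separate, testable asset.
STATE_PROMPTS = {
    "guarded": "The user is testing you. Ask one low-pressure question. Do not push.",
    "disclosing": "The user shared something hard. Reflect it back. Do not advise yet.",
    "stuck": "The user is circling. Name the loop and offer one gentle challenge.",
    "ready": "The user wants to commit. Confirm the plan and agree on a check-in.",
}

# Assumption: when the router returns an unknown label, take the branch
# that presumes the least about the user.
FALLBACK_STATE = "guarded"


def classify_state(turn: str, router: LLM) -> str:
    """Link one: a small call that only labels the state of the turn."""
    labels = ", ".join(STATE_PROMPTS)
    raw = router(f"Label the user's state as one of: {labels}.\nUser: {turn}\nLabel:")
    label = raw.strip().lower()
    return label if label in STATE_PROMPTS else FALLBACK_STATE


def respond(turn: str, router: LLM, writer: LLM) -> tuple[str, str]:
    """Link two: write the reply with the prompt built for that one state."""
    state = classify_state(turn, router)
    reply = writer(f"{STATE_PROMPTS[state]}\nUser: {turn}\nReply:")
    return state, reply
```

Because `respond` returns the label along with the reply, the classification can be logged, inspected, and corrected on its own instead of being inferred from the reply text.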

Draw the map from transcripts, not from a whiteboard

The practice is one exercise, and the order matters. Read real transcripts first, then draw the map. Pull thirty to fifty real sessions and mark, turn by turn, the moments where the agent made the wrong move, then cluster them. You are looking for the five or six states that actually recur, not the twenty you can imagine. A coaching product usually lands on something like opening and testing, disclosing something hard, stuck and circling, ready to commit, and off the rails or in crisis. For each state, write two things. The first is the move: the one thing a response in this state is trying to do, whether that is to reflect, to clarify, to challenge, or to hand off. The second is the allowed transitions: which states this one can lead to, so that a jump from guarded straight to a hard challenge is something your map can flag as wrong.
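Written down, the map can be as small as a table of states, each with one move and a set of allowed next states. The sketch below is one hypothetical shape for it. The state names follow the coaching example above, but the specific moves and transitions are invented for illustration and should come from your own transcripts.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class State:
    name: str
    move: str                     # the one thing a reply in this state tries to do
    allowed_next: frozenset[str]  # states this one may lead to


# Hypothetical map for a coaching product. Every entry is an assumption.
CONVERSATION_MAP = {
    s.name: s
    for s in [
        State("opening", "clarify", frozenset({"opening", "disclosing", "crisis"})),
        State("disclosing", "reflect", frozenset({"disclosing", "stuck", "ready", "crisis"})),
        State("stuck", "challenge", frozenset({"stuck", "disclosing", "ready", "crisis"})),
        State("ready", "commit", frozenset({"ready", "stuck", "crisis"})),
        State("crisis", "hand_off", frozenset({"crisis"})),
    ]
}


def illegal_transitions(path: list[str]) -> list[tuple[int, str, str]]:
    """Return (turn_index, from_state, to_state) for every jump the map forbids."""
    flagged = []
    for i in range(1, len(path)):
        prev, cur = path[i - 1], path[i]
        if cur not in CONVERSATION_MAP[prev].allowed_next:
            flagged.append((i, prev, cur))
    return flagged
```

In this version a path that goes from `opening` directly to `stuck`, whose move is a challenge, is flagged, which is the guarded-to-hard-challenge jump described above. A flagged jump can mean the router mislabeled a turn or that the map needs another edge, and either one is worth a look.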

That map is the durable asset. It is what a general model only improvises. When you swap models, the prompt and the weights change, but the states, the moves, and the transitions are yours, and you re-score the new model against them in an afternoon to see whether it still routes the way you designed. Make each stage testable on its own. Google's Vertex agent evaluation scores not only the final reply but the trajectory, the sequence of steps the system took, with metrics like trajectory_exact_match that check whether the path matched the reference path. For a conversation that means an eval per state (given a turn in this state, did the router pick the right state and the reply make the right move) plus a check on the transitions, so a regression shows up as the agent taking the wrong branch on turn four rather than as a vague sense that the session felt off. That per-turn instrumentation is the same discipline the trust briefings apply to the moments that decide the product, covered in what to measure.
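A hand-rolled version of those checks fits in plain Python. The sketch below does not call the Vertex SDK. It only borrows the idea behind `trajectory_exact_match` and applies it to a list of state labels. The function names, the example paths, and the choice to group accuracy by reference state are assumptions for illustration.

```python
from collections import defaultdict
from typing import Optional


def trajectory_exact_match(predicted: list[str], reference: list[str]) -> float:
    """1.0 if the routed state path equals the reference path, else 0.0."""
    return 1.0 if predicted == reference else 0.0


def first_divergence(predicted: list[str], reference: list[str]) -> Optional[int]:
    """1-based turn where the paths first differ, or None if they are identical."""
    for turn, (p, r) in enumerate(zip(predicted, reference), start=1):
        if p != r:
            return turn
    if len(predicted) != len(reference):
        return min(len(predicted), len(reference)) + 1
    return None


def per_state_accuracy(predicted: list[str], reference: list[str]) -> dict[str, float]:
    """Router accuracy grouped by the reference state of each turn."""
    hits: dict[str, int] = defaultdict(int)
    totals: dict[str, int] = defaultdict(int)
    for p, r in zip(predicted, reference):
        totals[r] += 1
        hits[r] += int(p == r)
    return {state: hits[state] / totals[state] for state in totals}


# Hypothetical labeled session: the router jumps to "ready" one turn early.
reference = ["opening", "opening", "disclosing", "disclosing", "ready"]
predicted = ["opening", "opening", "disclosing", "ready", "ready"]
```

On this example the exact-match score is 0.0, `first_divergence` returns 4, and the accuracy for `disclosing` is 0.5 while the other states score 1.0. The regression shows up as a wrong branch on a specific turn rather than as an overall impression.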

The rigid flowchart is its own failure

The cost is real and it cuts two ways. Draw the map before you read a single transcript and you build a fifteen-state flowchart that predicts nothing, looks like rigor, and locks the agent into branches real users never take. That is a more expensive mistake than shipping a plain prompt, because you now maintain a machine that fits an imagined conversation instead of the one your users are having. The map has to come out of transcripts, and it has to stay small.

And routing is not free. Every extra call is another round trip, so a two-link chain puts a whole extra call in front of the reply and can roughly double the latency of the turn before the model has written a word. Anthropic is blunt that this is the whole tradeoff, that agentic systems trade latency and cost for better performance, and that for many applications optimizing a single call with good examples and retrieval is enough. So when a single prompt genuinely holds the character and the moves across your real sessions, keep it. Reach for routing when transcripts show the agent making the wrong move because it could not tell which state it was in, not before. The point is not to decompose everything. It is to decompose the one decision, which state am I in, that a single prompt keeps getting wrong.

One more branch the map has to carry, because it is the one you cannot afford to route wrong. The off-the-rails or in-crisis state does not call for a coaching move. It calls for a handoff, and its transition rules are stricter than the rest. When to refuse, when to answer inside the line, and when to route the user to a person is its own design problem, covered in refuse and escalate. If your agent calls tools, the trajectory it takes through them is a second thing to evaluate, the subject of evaluate tool use.

Design the path, then test each step of it

The chain of command in OpenAI's Model Spec, where a platform-level instruction outranks a developer one, which outranks a user one, is an acknowledgement that not all instructions are equal and that behavior needs a resolved structure, not one flat blob of text. A conversation needs the same. The states a user moves through, the move each one calls for, and the transitions between them are that structure for a dialogue, and they are the part a general model will improvise differently every time if you do not write them down.

So the rule is short. Read the transcripts, name the five or six states that recur, write the move and the allowed transitions for each, route to them with the smallest chain that holds, and put an eval on every state and every transition. Keep the single prompt as long as it holds and add a branch only when a real failure proves the model cannot tell which moment it is in. The prompt is the voice. The architecture is the path, and the path is the product.

Sources and further reading

  1. Building Effective Agents. Anthropic, Anthropic, 2024
  2. Improving alignment of dialogue agents via targeted human judgements (Sparrow). Glaese et al., DeepMind, 2022
  3. Introducing agent evaluation in Vertex AI Gen AI evaluation service. Google Cloud, Google, 2025
  4. Model Spec. OpenAI, OpenAI, 2025
