Skip to content
← Insights
Conversation design & agentic·June 30, 2026·8 min read

When a conversation is not enough

Sometimes a guidance agent has to act, not only advise. Giving it real tools turns talk into an outcome, and every tool is a new door in.

A user tells the coach they finally want to book the follow-up they have been putting off for a month. The coach writes a warm, correct paragraph about how to book it. Then it stops. The user has to leave the conversation, open another tab, find the scheduler, and do the thing the coach just talked them into. Most of them do not. The advice was good and the outcome never happened, because the product could talk about the world but could not touch it. The gap between a right answer and a done thing is where a lot of guidance products quietly leak their value.

Talk is cheap when the outcome lives in another system

The fix is to give the model tools. Tool use, also called function calling, is the mechanism where you hand the model a small menu of actions it can take (look up the user's plan, check open appointment slots, write the booking, update the goal) and the model decides when to call one and with what arguments. The model does not run the code. It emits a structured request, your system executes it, and the result comes back into the conversation. OpenAI's function calling docs are blunt about the split. The model can decide it needs a tool and produce the call, but the API will not execute anything, so the developer runs the function and returns the result. The model talks; your code acts.

One line of vocabulary before we go further, because the two words get used loosely and the difference is the whole piece. A workflow is a system where you write the steps and the model fills in the blanks at fixed points you chose. An agent is a system where the model itself decides the steps, calling tools in a loop until it judges the task done. Anthropic draws exactly this line in Building Effective Agents, and its one-sentence definition of an agent is worth keeping. Agents are LLMs using tools based on environmental feedback in a loop. The loop is the part that matters, and it is what separates acting from advising.

The loop is reason, act, read, repeat

Here is the mechanism, walked all the way through, because the loop is the insight and hand-waving it would defeat the purpose. The user says book my follow-up. The model reasons about what it needs, then acts by calling check_slots for next week. Your system runs that query against the real calendar and hands back three open times. The model reads that result, reasons again (the user said mornings, one of these is a morning), and acts by calling book_appointment with that slot and the user's ID. Your system writes the booking and returns a confirmation number. The model reads the confirmation and closes the conversation with a specific, true sentence. You are set for Tuesday at 9. Reason, act, read, repeat, until the model decides the task is finished.

The interleaving is not decoration. It is what keeps the agent honest about the actual state of your system instead of a plausible story about it. This is the finding behind ReAct (Yao et al., ICLR 2023), the paper that named the reason-then-act pattern. When the researchers let a model interleave reasoning steps with real lookups against a Wikipedia API, it overcame the hallucination and error propagation that pure chain-of-thought reasoning falls into, because every few steps it was reading a real result instead of continuing to reason from its own guess. On the interactive benchmarks ALFWorld and WebShop, ReAct beat imitation and reinforcement learning baselines by 34 and 10 absolute points of success rate, prompted with only one or two examples. A model that only reasons drifts. A model that reads the environment between reasoning steps stays anchored to what is actually true.

That is also why an agent is more than a longer conversation. The tool result is ground truth the model did not invent, and Anthropic's guidance stresses exactly this, that the agent should gain ground truth from the environment at each step, a tool result or a code execution, to check its own progress. Take the loop away and you are back to a coach describing a booking it cannot make.

Start with the simplest thing that works

The temptation, once tools work at all, is to hand the model a wide-open loop and let it figure everything out. Resist it. The discipline that holds up is the one Anthropic states plainly in Building Effective Agents, find the simplest solution that works and add complexity only when the simpler thing falls short. Most of what looks like it needs an agent is a workflow with a couple of tool calls at known points. If the path is predictable, hardcode the path and let the model do the language, not the routing.

Add autonomy only where it pays, and only one tool at a time. Each tool is a new capability and a new failure surface, so treat it like shipping a feature, not editing a prompt. Constrain what it can reach before you widen it. A book_appointment tool should take the current user's ID from the session, not accept an arbitrary one from the model, so a wrong argument cannot book on someone else's account. Anthropic's guide to writing tools for agents makes tool design the place to spend the effort, with clear names, tight scopes, and results shaped so the model can actually use them. A read-only lookup is a safe first tool. A tool that writes to a record, moves money, or sends a message to a third party is the one that earns a slow rollout and a tight permission boundary.

Whatever you give it, you have to measure whether it used the tool correctly, called the right one, with the right arguments, at the right moment, which is a different test from whether the final sentence read well. That is its own discipline, covered in the sibling briefing on evaluating tool use. And the conversation around the tools, when to ask a clarifying question before acting, when to confirm before a write, is the flow work in designing the conversation. This piece is about the jump from talking to acting. Those two are how you make the jump safe and how you prove it landed.

Autonomy compounds error, and a tool is a new door

Every step an agent takes is a step it can take wrong, and the errors multiply along the chain. The clearest measurement of this comes from tau-bench (Yao, Shinn, Razavi, and Narasimhan, ICLR 2025), Sierra's benchmark that puts a tool-using agent in a real customer-service loop with a simulated user and domain rules, then checks the actual database state at the end against the intended one. Even a strong function-calling model like gpt-4o succeeded on fewer than half the tasks. Worse, it was unreliable. The paper's pass^8 metric asks whether the agent solves the same task on all eight independent tries, and in the retail domain that number fell below 25 percent. Run the same booking for eight different users and the agent gets all eight right less than a quarter of the time. A single good demo tells you almost nothing about that.

The cost is real in three currencies. Latency, because every tool call is a round trip and a multi-step loop stacks them. Money, because each step is another model call on a growing context. And a widened attack surface, because a tool is a new door into your system and the model deciding when to open it is a new thing an attacker can try to talk into opening the wrong one. That last risk has its own name and its own briefing, on prompt injection, where instructions hidden in a tool result or a user message try to hijack what the agent does next. A pure chat product cannot leak a record or send a message it should not. An agent with a write tool can.

So the honest answer is sometimes not to build an agent at all. Anthropic says it in the same breath as the rest of the guidance, that the right build might mean not building an agentic system, because agents trade latency and cost for flexibility you may not need. If the task is predictable, a workflow is faster, cheaper, and easier to keep inside the lines. Reach for the loop when you genuinely cannot script the path and you can still check the result, and not before.

Give it hands only where the outcome is worth the risk

The jump from chat to action is a real upgrade to a guidance product. The coach that can book the appointment, update the plan, or pull the user's actual history closes the gap between advice and outcome that talk alone leaves open. The mechanism is a loop where the model reasons, calls a tool, reads a real result, and continues, and the loop is what keeps it anchored to your system instead of a story about it.

The rule to carry is small. Start with the simplest workflow that works, add a tool only where the outcome pays for the cost and the risk, scope each tool tight, and measure reliability across many runs before you trust it with anything that writes. Give the model hands where the outcome is worth it, and keep them tied where it is not. The open question the field has not settled is where that line sits as models get more reliable in the loop. For now, the safe default is that a right answer the user still has to act on is a smaller failure than a wrong action the product took on its own.

Sources and further reading

  1. Building Effective Agents. Anthropic, Anthropic, 2024
  2. ReAct: Synergizing Reasoning and Acting in Language Models. Yao et al., ICLR, 2023
  3. tau-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains. Yao, Shinn, Razavi, Narasimhan, ICLR, 2025
  4. Function calling. OpenAI, current
  5. Writing effective tools for AI agents. Anthropic, Anthropic, 2025

Work with Hunter Green

Bring us the hardest moment in your product.

We build the evals that define a good answer and the loops that keep a conversational product improving. Tell us where yours is hard to measure and we will map what it takes.