Conversation design & agentic · June 30, 2026 · 7 min read

Evaluate the trajectory and the reliability

Once an agent takes actions with tools, grading the final answer hides where it went wrong and how often it will.

An agent looks up a customer, cancels the wrong order, then apologizes so well that the final message reads clean. Your eval scores the message, sees a polite confirmation, and passes the run. The cancellation still happened. Grade only the last thing the agent says and you are certifying the words, not the actions, and the actions are the whole reason you gave it tools in the first place.

Worse, that run might have passed by luck. Re-run the same task with a different customer and the agent may take a different path, call a different tool, and fail. A single green check tells you it can succeed once. It does not tell you it will succeed the next time a real user asks.

The answer is the tip, the trajectory is the iceberg

The idea is small and it changes what you measure. Evaluate the trajectory, meaning the ordered sequence of tool calls and steps the agent took to reach its answer, and evaluate it across repeated runs, instead of grading only the final answer on a single pass. Two things have to be true for a tool-using agent to earn trust. It has to reach the right end state, and it has to do that reliably when you run it again.

This is the part that separates a tool-using agent from a chatbot. A briefing on measuring process adherence in a conversation covers scoring whether the system followed the method turn by turn. Tool use raises the stakes, because now a wrong step is not a worse sentence, it is a database write, a refund, a cancelled order. The trajectory is where those actions live, and the final message is the one place they can hide.

Why a passing run is not a reliable one

Two mechanisms turn a single green check into false confidence. The first is compounding step error. An agentic task is a chain, and the agent has to get every link right. Even at a high per-step success rate, the chain multiplies. Fifteen steps at ninety-five percent each land near forty-six percent end to end, because 0.95 to the fifteenth power is what actually decides the run, not the ninety-five. A final-answer check sees only the last link and cannot tell a clean chain from one that recovered from three wrong turns by luck.
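The compounding arithmetic is easy to check. The sketch below assumes every step succeeds or fails independently at the same rate, which is a simplification (real steps are correlated and agents sometimes recover), and the function name `chain_success` is made up for illustration.

```python
def chain_success(per_step: float, steps: int) -> float:
    """End-to-end success probability when every step must succeed.

    Assumes steps are independent and share one success rate.
    """
    return per_step ** steps


# How a 95% per-step rate decays as the chain gets longer.
for steps in (1, 5, 15, 30):
    print(f"{steps:>2} steps at 95% each: {chain_success(0.95, steps):.1%} end to end")
```

The fifteen-step line reproduces the roughly forty-six percent figure above, and doubling the chain to thirty steps roughly halves it again.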

The second is nondeterminism. The same prompt does not run the same way twice. Production language models are not deterministic even at temperature zero, so the agent that took the right path this morning can take a different one this afternoon on the identical task. Sierra built a metric for exactly this. Their tau-bench benchmark introduced pass^k, which measures whether an agent succeeds on all k independent attempts at the same task, not whether it succeeds once. It is a reliability number where the ordinary pass@1 is a capability number.

The number is the argument. On tau-bench's retail domain, a GPT-4o agent scored about sixty-one percent on pass@1 but dropped to roughly twenty-five percent on pass^8. Sierra translates that plainly: there is only about a twenty-five percent chance the agent resolves eight cases of the same issue with different customers. pass^k collapses fast because each task's per-run success rate is compounded across k runs, so any per-run flakiness gets raised to the k. It is not simply the headline pass@1 raised to the k, since 0.61 to the eighth power would be about two percent. Tasks the agent solves every time hold the average up while the flaky ones fall toward zero. The follow-on tau2-bench pushed the same test into a dual-control telecom setting where the user also acts, and the reliability gap held.
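A minimal sketch of how pass^k can be estimated from repeated runs. It uses the combinatorial form C(c, k) / C(n, k), where c is the number of successes out of n trials on a task, averaged over tasks. This is believed to match the estimator in the tau-bench paper, but treat that as an assumption and check the paper before quoting numbers. The function names and the sample results are invented for illustration.

```python
from math import comb


def pass_hat_k(successes: int, trials: int, k: int) -> float:
    """Chance that k runs drawn without replacement from `trials` all succeeded."""
    if not 0 < k <= trials:
        raise ValueError("k must be between 1 and the number of trials")
    # comb() returns 0 when successes < k, so the estimate is 0.0 in that case.
    return comb(successes, k) / comb(trials, k)


def benchmark_pass_hat_k(results: dict[str, list[bool]], k: int) -> float:
    """Average the per-task estimate over every task in the benchmark."""
    per_task = [pass_hat_k(sum(runs), len(runs), k) for runs in results.values()]
    return sum(per_task) / len(per_task)


# Hypothetical results: eight runs per task, True means the end state was correct.
results = {
    "cancel_order": [True] * 8,
    "issue_refund": [True] * 7 + [False],
    "change_address": [True] * 4 + [False] * 4,
}

for k in (1, 4, 8):
    print(f"pass^{k} = {benchmark_pass_hat_k(results, k):.3f}")
```

With k set to 1 this reduces to the ordinary pass@1 average. In the sample data a single failed run out of eight is enough to take a task's pass^8 to zero, which is the behaviour that makes the metric a reliability number.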

Score the path, then re-run it

Two moves, in order. First, log the full trajectory and score both the outcome and the path. Google's Vertex agent evaluation service names the trajectory metrics you can actually compute against a reference run. Trajectory exact match scores one when the tool calls match in the same order, in-order and any-order matches loosen that, and trajectory precision and recall ask how many of the agent's calls were relevant and how many required calls it made. Outcome tells you whether the end state is right. Trajectory tells you whether a passing outcome came from a sound path or a lucky one that took a forbidden step on the way.
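To make those metric definitions concrete, here is a simplified sketch that compares trajectories by tool name only. These are hand-written illustrations of the definitions described above, not the Vertex SDK, and a production implementation may also compare arguments. The tool names are hypothetical.

```python
from collections import Counter


def exact_match(pred: list[str], ref: list[str]) -> bool:
    """Same calls, same order, nothing extra."""
    return pred == ref


def in_order_match(pred: list[str], ref: list[str]) -> bool:
    """Every reference call appears in order; extra calls are allowed."""
    remaining = iter(pred)
    # `call in remaining` consumes the iterator up to the first match,
    # so this is a subsequence check.
    return all(call in remaining for call in ref)


def any_order_match(pred: list[str], ref: list[str]) -> bool:
    """Every reference call appears, in any order; extra calls are allowed."""
    return not (Counter(ref) - Counter(pred))


def precision(pred: list[str], ref: list[str]) -> float:
    """Share of the agent's calls that also appear in the reference."""
    if not pred:
        return 0.0
    wanted = set(ref)
    return sum(call in wanted for call in pred) / len(pred)


def recall(pred: list[str], ref: list[str]) -> float:
    """Share of the reference calls that the agent actually made."""
    if not ref:
        return 0.0
    made = set(pred)
    return sum(call in made for call in ref) / len(ref)


reference = ["lookup_customer", "get_order", "cancel_order"]
predicted = ["lookup_customer", "list_orders", "get_order", "cancel_order"]

print("exact:", exact_match(predicted, reference))
print("in-order:", in_order_match(predicted, reference))
print("any-order:", any_order_match(predicted, reference))
print("precision:", precision(predicted, reference))
print("recall:", recall(predicted, reference))
```

In this example the agent made one extra lookup, so exact match fails while the in-order and any-order variants pass, and precision drops to 0.75 with recall still at 1.0.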

Second, report pass^k on the tasks where a wrong action costs something real. Pick the flows that write to a system or move money, run each some number of times, and publish the fraction that pass every time, not the fraction that pass once. That number is the one a buyer should see, because it answers the question they actually have, which is whether the thing works for the next customer, not whether it worked in the demo.
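A rough sketch of that reporting loop, assuming you already have some way to run one attempt of a flow and check its end state. The `run_once` callable, the flow names, and the stub agent are all hypothetical stand-ins.

```python
from typing import Callable


def all_k_pass_rate(
    flows: list[str],
    run_once: Callable[[str, int], bool],
    k: int,
) -> float:
    """Fraction of flows that pass on every one of k attempts."""
    passed_every_time = 0
    for flow in flows:
        # all() stops at the first failed attempt, which saves eval cost
        # but means these runs cannot also be reused to compute pass@1.
        if all(run_once(flow, attempt) for attempt in range(k)):
            passed_every_time += 1
    return passed_every_time / len(flows)


def stub_run(flow: str, attempt: int) -> bool:
    """Stand-in for a real agent run plus end-state check.

    Pretends the refund flow fails on its third attempt.
    """
    return not (flow == "issue_refund" and attempt == 2)


high_risk_flows = ["cancel_order", "issue_refund"]

print("pass once:", all_k_pass_rate(high_risk_flows, stub_run, k=1))
print("pass all 8:", all_k_pass_rate(high_risk_flows, stub_run, k=8))
```

The stub passes both flows on a single attempt and only one of them across eight, which is the difference between the demo number and the one worth publishing.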

Trajectory grading is expensive and brittle

This is not free and it is not always the right tool. Exact-match trajectory checks are brittle by construction. There is often more than one correct path, so a reference-trajectory comparison flags a valid alternate route as a failure and you spend your week explaining false alarms instead of fixing agents. That is why Vertex ships the looser in-order and any-order variants, and why for open-ended tasks a rubric-based judge on the trajectory often beats an exact match, at the cost of having to validate that judge against human labels before you trust it.
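One inexpensive way to soften exact-match brittleness, offered as an assumption and not something the sources prescribe, is to accept any of several known-good paths while still rejecting any run that touches a forbidden tool. The function, paths, and tool names below are hypothetical.

```python
def grade_path(
    pred: list[str],
    accepted_paths: list[list[str]],
    forbidden: set[str],
) -> tuple[bool, str]:
    """Pass if the run matches any accepted path and makes no forbidden call."""
    hits = [call for call in pred if call in forbidden]
    if hits:
        return False, f"forbidden call: {hits[0]}"
    if any(pred == path for path in accepted_paths):
        return True, "matched an accepted path"
    return False, "no accepted path matched"


accepted = [
    ["lookup_customer", "get_order", "cancel_order"],
    ["lookup_customer", "list_orders", "get_order", "cancel_order"],
]
forbidden_tools = {"delete_customer"}

print(grade_path(["lookup_customer", "list_orders", "get_order", "cancel_order"],
                 accepted, forbidden_tools))
print(grade_path(["lookup_customer", "delete_customer"], accepted, forbidden_tools))
```

This only helps while the set of valid routes is small enough to enumerate. For open-ended tasks it does not scale, which is where a validated rubric-based judge becomes the better fit.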

Reliability testing multiplies the bill directly. pass^k means running every task k times, so a k of eight is eight times the eval cost in tokens, latency, and the labeling behind any human or model grader. Across tau-bench as a whole, even the strongest agents cleared fewer than half of tasks at pass@1, and WebArena found the best GPT-4 agent finished only 14.41 percent of its 812 realistic web tasks against a human rate of 78.24 percent, so on a hard domain most of those k runs are failures you are paying to observe. Reserve full trajectory scoring and a high k for the flows where a wrong action is expensive. For a read-only lookup that cannot damage anything, a cheap outcome check is enough.

Report the number that survives a re-run

The trend line says why this stays worth the cost. METR found that the task length frontier models can handle at fifty percent reliability has been doubling roughly every seven months, but their eighty percent horizon is about five times shorter than their fifty percent one. The same agent that clears an hour-long task half the time clears only a task of roughly twelve minutes four times out of five. Capability is climbing and reliability trails it, which is precisely the gap a single passing run hides.

So the rule is short. For anything that takes an action, score the trajectory, not just the answer, and quote pass^k, not pass@1, on the tasks where a wrong step costs a customer. A demo proves the agent can. Only a re-run proves it will.

Sources and further reading

tau-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains. Yao, Shinn, Razavi, Narasimhan, Sierra, ICLR 2025
tau-bench: Benchmarking AI agents for the real-world. Sierra, 2024
tau2-Bench: Evaluating Conversational Agents in a Dual-Control Environment. Barres, Dong, Ray, Si, Narasimhan, Sierra, 2025
WebArena: A Realistic Web Environment for Building Autonomous Agents. Zhou et al., CMU, ICLR 2024
Introducing agent evaluation in Vertex AI Gen AI evaluation service. Sigler and Nardini, Google Cloud, 2025
Measuring AI Ability to Complete Long Tasks. Kwa, West et al., METR, 2025

