Evals & quality·July 1, 2026·8 min read

Close the loop from production to eval

The product that improves every week is the one where each real-world failure becomes a permanent test the next release has to pass.

A parenting coach ships, and the transcripts look fine. Then a support ticket comes in. A user asked to check in again in two weeks, and the coach booked the follow-up for the wrong day, twice. You read the one thread, patch the prompt, and move on. A month later the same class of failure surfaces in a different skin, this time on 'end of next week.' You never learned how often it was happening, so you never fixed the category, only the instance in front of you. The loop between what your product does in production and what your evals check was never closed.

The loop is the product, not the eval

Here is the whole idea in one sentence. A guidance product gets measurably better every week when each recurring failure you find in real use becomes a new eval and a behavior fix, so the same mistake cannot ship twice. The eval suite is the asset that compounds, but the thing that feeds it is a loop, and the loop has a shape. Production telemetry, then error analysis, then a new eval, then a gate. Most teams have the first piece and the last piece. The two in the middle are where the improvement actually comes from, and they are the two everyone skips.

Sibling briefings own the ends of this loop. One covers what to measure once you decide to instrument a turn. Another covers gating a change on the full suite before it ships. This one owns the span between them, the part where a raw stream of production traces becomes the next eval you gate on.

Error analysis is reading, not dashboards

Error analysis is the practice of reading a sample of real production traces by hand, writing an open-ended note on each thing that went wrong, then grouping those notes into named failure categories and counting how often each one fires. A trace is the full record of one interaction: the user turns, the model turns, any tool calls, and the final output. The word 'analysis' is doing real work here. This is not a metrics dashboard and it is not an average. It is a person sitting with the transcripts.
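As a rough sketch of what one annotated trace might look like in code, assuming a simple in-memory record: the class names, field names, and the example note are all hypothetical, not a schema from any of the sources.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Turn:
    role: str      # "user", "assistant", or "tool"
    content: str


@dataclass
class Trace:
    trace_id: str
    turns: List[Turn]
    final_output: str
    note: str = ""                  # open-ended reviewer note, written by hand
    category: Optional[str] = None  # named failure mode, assigned when notes are grouped


# One reviewed trace: the note comes first, the category comes later.
trace = Trace(
    trace_id="t-001",
    turns=[
        Turn("user", "Can we check in again in two weeks?"),
        Turn("assistant", "Booked your follow-up."),
    ],
    final_output="Booked your follow-up.",
)
trace.note = "Follow-up landed on the wrong day; relative date resolved incorrectly."
trace.category = "date_handling"
```

The note is free text on purpose. The category field stays empty until enough notes exist to group them.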

Hamel Husain, who has run this process across a long list of AI products, is blunt that this is the highest-return activity in the whole discipline, and that nothing replaces looking at your data. His worked example is the sharpest argument for it. At Nurture Boss, an assistant for the apartment industry, the team built a simple viewer with a notes field next to each conversation and annotated dozens of them by hand. The pattern that fell out was specific and countable. The assistant mishandled dates, failing 66 percent of the time when a user said something like 'let's schedule a tour two weeks from now.' No generic metric would have surfaced that. It only shows up when a human reads the turn and writes down what broke.

Why does reading beat guessing? Because the failure that matters is almost never the one you would have predicted from the prompt, and sitting with the model's own outputs is what shifts your sense of what 'good' even means. The authors of What We Learned From a Year of Building with LLMs, six practitioners writing together, put it as a daily habit. Read a sample of production logs every day, and the moment you spot a new failure, write an assertion or an eval around it. The reading is where the next eval comes from.

Read the first thing that broke, and count the categories

The mechanism has two parts that are easy to get wrong. First, when a trace fails, find the most upstream error, the very first thing that went wrong, not the last. Downstream mess is usually a symptom of one earlier miss, and if you log the symptom you write an eval for the wrong thing. In the date example, the visible failure is a wrong calendar entry, but the upstream error is the model resolving a relative date against the wrong anchor. Fix and test the anchor, not the calendar.
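To make 'the wrong anchor' concrete, here is a minimal sketch. It assumes the relative phrase has already been parsed into a number of weeks, and the function name and both dates are hypothetical. The arithmetic is identical in both calls. Only the anchor differs, which is why the anchor is the thing to test.

```python
from datetime import date, timedelta


def resolve_weeks_from_now(weeks: int, anchor: date) -> date:
    """Resolve 'N weeks from now' against an explicit anchor date."""
    return anchor + timedelta(weeks=weeks)


sent_on = date(2026, 7, 1)         # hypothetical: the day the user sent the message
stale_anchor = date(2026, 6, 24)   # hypothetical: a stale 'today' used by mistake

correct = resolve_weeks_from_now(2, anchor=sent_on)       # 2026-07-15
wrong = resolve_weeks_from_now(2, anchor=stale_anchor)    # 2026-07-08
```

The wrong calendar entry is a week off, but nothing in the calendar step is broken. An eval that checks only the calendar write would miss the cause.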

Second, you code the notes into a taxonomy, a flat list of named failure modes, and then you count. Hamel's stopping rule is concrete enough to schedule. Review at least 100 traces to start, and keep going until you hit saturation, the point where roughly 20 more traces turn up no new category. Saturation is the signal that you have seen the shape of the failures, not every instance of them. The count is what makes this different from reading transcripts for a vibe. 'Date handling, 31 of 100' is a claim you can act on and a target you can move. 'The model seems shaky on dates' is not.
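The count and the stopping rule can be sketched in a few lines. This assumes each reviewed trace has been given one category label, in review order, with None for traces where nothing went wrong. The function names and the exact reading of 'roughly 20 more traces' as a fixed window are assumptions for illustration.

```python
from collections import Counter


def count_categories(labels):
    """Return (category, count) pairs, most frequent first, ignoring clean traces."""
    return Counter(c for c in labels if c is not None).most_common()


def reached_saturation(labels, window=20, minimum=100):
    """True once `minimum` traces are reviewed and the last `window` added no new category."""
    if len(labels) < max(minimum, window + 1):
        return False
    seen_before = {c for c in labels[:-window] if c is not None}
    recent = {c for c in labels[-window:] if c is not None}
    return recent <= seen_before


# 100 reviewed traces: 31 date failures, the rest clean.
labels = ["date_handling"] * 31 + [None] * 69
print(count_categories(labels))      # [('date_handling', 31)]
print(reached_saturation(labels))    # True
```

If the next trace reviewed turns up a category not seen before, the check goes back to False and the reading continues.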

The caught failure becomes a permanent test

The turn from a counted failure into a new eval is the hinge of the whole loop, and it is what makes the suite compound instead of just grow. Once 'date handling' is a named category with 31 hits, you write a handful of golden traces that specify the right answer for that exact situation, and you fold them into the suite you gate every release on. Now the failure is not a memory in someone's head or a line in a patch note. It is a check the next change has to pass, on every model swap and every prompt edit, forever. The first eval is just a test. The one you wrote from a production failure is a guarantee that the same mistake cannot ship again.
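A handful of golden cases for the date category might look like the sketch below. The case format, the runner, and both resolvers are hypothetical. In a real suite the function under test would call the product itself, and the toy resolvers here only stand in for it so the example runs.

```python
from datetime import date, timedelta

# Each golden case: (user phrase, date the message was sent, expected resolved date).
GOLDEN_DATE_CASES = [
    ("two weeks from now", date(2026, 7, 1), date(2026, 7, 15)),
    ("in one week", date(2026, 12, 28), date(2027, 1, 4)),
]

WEEKS = {"two weeks from now": 2, "in one week": 1}


def run_golden(resolve, cases):
    """Run every golden case and return the ones that failed."""
    failures = []
    for phrase, sent_on, expected in cases:
        actual = resolve(phrase, sent_on)
        if actual != expected:
            failures.append((phrase, sent_on, expected, actual))
    return failures


def good_resolve(phrase, sent_on):
    # Stand-in for the product: anchors on the day the message was sent.
    return sent_on + timedelta(weeks=WEEKS[phrase])


def stale_resolve(phrase, sent_on):
    # Stand-in for the bug: ignores sent_on and anchors on a fixed, stale date.
    return date(2026, 1, 1) + timedelta(weeks=WEEKS[phrase])


print(len(run_golden(good_resolve, GOLDEN_DATE_CASES)))   # 0
print(len(run_golden(stale_resolve, GOLDEN_DATE_CASES)))  # 2
```

The cases pin the sent date explicitly, so the check targets the anchor and not whatever day the suite happens to run.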

This is not a new discovery. Long before language models, the ML Test Score rubric from Google, 28 concrete tests drawn from production ML systems, treated monitoring for new failure modes and turning them into tests as a core measure of whether a system was production-ready at all. Earlier work from Google on hidden technical debt, which shares an author with the rubric, named hidden feedback loops as a class of debt, where a system's own behavior quietly shapes the data it later learns from. A closed production-to-eval loop is the deliberate version of that. You make the feedback loop explicit and own it, instead of letting it run in the dark.

OpenAI describes the same shape as a flywheel. Its evaluation flywheel guide walks the exact cycle: integrate your graders into CI so every change is gated, then monitor production data to discover the new, subtler failure modes that surface only once the obvious ones are fixed, and feed those back in as fresh evals. The tell that the loop is working is that the failures get harder to find over time. When error analysis on a fresh sample stops turning up categories you have not already gated, you have caught up to your product, and the next sample is where the frontier moved to.
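The gating step can be as small as the sketch below. It is not taken from the OpenAI guide. The function name, the result format, and the all-or-nothing threshold are assumptions, and a real suite may tolerate a pass rate below 100 percent for graded evals.

```python
def gate(results):
    """results maps eval name -> (passed, total). Return 0 only if every eval passes fully."""
    failing = {name: pt for name, pt in results.items() if pt[0] < pt[1]}
    for name, (passed, total) in sorted(failing.items()):
        print(f"FAIL {name}: {passed}/{total}")
    return 1 if failing else 0


# Hypothetical results from one run of the suite.
results = {"date_handling": (11, 12), "tone": (8, 8)}
exit_code = gate(results)   # prints "FAIL date_handling: 11/12" and returns 1
# In CI, a script would end with sys.exit(exit_code) so a nonzero code blocks the release.
```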

The loop costs reviewer time, and the bar keeps moving

None of this is free, and the cost lands in one scarce place. Reading 100 traces well, coding them, and writing golden examples is expensive human attention, usually your best domain expert's, the person who can tell a real failure from a merely surprising answer. That labeling cost is the reason teams reach for a generic metric instead, and it is exactly the wrong economy, because the generic metric is what missed the 66 percent date failure in the first place. Budget the reviewer time as a standing cost of the product, not a one-time setup.

There is a subtler cost, and it is the one that quietly breaks a suite. As you read more traces your standard for good sharpens, so the bar you are grading against moves under you. That is criteria drift, documented by Shreya Shankar and colleagues at Berkeley, who found that people cannot fully specify their criteria until they have graded real outputs, and that the act of grading keeps changing the criteria. In a closed loop this cuts two ways. It is the engine, because a rising bar is what lets you keep finding new failures. It is also a hazard, because an eval you wrote three months ago may now encode a standard you have outgrown, passing traces you would fail today. Re-audit old evals as the bar moves, or the suite slowly certifies a version of good you no longer believe in.
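One cheap way to keep the re-audit from being forgotten is to record when each eval was last reviewed by a person and flag the stale ones. The sketch below assumes that record exists. The function name, the dates, and the 90-day threshold are illustrative choices, not a recommendation from the cited work.

```python
from datetime import date, timedelta


def due_for_reaudit(last_reviewed, today, max_age_days=90):
    """Return names of evals whose last human review is older than max_age_days."""
    cutoff = today - timedelta(days=max_age_days)
    return sorted(name for name, reviewed in last_reviewed.items() if reviewed < cutoff)


# Hypothetical review log: eval name -> date a person last checked its criteria.
last_reviewed = {
    "date_handling": date(2026, 3, 15),
    "tone": date(2026, 6, 10),
}
stale = due_for_reaudit(last_reviewed, today=date(2026, 7, 1))   # ['date_handling']
```

A flag only says the eval is old. Deciding whether it still encodes the current bar is a reading task, the same as the rest of the loop.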

There is a case for not reaching for the full loop yet. If you have not shipped, you have no production traces to read, so error analysis has nothing to chew on. Start from a small set of golden examples and write the behavior down first. The loop is what you run once real users are generating the failures that a synthetic test set would never have thought to include.

Run the loop on a schedule, not on a fire

The rule is short. Every week, sample real production traces, read them by hand, code and count the failure modes, turn the top category into a new eval, and gate the next release on it. The failure you find in production becomes the test that stops it from shipping twice, and the suite tightens a notch each pass. Teams that run this loop on a schedule watch their failures get rarer and stranger. Teams that only run it when a ticket lands keep fixing the same instance in new clothing.
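The sampling step is worth making boring and repeatable. A minimal sketch, assuming the week's trace IDs are already available as a list: the function name, the ID format, and the default of 100 are hypothetical, and uniform random sampling is only one reasonable choice.

```python
import random


def weekly_sample(trace_ids, k=100, seed=None):
    """Draw up to k trace IDs uniformly at random, without replacement."""
    ids = list(trace_ids)
    rng = random.Random(seed)   # a fixed seed makes the week's sample reproducible
    return rng.sample(ids, min(k, len(ids)))


# Hypothetical week of production traffic.
all_ids = [f"t-{n:05d}" for n in range(5000)]
sample = weekly_sample(all_ids, k=100, seed=27)
```

Recording the seed alongside the week means a second reviewer can pull the same 100 traces and compare notes.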

The open question worth sitting with is when to trust a model to do the reading. Error analysis is bottlenecked on human attention, so the pressure to let an LLM cluster the traces and count the categories is real, and partly sound. The catch is that the judge doing the clustering has its own error rate and its own blind spots, and the failures it cannot see are exactly the ones your product most needs a person to catch. For now the durable move is to read the sample yourself, let a model help you organize what you found, and keep a human on the turns where trust is won or lost.

Sources and further reading

  1. A Field Guide to Rapidly Improving AI Products. Hamel Husain, 2025
  2. Why is error analysis so important in LLM evals, and how is it performed? Hamel Husain, Evals FAQ, 2025
  3. Your AI Product Needs Evals. Hamel Husain, 2024
  4. What We Learned From a Year of Building with LLMs. Yan, Bischof, Frye, Husain, Liu, Shankar, 2024
  5. Building Resilient Prompts Using an Evaluation Flywheel. OpenAI, OpenAI Cookbook, 2025
  6. The ML Test Score: A Rubric for ML Production Readiness and Technical Debt Reduction. Breck, Cai, Nielsen, Salib, Sculley, Google, IEEE Big Data 2017
  7. Hidden Technical Debt in Machine Learning Systems. Sculley et al., Google, NeurIPS 2015
  8. Who Validates the Validators? Aligning LLM-Assisted Evaluation with Human Preferences. Shankar et al., UC Berkeley, UIST 2024
