A support agent tells a user their prescription is safe to double up on. A coaching bot congratulates someone for cutting off their whole family. A learning tutor confidently teaches a wrong proof. None of these turns will show up as an error in your logs, because the model was fluent and the user did not complain. They show up when a person reads the transcript and says, "That answer was wrong, and here is why." That person is your expert, and in a guidance product they are doing two jobs at once that most teams treat as one. They are the quality bar, and they are the compliance control.
Both jobs are usually underbuilt. The quality job is catching the wrong answer that reads fine. The compliance job is the human oversight that regulators now require, and it is worth being precise about what that phrase means before you build to it.
Human oversight is a regulatory term, not a vibe
When the EU AI Act says human oversight, it means something specific. Under Article 14, a high-risk AI system has to be built so that a named natural person can understand its limits, monitor it in use, and above all decide not to use its output in a given situation. The article spells out the powers that person must have. They can disregard, override, or reverse an output, and they can interrupt the system through a stop button or an equivalent that brings it to a safe halt. The commentary on the article is blunt that this override has to be exercisable by the designated person alone, without asking permission up the chain for each call.
So human oversight in the regulation is not a review meeting. It is a specific person with the authority to stop a specific output from reaching a user, and the design that makes stopping possible. That is the same authority a good quality process needs. The expert who reads transcripts and says "this answer is wrong" is only useful if they can also say "and therefore this release does not ship." Fuse the two roles on purpose, because the regulation is describing the quality function you should have built anyway.
One accountable expert beats a committee
The first instinct is to spread review across a panel. More eyes, more coverage, less bias. For grading conversations it backfires. The problem is agreement. Surveying what makes human labels reliable, Lilian Weng points to a safety labeling study where rater agreement runs from 0.96 on violence and gore down to 0.25 on personal topics. Raters line up on the clear cases and split on the personal ones, and the personal ones are exactly the trust moments a guidance product lives or dies on. A committee grading those turns does not average into truth. It averages into mush, and every conflict has to be relitigated.
Hamel Husain, who runs error analysis on real products for a living, recommends the opposite of a panel. Appoint one domain expert, a benevolent dictator, as the definitive voice on quality. For a mental health product that is a clinician, for legal document review a lawyer, for a coaching product the practitioner whose method the product encodes. A single owner eliminates annotation conflicts and the paralysis of too many cooks. This is not about intelligence; it is about consistency. The same person applying the same judgment across a hundred transcripts produces a signal you can track over time. Five people applying five slightly different standards produce noise you cannot.
Consistency is the whole asset. If the bar drifts because a different reviewer graded this week's transcripts, you cannot tell a real regression from a change of grader. One accountable expert makes the bar a fixed thing you can regress against, and gives the compliance side a single name when a regulator asks who is responsible for oversight.
Sample by risk, not at random
Expert time is the scarce resource in this whole system, so the expert cannot read everything, and random sampling is the wrong way to choose. A random sample is dominated by easy turns, because a guidance product spends most of its life on easy turns. You burn the expert's attention confirming that the model handled small talk while the one boundary turn that mattered sits unread in the other 99 percent.
Prioritize the queue by risk. Pull the turns where trust is won or lost first: the boundary request, the moment of doubt, the hard disclosure, the turn where the model gave advice it was not qualified to give. These are the same turns the "what to measure" briefing argues you instrument on their own dashboard, because an average hides them and you have to go find them. Route the transcripts your risk signals flag, plus a thin random stratum so you still catch failures your signals do not yet know to look for.
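A minimal sketch of that routing, assuming each transcript already carries a numeric risk score from your own signals. The `Transcript` type, the `risk_score` field, and the ten percent `random_share` are illustrative assumptions, not values from any cited source.

```python
import random
from dataclasses import dataclass


@dataclass
class Transcript:
    transcript_id: str
    risk_score: float  # hypothetical 0..1 score produced by your own risk signals


def build_review_queue(transcripts, capacity, random_share=0.1, seed=0):
    """Return up to `capacity` transcripts: highest risk first, then a thin random stratum."""
    capacity = min(capacity, len(transcripts))
    n_random = int(capacity * random_share)
    n_risk = capacity - n_random

    by_risk = sorted(transcripts, key=lambda t: t.risk_score, reverse=True)
    flagged = by_risk[:n_risk]

    # Draw the random stratum only from what the risk signals did not pick,
    # so it can surface failures the signals do not yet know to look for.
    remainder = by_risk[n_risk:]
    rng = random.Random(seed)
    stratum = rng.sample(remainder, min(n_random, len(remainder)))

    return flagged + stratum
```

The fixed seed keeps the random stratum reproducible for a given batch, which helps when someone later asks why a particular transcript was or was not reviewed. How large the stratum should be is a judgment call that depends on how much you trust your signals.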
When the expert reads a transcript, capture the verdict as structure, not a note in a doc: a pass or fail, the specific rule that broke, and a sentence of why. Hamel's method for this is adapted from qualitative research. Read traces, write open-ended notes, then group the notes into a failure taxonomy, and keep going until new transcripts stop surfacing new failure modes, roughly a hundred traces or more before the categories stabilize. The structured verdict is what makes the next move possible.
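One way to hold a verdict as structure is a small record with exactly those three fields. This is a sketch under assumptions: the `Verdict` type, its field names, and the `failure_taxonomy` helper are hypothetical, and the rule labels would come from your own taxonomy.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Verdict:
    transcript_id: str
    passed: bool
    broken_rule: Optional[str]  # taxonomy label; None when the transcript passes
    reason: str                 # one sentence of why

    def __post_init__(self):
        if self.passed and self.broken_rule is not None:
            raise ValueError("a passing verdict should not name a broken rule")
        if not self.passed and not self.broken_rule:
            raise ValueError("a failing verdict must name the rule that broke")


def failure_taxonomy(verdicts):
    """Count failures per rule, most common first."""
    return Counter(v.broken_rule for v in verdicts if not v.passed).most_common()
```

Refusing a fail without a named rule is the design choice that matters here. It forces each failure into the taxonomy at the moment of grading, while the expert still remembers why the answer was wrong.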
Oversight is where new eval criteria come from
Here is the part that turns review from a cost into a compounding asset. The reason you cannot just write the eval rubric once and automate the expert away is that you do not know the full rubric yet. You discover it by grading. Shreya Shankar and colleagues at Berkeley named this criteria drift. In their study of people building LLM graders, users needed criteria to grade outputs, but grading the outputs was what taught them the criteria. Some standards only became visible once the expert saw a specific model output and reacted to it. The criteria are not fully definable in advance, so the grader cannot be calibrated once and left alone.
That is not a bug in the process. It is the mechanism that makes expert review pay off. Every transcript the expert fails for a reason the rubric did not yet name is a new eval criterion. The clinician flags that the model kept coaching a parent who described a feverish infant instead of saying call a pediatrician. That judgment becomes a checkable rule, which becomes a row in the golden set of graded examples the whole suite runs against. The next model change gets tested against it automatically. Human oversight is the source of the tests, and the tests are what let the human stop reading every turn.
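As a rough illustration, the feverish-infant verdict could become a golden example like the one below. Everything named here is an assumption for the sketch: `GoldenExample`, `run_golden_set`, the rule label, the sample user turn, and especially the keyword check, which is a deliberately crude stand-in for whatever grader you actually use.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class GoldenExample:
    rule: str                     # the criterion the expert named
    user_turn: str                # the input that exposed the failure
    check: Callable[[str], bool]  # returns True when a reply satisfies the rule


def run_golden_set(golden_set, generate_reply):
    """Return the rules a candidate model breaks; an empty list means the gate passes."""
    return [ex.rule for ex in golden_set if not ex.check(generate_reply(ex.user_turn))]


# Hypothetical rule derived from the expert's verdict on the feverish-infant turn.
golden_set = [
    GoldenExample(
        rule="escalate_infant_fever",
        user_turn="My two-month-old has a fever and I can't get her to sleep.",
        check=lambda reply: "pediatrician" in reply.lower(),
    )
]
```

Here `generate_reply` is any callable that takes a user turn and returns the candidate model's reply, so the same golden set can be run against each model change before release.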
This is the loop that connects review to the eval suite, which the "method to product behavior" briefing describes as the product's real spec. Each expert judgment feeds the golden set, and each new golden example tightens the automated gate. The expert moves from grading everything to grading the new and the ambiguous, while the machine handles the settled cases on every release, the way the "catch the regression" briefing gates each change on the full suite. Review does not scale to every turn. It does not have to, because its output is the thing that scales.
Oversight theater is the failure mode
The tradeoff is that human oversight is easy to fake and hard to make real, and a faked version is worse than none, because it satisfies the audit while catching nothing. The AI Act anticipates this. Article 14 requires the overseer to stay aware of automation bias, the human tendency to over-rely on an automated output and defer to it even when they should not. A reviewer who clicks approve on a fluent answer because it reads confident is oversight on paper and rubber stamping in practice.
Two legal scholars push this harder than the regulation does. In "Automation Bias in the AI Act," Johann Laux and Hannah Ruschemeier argue that mandating awareness of automation bias does not fix it, and that the design and context of the interface are what drive over-reliance. The honest test of whether oversight works is empirical: you have to measure whether the human actually catches and overrides bad outputs. Awareness is not a control. A reviewer with no real authority, no time, and a UI that nudges toward approval will defer, and the box gets checked anyway.
So the cost to name is that real oversight is expensive in the one resource you cannot fake: senior expert attention on the turns that matter, backed by the authority to actually stop a release. An expert who can flag a failure but cannot block the ship is theater. One who can block it but only ever sees a random sample of easy turns is theater with better production values. Neither catches the doubled prescription. The "governed environments" briefing treats human oversight as a control you design in from the first sprint, and this is why. Bolted on at security review, it becomes the stamp. Designed in, it is the person who can read the risky turn and stop the wrong answer before a user acts on it.
The rule to carry
Put one accountable expert on a queue sorted by risk, give them the standing authority to stop a release, and turn each verdict they render into an eval the machine runs from then on. That single design satisfies the quality job and the compliance job with the same work, because they were never two jobs. NIST files the same idea under a named risk, human-AI configuration, and its AI Risk Management Framework puts govern and measure next to each other on purpose. You govern by naming who is accountable, and you measure by counting whether they catch the bad output.
The open question worth sitting with is the one Laux and Ruschemeier raise. How do you prove the oversight is real and not theater? The answer is probably the same discipline you already apply to the model, which is to measure it. Track how often the expert overrides, on which kinds of turns, and whether the failures they catch are showing up in the golden set and then disappearing from production. Oversight that works leaves that trail. Oversight that is a stamp leaves an empty log and a clean audit.
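A sketch of what that trail could look like in code, assuming you log one event per reviewed turn. The `ReviewEvent` fields and the two helper functions are hypothetical, and what counts as a healthy override rate is something only your own data can tell you.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class ReviewEvent:
    turn_kind: str       # e.g. "boundary_request", "small_talk"
    overridden: bool     # the expert rejected or reversed the model's output
    in_golden_set: bool  # the caught failure was added to the golden set


def override_rates(events):
    """Override rate per turn kind, as a fraction of reviewed turns of that kind."""
    reviewed = defaultdict(int)
    overridden = defaultdict(int)
    for e in events:
        reviewed[e.turn_kind] += 1
        overridden[e.turn_kind] += e.overridden
    return {kind: overridden[kind] / reviewed[kind] for kind in reviewed}


def golden_set_coverage(events):
    """Share of overrides that became golden examples; None if there were no overrides."""
    overrides = [e for e in events if e.overridden]
    if not overrides:
        return None
    return sum(e.in_golden_set for e in overrides) / len(overrides)
```

An override rate that sits at zero across every turn kind, including the risky ones, is the empty log described above and deserves a closer look at the reviewer's interface and authority.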