Evals & quality · July 1, 2026 · 7 min read

Trust the judge only after you check it against people

The model grading your answers has measurable biases toward the first option, the longer answer, and its own writing. Validate it before you trust it.

You grade thousands of open-ended coaching turns a week, so a person cannot read them all. You wire up a second model to score each one against your rubric and you watch the number. The number looks great. Then a release that quietly got wordier sails through, because the grader liked the longer answers, not the better ones. The dashboard was green the whole time. The judge was wrong the whole time, and nothing in the pipeline was built to notice.

The thing scoring your product is itself a model, with its own failure modes, and most teams never check it. This briefing is about the check.

The only grader that scales is also a model

Once your answers are open-ended, an exact-match check is useless and a human cannot keep up, so you grade with another model. That pattern has a name. LLM-as-judge means using one language model to score the outputs of another against a rubric, in place of a person. It is the only grading method that scales to every turn, and the leading evals work leans on it. When Lianmin Zheng and colleagues built MT-Bench to rank chat assistants, they found that a strong judge model agreed with human experts more than 80 percent of the time, about the same rate at which two humans agreed with each other, in a study that drew on 58 expert labelers. That number is why the practice took off.

The idea in one line. A judge is only a measurement instrument, and an instrument you have not calibrated reports confident numbers that mean nothing. The 80 percent is the ceiling under lab conditions, not what you get for free on your data. Before you trust the judge, you check it against people, the same way you would check any grader you did not write.

The judge has a thumb on the scale, three ways

The biases are not vibes. They are measured, they are consistent, and they are the reason an unchecked judge produces confident nonsense. The MT-Bench paper named three, and each one has since been reproduced on its own.

Position bias is the judge preferring an answer because of where it sits, not what it says. Zheng's team put the same two answers in front of GPT-4 in both orders, and it gave the same verdict only a little more than 60 percent of the time. Weaker judges did worse. In one traced case GPT-4 called GPT-3.5's answer more detailed and superior when it came first, then flipped and preferred Vicuna's answer once the order was swapped, on identical content. If your eval always lists the candidate before the baseline, you are measuring the slot, not the answer.

Verbosity bias is the judge rewarding length for its own sake. To show it, the researchers took 23 answers that contained a numbered list, asked GPT-4 to rephrase the list while adding no new information, and inserted the rephrased copy into the original, so the answer got longer without getting better. Then they checked whether a judge preferred the padded version. The weaker judges fell for it most of the time, and GPT-4 resisted far better but was not immune. A separate study by Keita Saito and colleagues confirmed the direction, finding that GPT-4 prefers longer answers more than humans do in preference labeling. This is the bias that lets a wordier release pass while the product gets no better.
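One way to look for this in your own data is to take pairs your human labelers rated as equal in quality and count how often the judge picks the longer answer. A minimal sketch, assuming each record holds the two answers and the judge's pick. The `JudgedPair` type, its field names, and the `longer_win_rate` helper are hypothetical, and word count is only a rough proxy for length.

```python
from dataclasses import dataclass


@dataclass
class JudgedPair:
    # Hypothetical record shape; adapt to however your harness stores verdicts.
    answer_a: str
    answer_b: str
    judge_pick: str  # "a" or "b"


def longer_win_rate(pairs: list[JudgedPair]) -> float:
    """Share of pairs in which the judge picked the longer answer.

    Only meaningful on pairs humans rated as equal in quality. There, a rate
    well above 0.5 suggests length is being rewarded for its own sake.
    Pairs whose answers have the same word count are skipped.
    """
    longer_wins = 0
    counted = 0
    for pair in pairs:
        len_a = len(pair.answer_a.split())
        len_b = len(pair.answer_b.split())
        if len_a == len_b:
            continue
        longer = "a" if len_a > len_b else "b"
        counted += 1
        if pair.judge_pick == longer:
            longer_wins += 1
    if counted == 0:
        raise ValueError("no pairs with differing lengths")
    return longer_wins / counted
```

A rate near 0.5 on human-tied pairs is what an unbiased judge would produce. How far above that you tolerate is a call for your own product.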

Self-preference, or self-enhancement bias, is the judge favoring text that sounds like its own. On MT-Bench, GPT-4 gave its own answers a win rate about 10 percentage points higher than human graders did, and Claude-v1 favored itself by around 25 points. The mechanism is not vanity. Arjun Panickssery, Sam Bowman, and Shi Feng showed that a model can recognize its own writing, and that its self-recognition ability rises and falls in near lockstep with how much it prefers its own outputs. Grade your GPT-4-powered product with a GPT-4 judge and you have built a system that rewards itself for sounding like itself.
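The same comparison can be run on your own data: for each model, take the win rate the judge gives it and subtract the win rate humans give it on the same pairs. A minimal sketch, assuming every comparison has both a human winner and a judge winner and that ties were dropped beforehand. The record layout and the `win_rate_gap` name are hypothetical.

```python
from collections import defaultdict


def win_rate_gap(records):
    """Per model: judge win rate minus human win rate on the same comparisons.

    Each record is (model_a, model_b, human_winner, judge_winner), where both
    winners are one of the two model names. A large positive gap for the
    judge's own model family is the self-preference signature.
    """
    appearances = defaultdict(int)
    human_wins = defaultdict(int)
    judge_wins = defaultdict(int)
    for model_a, model_b, human_winner, judge_winner in records:
        for model in (model_a, model_b):
            appearances[model] += 1
        human_wins[human_winner] += 1
        judge_wins[judge_winner] += 1
    return {
        model: (judge_wins[model] - human_wins[model]) / count
        for model, count in appearances.items()
    }
```

A gap that stays positive for the judge's own family and near zero for everything else is the pattern to watch for.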

There is a quieter version of the same problem. Yang Liu and colleagues, building the G-Eval scorer that reached a 0.514 Spearman correlation with humans on summarization, flagged that their LLM evaluator carried a bias toward machine-written text over human-written text of equal quality. The judge does not just prefer itself. It prefers the whole family of model output it came from.

Build a small human-labeled set and measure agreement

Validating a judge is the same move you would make for any grader you did not write yourself. Get a sample of your real outputs labeled by the person whose judgment the product is supposed to carry, then check whether the judge agrees with that person often enough to trust.

The concrete version. Pull 50 to 100 real turns, weighted toward the moments that decide trust rather than the easy middle. Have your domain expert grade them, pass or fail against the rubric, and freeze that as your calibration set. Run the judge over the same turns and compare. OpenAI's own evals guide tells you to do exactly this: model grading has an error rate, so you validate against human judgment before running at scale. It packages the check as a meta-eval that holds the judge's choices against human choice labels and reports how often they match.

Report raw agreement, and report it as a chance-corrected number too. Cohen's kappa strips out the agreement you would get by luck, which matters most on the imbalanced trust moments where guessing pass is right most of the time. The practitioner Eugene Yan reports judges landing at a kappa of 0.3 to 0.5 on hard categorical calls, fair to moderate agreement by the usual reading, a useful reminder that raw percent agreement flatters a judge in a way a stricter metric does not.
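To see why the chance correction matters, it helps to compute both numbers on a toy set. The sketch below implements the standard Cohen's kappa formula, (observed - expected) / (1 - expected), in plain Python. The function name and the toy labels are illustrative assumptions, not part of any guide cited here.

```python
from collections import Counter


def agreement_and_kappa(human, judge):
    """Return (raw agreement, Cohen's kappa) for two equal-length label lists."""
    if len(human) != len(judge) or not human:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(human)
    observed = sum(h == j for h, j in zip(human, judge)) / n

    # Chance agreement: probability both raters pick the same label if each
    # labeled at random according to their own label frequencies.
    human_freq = Counter(human)
    judge_freq = Counter(judge)
    expected = sum(
        (human_freq[label] / n) * (judge_freq[label] / n)
        for label in set(human) | set(judge)
    )
    if expected == 1.0:
        # Both raters used one identical label throughout; kappa is undefined.
        return observed, float("nan")
    return observed, (observed - expected) / (1 - expected)


# Imbalanced toy set: 18 passes and 2 fails, graded by a judge that says
# "pass" to everything.
human = ["pass"] * 18 + ["fail"] * 2
judge = ["pass"] * 20
raw, kappa = agreement_and_kappa(human, judge)
print(f"raw agreement={raw:.2f}, kappa={kappa:.2f}")
# prints: raw agreement=0.90, kappa=0.00
```

A judge that never fails anything scores 90 percent raw agreement on this set and a kappa of zero, which is the gap between the two metrics in miniature.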

For position bias, you do not need labels at all. You need discipline in the harness. Run every pairwise comparison twice with the two answers swapped, and only score a win when the judge picks the same answer both times. Call it a tie when it flips. Zheng's team recommends exactly this, and it converts an invisible bias into an honest tie you can count.
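The swap rule is small enough to sketch in full. This assumes a judge callable that takes a prompt and two answers and returns "first" or "second". That signature, and every name below, is hypothetical and would need adapting to whatever your judge actually returns.

```python
from typing import Callable

# Hypothetical judge signature: (prompt, first_answer, second_answer) -> verdict,
# where the verdict is "first" or "second".
Judge = Callable[[str, str, str], str]


def position_controlled_verdict(
    judge: Judge, prompt: str, candidate: str, baseline: str
) -> str:
    """Return "candidate", "baseline", or "tie" after judging both orders."""
    forward = judge(prompt, candidate, baseline)
    backward = judge(prompt, baseline, candidate)

    # Map each slot-based verdict back to the answer that occupied the slot.
    forward_pick = "candidate" if forward == "first" else "baseline"
    backward_pick = "baseline" if backward == "first" else "candidate"

    # A win counts only when the same answer wins in both orders.
    return forward_pick if forward_pick == backward_pick else "tie"
```

A judge that always picks the first slot produces nothing but ties under this rule, so the tie rate itself becomes a rough gauge of how position-sensitive the judge is.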

A validated judge still drifts, and people are the bottleneck

The catch is that calibration is a moment, not a property. The judge you validated in March is grading a different rubric by June, because your own standards move as you look at outputs. Shreya Shankar and colleagues named this criteria drift in their EvalGen study. People cannot fully specify what they want until they grade real outputs, and grading changes what they want, so a judge aligned to last quarter's criteria slowly falls out of step with this quarter's. A judge is not something you calibrate once and forget. It is something you re-check on a fresh labeled slice whenever the rubric moves, which puts the human labels back on the critical path you were trying to scale off of. Cheap grading rides on expensive labels.
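The re-check can be automated as a gate that compares agreement on a fresh labeled slice with the figure recorded at the last validation. A minimal sketch using raw agreement for brevity; a chance-corrected metric could be swapped in. The function name and the `max_drop` tolerance are hypothetical choices, not values taken from the EvalGen study.

```python
def needs_recalibration(baseline_agreement, fresh_human, fresh_judge, max_drop=0.05):
    """True when judge-human agreement on a fresh slice falls too far below baseline.

    baseline_agreement: agreement measured when the judge was last validated.
    max_drop: hypothetical tolerance; choose one that fits your own risk.
    """
    if len(fresh_human) != len(fresh_judge) or not fresh_human:
        raise ValueError("need two non-empty label lists of equal length")
    matches = sum(h == j for h, j in zip(fresh_human, fresh_judge))
    fresh_agreement = matches / len(fresh_human)
    return (baseline_agreement - fresh_agreement) > max_drop
```

Running this whenever the rubric or the judge model changes turns drift from something you discover late into something the pipeline reports.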

The cost has a second edge. The strongest way to blunt self-preference and lift agreement is to judge with a more capable model than the one under test, which the OpenAI guide recommends outright. That model costs more per call, and running it twice for position control doubles the bill again. On a live product grading every turn, a judge that is a frontier model swapped in both orders is a real line item, not a rounding error. The honest tradeoff is that a trustworthy judge is slower and more expensive than the naive one, and it never fully removes the human from the loop. It just moves the human from grading everything to grading enough to keep the machine honest.
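The arithmetic is simple enough to put in a helper so the line item is visible before you commit to it. Every number below is a placeholder, not real traffic or real pricing, and the function name is hypothetical.

```python
def weekly_judge_cost(turns_per_week, tokens_per_call, price_per_million_tokens, orders=2):
    """Rough weekly spend for judging every turn, in the price's currency.

    orders=2 models the swapped-order run. All inputs are placeholders to be
    replaced with your own traffic and your provider's actual pricing.
    """
    calls = turns_per_week * orders
    return calls * tokens_per_call * price_per_million_tokens / 1_000_000


# Placeholder numbers for illustration only, not real pricing.
single = weekly_judge_cost(50_000, 2_000, 10.0, orders=1)
swapped = weekly_judge_cost(50_000, 2_000, 10.0, orders=2)
print(single, swapped)
# prints: 1000.0 2000.0
```

The point of the sketch is the multiplier, not the totals: position control doubles whatever the per-call price turns out to be.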

The judge is part of the product, so test it like one

The clean rule is short. A judge you have not measured against human labels is not a measurement, it is a guess with a decimal point. Treat the grader as a component that ships, with its own agreement number, its own regression check when you change the rubric or the judge model, and its own known biases you control for in the harness rather than hope away. Validating the grader is the layer under the rest of the suite, and it is what keeps that suite honest once you have decided that the eval suite is the durable advantage. The failures the judge misses become new labeled examples, the same way a production miss becomes a new eval, so the calibration set sharpens every time the judge gets something wrong.

The open question worth sitting with. Your judge and your product will increasingly be the same family of model, maybe the same model. When the thing being graded and the thing grading it share a bias, human labels are the only outside signal that can catch it. How small can that labeled set get before the judge is grading in an echo chamber, and no one is left to hear it?

Sources and further reading

Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. Zheng et al., NeurIPS 2023
LLM Evaluators Recognize and Favor Their Own Generations. Panickssery, Bowman, Feng, NeurIPS 2024
Verbosity Bias in Preference Labeling by Large Language Models. Saito et al., 2023
G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment. Liu et al., EMNLP 2023
Who Validates the Validators? Aligning LLM-Assisted Evaluation of LLM Outputs with Human Preferences. Shankar et al., UC Berkeley, UIST 2024
Getting Started with OpenAI Evals. OpenAI, OpenAI Cookbook
Evaluating the Effectiveness of LLM-Evaluators (aka LLM-as-Judge). Eugene Yan, 2024

Bring us the hardest moment in your product.

We build the evals that define a good answer and the loops that keep a conversational product improving. Tell us where yours is hard to measure and we will map what it takes.

