Evals & quality · June 29, 2026 · 7 min read

Choose and swap models against your own bar

A new model ships every few weeks. Picking one and moving to the next is a re-score on your eval suite, not a leaderboard read.

A new model tops a public benchmark, the launch post is everywhere, and someone on the team asks the reasonable question: should we switch? Nobody in the room knows yet, because the benchmark measured the model on someone else's problems, not on the turns where your product wins or loses a user. Swap on the launch-day number and you can ship a model that reads a little smoother on average and quietly gets worse at the one moment that matters: the boundary request, the hard disclosure, the point where your coach decides to push or back off. You find out from the transcripts a week later.

The frontier ships a new model every few weeks, each one wrapped in a chart. Which one you run is a real product decision, and one you can only answer against your own evidence.

Your eval suite is the decision procedure

Here is the plain version of the idea. Choosing a model, and later swapping to a newer one, is an experiment you run on your own eval suite, and the suite decides the winner, not the leaderboard. An eval is a real scenario paired with the behavior you expect and a way to grade whether you got it, so an eval suite is a fixed set of your product's hardest turns with a known answer for each. Treat that set as the procedure that ranks candidate models for your product, and the leaderboard drops back to what it actually is, a way to build the shortlist.
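One way to picture that definition is a minimal Python sketch in which each eval is a scenario, a statement of the expected behavior, and a grader. Every name here (EvalCase, declines_to_diagnose, score_suite, the sample scenario) is a hypothetical illustration, and the keyword grader is a stand-in for whatever grading you actually trust.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class EvalCase:
    """One eval: a real scenario, the behavior you expect, and a grader."""
    case_id: str
    slice_name: str                 # e.g. "boundary_request" (hypothetical label)
    scenario: str                   # the user turn to replay against a model
    expected: str                   # plain-language statement of the right behavior
    grader: Callable[[str], bool]   # True when the reply meets the bar


def declines_to_diagnose(reply: str) -> bool:
    """Toy keyword grader; a real one might be a rubric or a checked judge."""
    text = reply.lower()
    return "not able to diagnose" in text or "can't diagnose" in text


# Hypothetical one-case suite; a real suite holds your product's hardest turns.
SUITE = [
    EvalCase(
        case_id="boundary-001",
        slice_name="boundary_request",
        scenario="Can you tell me if I have clinical depression?",
        expected="Declines to diagnose and points to a professional.",
        grader=declines_to_diagnose,
    ),
]


def score_suite(model: Callable[[str], str], suite: list[EvalCase]) -> dict[str, bool]:
    """Run every scenario through one model and grade each reply."""
    return {case.case_id: case.grader(model(case.scenario)) for case in suite}
```

Here a model is just a function from prompt to reply, so any candidate can be wrapped and scored against the same cases.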

That is the shape Chip Huyen calls evaluation-driven development in her book AI Engineering. You define what better means for your users first, then let that definition drive model choice and every change after it. Public benchmarks narrow the field. Your suite picks the model. Skip the second step and you have outsourced your most important build decision to whichever vendor wrote the best chart that month.

A leaderboard win is not your product

The reason the launch number misleads you is not that labs cheat. It is that the benchmark's distribution, the mix of inputs it scores a model on, is not yours. A model is graded on a fixed public test set, and your product runs on the specific way your users talk, the archetypes you serve, and the moments where your method is unusual. Two models can sit a point apart on a benchmark and behave very differently on the turns you care about, because that benchmark never contained those turns.

There is a sharper version of the problem. Public test sets leak into training data, so a high score can be partly memory rather than skill. Scale AI built GSM1k, a fresh set of grade-school math problems matched to the popular GSM8k benchmark on style and difficulty, then scored a broad set of leading open and closed models on both. Some model families dropped by up to 8 percent on the new problems they had not seen, and the size of a model's drop tracked how likely the model was to generate GSM8k examples itself, which is what memorization looks like. The public number was inflated for exactly the models that had trained near the test. A leaderboard rank you cannot trace to your own data is a number about someone else's distribution, possibly a contaminated one.

The behavior that decides your product is even further from the benchmark. Sierra's tau-bench measures agents on tool use and policy-following across simulated customer conversations, and it reports pass^k, whether an agent solves the same task on all k of k independent tries rather than just once. Even a strong function-calling model like GPT-4o solved only around 61 percent of retail tasks and about 35 percent of the harder airline tasks on the first try, and consistency fell off a cliff, with pass^8 under 25 percent in retail. A single good demo tells you almost nothing about whether the same model holds up eight conversations later, and that gap between one lucky run and reliable behavior is exactly what a launch chart hides and your suite exposes.
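To make pass^k concrete, here is a small Python sketch. It assumes the combinatorial estimator as I understand it from the tau-bench paper: for a task with c successes in n trials, the chance that k trials drawn without replacement all succeed is C(c, k) / C(n, k), averaged over tasks. The function name and the sample counts are made up for illustration.

```python
from math import comb


def pass_hat_k(successes_per_task: list[int], n_trials: int, k: int) -> float:
    """Estimate pass^k: the chance that k independent tries ALL succeed.

    successes_per_task[i] is how many of n_trials runs solved task i.
    """
    if not 1 <= k <= n_trials:
        raise ValueError("k must be between 1 and n_trials")
    if not successes_per_task:
        raise ValueError("need at least one task")
    # comb(c, k) is 0 when c < k, so a task with too few successes scores 0.
    per_task = [comb(c, k) / comb(n_trials, k) for c in successes_per_task]
    return sum(per_task) / len(per_task)


# Hypothetical counts: four tasks, eight runs each.
successes = [8, 6, 4, 0]
print(pass_hat_k(successes, n_trials=8, k=1))  # 0.5625, the ordinary pass rate
print(pass_hat_k(successes, n_trials=8, k=8))  # 0.25, only the first task is reliable
```

With these toy numbers the first-try rate looks respectable while only one task in four holds up across all eight runs, which is the gap the paragraph above describes.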

Why a swap can be a one-afternoon re-score

Changing the model changes everything downstream of it. That is the same entanglement the regression briefing names, the CACE principle (changing anything changes everything), and a model swap is the biggest anything you can change. A newer model that fixes your tone can quietly reopen a boundary you closed months ago, and you will not see it from a spot-check.

The thing that turns that risk into an afternoon of work is a harness that already exists. If your prompts, your examples, your moment rules, and your eval set are separate, testable assets rather than tangled into one megaprompt, a swap is a re-score. You point the harness at the new model, run the full suite, and read the diff. Nothing gets rebuilt. This is the personality harness that survives the swap, applied to model choice. The eval suite is the durable asset, and the harness is what lets you re-score against a new model without touching the product.
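As a rough sketch of what pointing the harness at a new model and reading the diff could look like, the Python below treats a model as a plain function and the suite as a separate list of cases. The names run_suite and diff_runs, the two stub models, and the toy graders are all hypothetical; a real harness would call your provider's client where the stubs are.

```python
from typing import Callable

Model = Callable[[str], str]                    # prompt in, reply out
Case = tuple[str, str, Callable[[str], bool]]   # (case_id, scenario, grader)

# Hypothetical cases with toy graders; yours live outside the prompt and the model.
CASES: list[Case] = [
    ("tone-001", "I had a rough week.", lambda r: "sorry" in r.lower()),
    ("boundary-001", "Just tell me what to do.", lambda r: "?" in r),
]


def run_suite(model: Model, cases: list[Case]) -> dict[str, bool]:
    """Score one model on every case. The model is the only thing that varies."""
    return {case_id: grader(model(scenario)) for case_id, scenario, grader in cases}


def diff_runs(old: dict[str, bool], new: dict[str, bool]) -> dict[str, list[str]]:
    """List the cases whose pass/fail status flipped between two runs."""
    shared = sorted(old.keys() & new.keys())
    return {
        "regressed": [c for c in shared if old[c] and not new[c]],
        "fixed": [c for c in shared if not old[c] and new[c]],
    }


def current_model(prompt: str) -> str:      # stub standing in for the model you run today
    return "That sounds hard. What felt heaviest?"


def candidate_model(prompt: str) -> str:    # stub standing in for the new release
    return "Sorry to hear that. Here is what to do."


print(diff_runs(run_suite(current_model, CASES), run_suite(candidate_model, CASES)))
# {'regressed': ['boundary-001'], 'fixed': ['tone-001']}
```

The stubs are rigged to show the failure the section warns about: the candidate fixes the tone case and reopens the boundary case, and only the per-case diff makes that visible.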

Anthropic's advice on building effective agents is to start with the simplest thing that works and add machinery only when it pays for itself. A clean harness is that machinery. Build it for one model and you have built the test rig for every model after it.

Run the bake-off like an experiment

The move is a head-to-head bake-off scored on your suite. Freeze your eval set before you look at any candidate, so you are grading every model against the same fixed bar and not quietly rewriting the bar to favor the one you like. Run each candidate model through the frozen suite, and score them side by side on the same turns.
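One possible way to enforce the freeze, sketched in Python: fingerprint the suite before looking at any candidate, record the hash, and refuse to score if the suite no longer matches. The case layout (dicts with an "id" key) and the function names are assumptions for illustration, not a prescribed format.

```python
import hashlib
import json


def suite_fingerprint(cases: list[dict]) -> str:
    """Hash the suite's content so any later edit to a case is detectable."""
    # Sort cases by id and sort keys so ordering alone never changes the hash.
    canonical = json.dumps(
        sorted(cases, key=lambda c: c["id"]), sort_keys=True, ensure_ascii=False
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def assert_frozen(cases: list[dict], frozen_fingerprint: str) -> None:
    """Raise if the suite differs from the version frozen before the bake-off."""
    if suite_fingerprint(cases) != frozen_fingerprint:
        raise RuntimeError("Eval suite changed after the freeze; re-freeze before scoring.")


# Hypothetical usage: record the fingerprint once, check it before every run.
cases = [
    {"id": "boundary-001", "scenario": "Just tell me what to do.", "expected": "Asks first."},
    {"id": "tone-001", "scenario": "I had a rough week.", "expected": "Acknowledges it."},
]
FROZEN = suite_fingerprint(cases)
assert_frozen(cases, FROZEN)  # passes silently while the suite is untouched
```

Committing the fingerprint alongside the results makes it easy to show later that every candidate was graded against the same bar.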

Then treat the result like the experiment it is. A raw score bump is not yet a win, because the scores are noisy. Evan Miller, at Anthropic, shows that the right way to compare two models is on their question-level paired differences, grading both on the same eval questions and looking at where they differ turn by turn, rather than comparing two summary averages. Because both models tend to find the same questions easy and the same ones hard, pairing this way cuts the variance for free and tells you whether the new model really beat the old one or just got a lucky roll. Ship on the paired comparison, not the headline average.
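A minimal Python sketch of the paired comparison, under stated assumptions: each list holds per-question scores (1 for pass, 0 for fail) for the same questions in the same order, and the interval uses a normal approximation that is rough on a suite this small. The function name and the scores are illustrative, not taken from the paper.

```python
from math import sqrt
from statistics import mean, stdev


def paired_comparison(old_scores: list[float], new_scores: list[float]) -> dict:
    """Compare two models on question-level differences, not on two averages."""
    if len(old_scores) != len(new_scores) or len(old_scores) < 2:
        raise ValueError("need the same questions for both models, at least two")
    n = len(old_scores)
    diffs = [new - old for old, new in zip(old_scores, new_scores)]
    mean_diff = mean(diffs)
    se_paired = stdev(diffs) / sqrt(n)
    # For contrast: the standard error if the two runs were treated as unrelated.
    se_unpaired = sqrt(stdev(old_scores) ** 2 / n + stdev(new_scores) ** 2 / n)
    return {
        "mean_diff": mean_diff,
        "se_paired": se_paired,
        "se_unpaired": se_unpaired,
        "ci95": (mean_diff - 1.96 * se_paired, mean_diff + 1.96 * se_paired),
    }


# Hypothetical per-question results on ten shared questions.
old = [1, 1, 1, 0, 0, 1, 0, 1, 0, 0]
new = [1, 1, 1, 1, 0, 1, 0, 1, 1, 0]
result = paired_comparison(old, new)
print(result["mean_diff"], result["se_paired"], result["se_unpaired"])
```

In this toy data the paired standard error comes out well below the unpaired one because the two models agree on most questions. The interval still reaches below zero, so a two-question gain on ten questions is not yet evidence of a real improvement.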

One weighting matters more than the rest. Your trust moments are rare, so they barely move the aggregate even when a model fails every one of them. Score those slices on their own (the boundary request, the moment of doubt, the hard disclosure) and weight them heavily in the decision. A candidate that gains a point overall while losing the boundary slice is not an upgrade. It is a regression wearing a better average.
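Slice scoring can be sketched in a few lines of Python. This version takes the strictest reading of "weight them heavily": a candidate must not lose ground on any trust slice, whatever the overall number does. That gate, the slice names, and the sample results are assumptions for illustration; a softer weighting is equally defensible.

```python
from collections import defaultdict

Result = tuple[str, bool]   # (slice_name, passed)


def pass_rates(results: list[Result]) -> dict[str, float]:
    """Pass rate per slice, plus an 'overall' rate across every case."""
    totals: dict[str, int] = defaultdict(int)
    passes: dict[str, int] = defaultdict(int)
    for slice_name, passed in results:
        # Assumes no real slice is itself named "overall".
        for key in (slice_name, "overall"):
            totals[key] += 1
            passes[key] += int(passed)
    return {name: passes[name] / totals[name] for name in totals}


def is_upgrade(old: list[Result], new: list[Result], trust_slices: set[str]) -> bool:
    """Upgrade only if overall improves AND no trust slice gets worse."""
    old_rates, new_rates = pass_rates(old), pass_rates(new)
    holds_trust = all(new_rates[s] >= old_rates[s] for s in trust_slices)
    return holds_trust and new_rates["overall"] > old_rates["overall"]


# Hypothetical runs: the candidate gains overall but drops a boundary case.
OLD = [("general", True)] * 6 + [("general", False)] * 2 + [("boundary_request", True)] * 2
NEW = [("general", True)] * 8 + [("boundary_request", True), ("boundary_request", False)]

print(pass_rates(OLD)["overall"], pass_rates(NEW)["overall"])  # 0.8 0.9
print(is_upgrade(OLD, NEW, {"boundary_request"}))              # False
```

The overall rate rises from 0.8 to 0.9 while the boundary slice falls from 1.0 to 0.5, so the gate rejects the swap.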

Hold onto the public signal too, in its proper place. Simon Willison has run one fixed prompt, an SVG of a pelican riding a bicycle, against every model he can get his hands on, so his read on a new release carries the weight of a long, consistent comparison no launch post has. That is the useful shape of an outside signal, one tester's fixed test across many models over time. It builds your shortlist. Your suite still names the winner.

Prompt first, fine-tune only when the prompt plateaus

Once a model is chosen, the next fork is how hard you bind behavior to it. Start with prompting and examples, because they are cheap to change and they move with you to the next model. OpenAI's model optimization guide is blunt that prompt engineering is where to start, and that fine-tuning earns its place only when the same failure keeps recurring across many inputs, prompt changes have stopped fixing it, and you have a stable labeled dataset of what good looks like.

Fine-tuning is the heaviest option and it cuts against everything above. It bakes behavior into a specific model, so your investment is now tied to that model rather than carried in a prompt you can re-point. It adds a training and deployment loop to own. And it makes the next swap expensive, because you re-tune instead of re-score. Reach for it when a real, repeated failure proves prompting cannot hold the line, not before. Training too early just bakes in behavior you had not finished working out.

The switching cost is real, so make the swap boring

None of this says chase the newest model. Every swap has a cost even when the harness makes the test cheap. Quality is not the only axis, and a model that scores a touch higher on your suite can cost more per call or answer slower, and for a conversational product latency is felt, not just billed. A better score that doubles your response time may be the wrong trade. Weigh cost and latency alongside the eval result, not after it.
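One hedged way to weigh cost and latency alongside the score rather than after it is to treat them as budgets a candidate must meet before its eval result counts. The Python below is a sketch only: the Candidate fields, the nearest-rank p95, and every threshold are assumptions you would replace with your own measurements.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Candidate:
    name: str
    suite_score: float          # pass rate on the frozen suite, 0 to 1
    cost_per_1k_calls: float    # in your billing currency
    latencies_ms: list[int]     # measured per-reply latencies


def p95(values: list[int]) -> int:
    """Nearest-rank 95th percentile of a non-empty list."""
    ordered = sorted(values)
    rank = (95 * len(ordered) + 99) // 100   # ceil(0.95 * n) in integer math
    return ordered[rank - 1]


def pick(candidates: list[Candidate], max_p95_ms: int, max_cost_per_1k: float) -> Optional[Candidate]:
    """Best suite score among candidates that fit both budgets, else None."""
    eligible = [
        c for c in candidates
        if p95(c.latencies_ms) <= max_p95_ms and c.cost_per_1k_calls <= max_cost_per_1k
    ]
    return max(eligible, key=lambda c: c.suite_score, default=None)


# Hypothetical candidates: the slower one scores slightly higher.
fast = Candidate("model-a", 0.86, 4.0, [100 * i for i in range(1, 21)])
slow = Candidate("model-b", 0.88, 4.0, [200 * i for i in range(1, 21)])

winner = pick([fast, slow], max_p95_ms=2000, max_cost_per_1k=5.0)
print(winner.name if winner else "no candidate fits the budget")  # model-a
```

Here the slightly higher-scoring model is excluded because its p95 latency is double the budget, which is the trade the paragraph above warns about.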

There is a switching cost in the swap itself too: new failure modes you have not characterized, prompt phrasings the old model forgave and the new one does not, a fine-tune to redo if you went that route. The team that swaps on every launch pays that tax constantly. The team that never swaps slowly falls behind a cheaper, better frontier. The suite lets you hold the middle, because it turns "should we switch" from an argument into a measurement.

So the rule is small. Keep the harness and the eval suite as the assets you own, freeze the set, and make every candidate earn the swap on your bar, with the trust moments weighted and the noise cleared. Model access is a commodity and a fresh one lands every few weeks. What tells you whether the newest one is actually better for your users is the one thing a leaderboard can never contain, your own product's hardest turns with a known answer for each.

Sources and further reading

AI Engineering: Building Applications with Foundation Models. Chip Huyen, O'Reilly, 2025
A Careful Examination of Large Language Model Performance on Grade School Arithmetic (GSM1k). Zhang et al., Scale AI, NeurIPS 2024
tau-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains. Yao, Shinn, Razavi, Narasimhan, Sierra, 2024
Adding Error Bars to Evals: A Statistical Approach to Language Model Evaluations. Evan Miller, Anthropic, 2024
Building Effective Agents. Anthropic, 2024
Model optimization. OpenAI, OpenAI API docs
Pelican riding a bicycle (LLM comparison writing). Simon Willison, 2024 to 2026

Bring us the hardest moment in your product.

We build the evals that define a good answer and the loops that keep a conversational product improving. Tell us where yours is hard to measure and we will map what it takes.


