Release discipline·April 16, 2025·3 min read

Catch the regression before your users do

A prompt tweak that helps one case can quietly break ten others. Gate every change.

The most dangerous change is the one that fixes the case in front of you and quietly breaks ten you weren’t looking at. With language models, that happens constantly: a prompt nudge to improve tone erodes a safety boundary three archetypes away.

Spot-checks don’t scale

Reading a few transcripts after each change feels responsible. It isn’t. On a busy day a hand-check skims right past the candidate that diagnoses, or deflects, or drifts. Manual review can’t cover the surface area, and it can’t catch what it isn’t looking for.

Gate every change

Run the full eval suite against every candidate, automatically, and block on regressions. The bar isn’t “the new version looks good,” it’s “the new version beats the old one and breaks nothing that mattered.”

A/B, then monitor

Compare the candidate to the baseline across the set, ship only when it wins, then watch production to confirm the win held. The failure you find in production becomes a new eval, and the loop closes a little tighter each time.

Newsletter

New thinking on building AI guidance products.

Essays on evals, observability, and shipping conversational guidance you can trust, straight to your inbox.

Subscribe on Substack →