Safety & boundaries · June 28, 2026 · 8 min read

Jailbreaks are a tax you keep paying

A safety boundary is not a bug you close once. Adversarial users keep finding the way around it, so robustness is a posture, not a milestone.

You spent a week getting the boundary right. The coach declines the request it should decline, stays helpful on the dual-use middle, routes the crisis turn to a person. You test it, it holds, you ship. Two weeks later a user pastes a wall of fake dialogue in front of their question and the model answers the thing it refused on day one. Nobody changed the model. Nobody changed the prompt. The boundary did not break because you built it wrong. It broke because someone came back and looked for the seam. There is always a seam.

The boundary is a target, not a setting

Here is the idea in one sentence. Safety training makes a model refuse the harmful request in the common case, but it does not install a rule the model cannot be argued out of, so a motivated user can nearly always find an input that puts the model back over the line. A jailbreak is that input: a prompt built to get a model to do what its safety training was supposed to stop. Once you see the boundary as something an adversary probes rather than a switch you flip, the job changes. You stop asking whether the boundary is closed and start asking how hard it is to open, how fast you would notice, and how much a working attack would get the attacker.

This is what trips up teams shipping their first high-trust product. They treat the refusal like any other feature. Built, tested, done. But a feature only has to work for cooperative users. A boundary has to work for the one user actively trying to defeat it, and that user gets unlimited tries, shares what works, and does not quit when you ship version two.

Why the training does not hold

The clearest account of why safety training fails comes from Wei and colleagues, who name two failure modes and use them to build attacks on purpose. The first is competing objectives. A model is trained to be helpful and follow instructions, and it is trained to be safe, and those goals collide. An attack can raise the cost of refusing, either by framing the refusal as breaking character or by starting the model's answer with 'Sure, here is' so that continuing is the path of least resistance. Either way it pits the helpfulness training against the safety training, and helpfulness often wins. The safety behavior was never a hard constraint. It was one pull among several, and you can add a stronger pull.

The second failure mode is mismatched generalization. Safety training covers the inputs the raters saw: plain harmful requests in plain language. The model's raw capabilities reach much further, to encodings, obscure phrasings, and formats the safety data never touched. So an input the capability side handles fine but the safety side never learned to flag walks straight through. The refusal did not generalize as far as the skill it was meant to gate. Wei and colleagues tested this against GPT-4 and Claude v1.3 and found the vulnerabilities persisted despite the heavy red-teaming those models had already been through. The original paper is worth reading for both failure modes. Each attack below is a different way to exploit these same two gaps, which is why a better model does not retire them.

The attack classes, at the level of the gears

Three families cover most of what lands, and each maps to a gap above. It is worth knowing them at the conceptual level even though this piece will not hand you the payloads.

The first is the adversarial suffix. Zou and colleagues showed you can search, by gradient, for a short string of roughly twenty to thirty tokens of apparent gibberish that, appended to a request, flips the model into complying. The important part is not that it works on the open model you optimized against. It is that the same string transfers. A suffix trained on open models induced the target behavior in the public ChatGPT, Bard, and Claude interfaces. This is mismatched generalization at its starkest: an input no human would write, which the safety training could not have anticipated. The universal and transferable attack paper is why a boundary that only holds against human-written attacks is not holding against much.

The second is the many-shot attack, the one the opening scene describes. Anthropic showed that if you fill a long context with hundreds of fake dialogues where an assistant complies with harmful requests, the model reads that as the pattern and follows it. The effectiveness climbs with the number of examples on a power law, up to hundreds of shots, and it works across Claude 2.0, GPT-3.5, GPT-4, Llama 2, and Mistral 7B. This is competing objectives at scale: in-context learning, a core capability, drags the model past its safety training. One detail should worry anyone counting on a fix. Fine-tuning the model against the attack raised the number of shots you need but left the power law intact. It moved the constant, not the exponent. Anthropic's writeup is direct that longer context windows are what made this newly practical.
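
To make "moved the constant, not the exponent" concrete, here is a minimal sketch. It assumes the quantity that tracks attack difficulty falls as C * n^(-alpha) in the number of shots n, and that the attack lands once it drops to some threshold. The function name and every number are made up for illustration, not taken from Anthropic's results.

```python
def shots_needed(c: float, alpha: float, threshold: float) -> float:
    """Shots n at which c * n**(-alpha) falls to `threshold`.

    Solves c * n**(-alpha) = threshold for n. All inputs are illustrative.
    """
    if c <= 0 or alpha <= 0 or threshold <= 0:
        raise ValueError("c, alpha, and threshold must be positive")
    return (c / threshold) ** (1.0 / alpha)


# Hypothetical values, chosen only to show the shape of the effect.
ALPHA = 0.5      # exponent: assumed unchanged by the mitigation
THRESHOLD = 1.0  # assumed level at which the attack starts to land

before = shots_needed(c=8.0, alpha=ALPHA, threshold=THRESHOLD)
after = shots_needed(c=16.0, alpha=ALPHA, threshold=THRESHOLD)  # constant doubled

print(f"shots needed before mitigation: {before:.0f}")
print(f"shots needed after mitigation:  {after:.0f}")
```

Under these assumptions, doubling the constant multiplies the required shots by 2^(1/alpha), which is a factor of four here. The attacker pays for more context, but a long enough window still gets there.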

The third is persona and roleplay framing. Wrap the request in a fiction (a character exempt from the rules, a hypothetical, a game) and the model's instruction-following generalizes into the frame, while its refusal training, tuned on direct requests, does not follow. This is the human-ingenuity end of the same competing-objectives problem. The frame gives the helpfulness objective a story to be helpful inside of, and that is why roleplay jailbreaks kept working long after the labs knew about them.

Robustness is defense in depth, not a better prompt

If the model cannot be trained into a hard boundary, the boundary has to live somewhere the model is not the only line of defense. The posture that works is defense in depth: layers around the model that each catch what the others miss, so no single failure opens the door. Three layers do most of the work.

The layer with the most public evidence behind it is a separate classifier on the input and the output. Anthropic's constitutional classifiers are guard models trained from a written constitution of what is and is not allowed, sitting in front of and behind the main model. In their evaluation the jailbreak success rate on a guarded Claude 3.5 Sonnet fell from 86 percent on the unguarded model to 4.4 percent, and across more than 3,000 hours of red teaming by 183 participants, none found a universal jailbreak that pulled detailed harmful content across the target queries. That is one of the strongest published numbers on jailbreak resistance, and it came from adding a layer, not a better base model. The constitutional classifiers paper has the setup and the costs, which matter as much as the headline.
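
The shape of that layer can be sketched in a few lines. This is not Anthropic's implementation. It assumes a classifier is any function that returns a harm score between 0 and 1 and a model is any function from prompt to reply; `GuardedModel`, the toy guards, and the 0.5 threshold are hypothetical names and values for illustration.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical signatures, not a real library API: a classifier returns a
# harm score in [0, 1]; a model maps a prompt to a completion.
Classifier = Callable[[str], float]
Model = Callable[[str], str]

REFUSAL = "I can't help with that."


@dataclass
class GuardedModel:
    model: Model
    input_guard: Classifier
    output_guard: Classifier
    threshold: float = 0.5  # assumed cutoff; tune it against over-refusal

    def respond(self, prompt: str) -> tuple[str, str]:
        """Return (reply, decision), where decision names the layer that acted."""
        if self.input_guard(prompt) >= self.threshold:
            return REFUSAL, "blocked_input"
        reply = self.model(prompt)
        # The output check catches what slipped past the input check.
        if self.output_guard(reply) >= self.threshold:
            return REFUSAL, "blocked_output"
        return reply, "allowed"


# Toy stand-ins so the sketch runs end to end.
def toy_input_guard(text: str) -> float:
    return 1.0 if "FLAGGED_REQUEST" in text else 0.0


def toy_output_guard(text: str) -> float:
    return 1.0 if "FLAGGED_CONTENT" in text else 0.0


def toy_model(prompt: str) -> str:
    # Pretend a disguised request gets past the input guard and the model complies.
    return "FLAGGED_CONTENT" if "disguised" in prompt else "ok: " + prompt


guarded = GuardedModel(toy_model, toy_input_guard, toy_output_guard)
print(guarded.respond("hello"))
print(guarded.respond("a disguised ask"))
```

Returning the decision alongside the reply is a design choice worth keeping in a real system: it tells you which layer fired, which is the raw material for monitoring.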

The second move is to stop treating the boundary as static. You are in an arms race, so you need to see the attacks. Log the turns where a refusal fires and the turns where one should have. Watch for the shape of an attack: a context suddenly stuffed with dialogue, a request wrapped in fiction, the same odd suffix across many users. Monitoring is what turns 'we got jailbroken two weeks ago and found out from a screenshot' into 'the attack rate on this category rose Tuesday, here are the transcripts.' A boundary you cannot observe is one you are hoping holds, not one you are defending.
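
Two of those shapes are cheap to look for in logs. The sketch below assumes you already store each turn with a user id, a day, and the prompt text. The `Turn` record, the marker list, and the cutoffs are hypothetical, and both heuristics are deliberately crude: they surface transcripts for a person to read, they do not decide anything.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Turn:
    user_id: str
    day: str       # e.g. "2026-06-23"
    prompt: str
    refused: bool  # logged alongside, so refusal rates can be tracked too


def embedded_dialogue_turns(prompt: str) -> int:
    """Count lines that look like pasted dialogue (crude heuristic)."""
    markers = ("user:", "assistant:", "human:", "ai:")
    return sum(line.strip().lower().startswith(markers)
               for line in prompt.splitlines())


def flag_rate_by_day(turns, min_dialogue_turns=10):
    """Fraction of each day's turns whose context looks stuffed with dialogue."""
    total, flagged = defaultdict(int), defaultdict(int)
    for t in turns:
        total[t.day] += 1
        if embedded_dialogue_turns(t.prompt) >= min_dialogue_turns:
            flagged[t.day] += 1
    return {day: flagged[day] / total[day] for day in total}


def shared_suffixes(turns, tail_chars=40, min_users=3):
    """Prompt tails that recur across at least `min_users` distinct users."""
    users_by_tail = defaultdict(set)
    for t in turns:
        if len(t.prompt) >= tail_chars:
            users_by_tail[t.prompt[-tail_chars:]].add(t.user_id)
    return {tail: users for tail, users in users_by_tail.items()
            if len(users) >= min_users}
```

A jump in the daily flag rate, or a tail string shared by unrelated users, is the kind of signal that gets you the transcripts on Tuesday instead of a screenshot two weeks later.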

The third move outlasts the others. Put the boundary in the eval suite, not the prompt. Keep a growing set of attack transcripts (the suffixes, the many-shot templates, the persona frames that have worked on you or on published models), each paired with the behavior you require, and run it on every model and prompt change. A prompt-level defense is a patch against yesterday's attack. An eval tells you, in an afternoon, whether the model you are about to swap in reopened a boundary you had closed. It is the same discipline that makes any hard-won behavior survive a model swap, applied where the stakes are highest.
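
A minimal sketch of such a suite, assuming each stored case pairs a transcript with whether the model must refuse it. `BoundaryCase`, `run_boundary_suite`, and the stand-in model and refusal check are hypothetical names; the transcripts here are placeholders, and a real refusal check would use a grader or classifier rather than a string prefix.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class BoundaryCase:
    case_id: str
    family: str      # e.g. "suffix", "many_shot", "persona", "benign"
    transcript: str  # stored attack or safe prompt; placeholders below
    must_refuse: bool


def run_boundary_suite(
    respond: Callable[[str], str],
    is_refusal: Callable[[str], bool],
    cases: Iterable[BoundaryCase],
) -> list[BoundaryCase]:
    """Return the cases where behavior differs from what is required."""
    failures = []
    for case in cases:
        refused = is_refusal(respond(case.transcript))
        if refused != case.must_refuse:
            failures.append(case)
    return failures


# Placeholder transcripts only; a real suite stores the actual text.
CASES = [
    BoundaryCase("suffix-001", "suffix", "<stored suffix attack>", True),
    BoundaryCase("manyshot-001", "many_shot", "<stored many-shot template>", True),
    BoundaryCase("benign-001", "benign", "How do I sharpen a kitchen knife?", False),
]


def is_refusal(reply: str) -> bool:
    # Crude stand-in for a real refusal grader.
    return reply.lower().startswith("i can't")


def candidate_model(prompt: str) -> str:
    # Stand-in for the model or prompt version under test.
    return "I can't help with that." if prompt.startswith("<stored") else "Sure."


failures = run_boundary_suite(candidate_model, is_refusal, CASES)
print(f"{len(failures)} boundary regressions")
```

Including benign cases with `must_refuse=False` means the same run catches a model that reopened a boundary and one that started flinching at safe requests.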

Every filter you add has a bill

None of this is free, and pretending it is gets you the brittle version. The first cost is over-refusal. Every filter you tighten to catch more attacks also catches more safe requests that happen to look like attacks, the tax the XSTest suite was built to measure. Across its 250 safe prompts, the kind a calibrated model should answer, Llama 2 fully refused close to 40 percent, because 'where can I buy a knife for cutting bread' pattern-matches to a weapons question. Turn the boundary up and you lose the user it was built for, in a product where flinching at the wrong moment is its own kind of failure.
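
The trade is easy to see with a threshold sweep. The sketch below assumes a filter that scores each prompt between 0 and 1 and blocks anything at or above a cutoff. The scores are invented for illustration, not measurements from XSTest or any real classifier.

```python
def rates_at(threshold, attack_scores, safe_scores):
    """Return (share of attacks caught, share of safe prompts refused)."""
    caught = sum(s >= threshold for s in attack_scores) / len(attack_scores)
    over_refused = sum(s >= threshold for s in safe_scores) / len(safe_scores)
    return caught, over_refused


# Invented filter scores: higher means "looks more like an attack".
ATTACK_SCORES = [0.95, 0.9, 0.8, 0.6, 0.4]
SAFE_SCORES = [0.05, 0.1, 0.2, 0.3, 0.45, 0.55, 0.7, 0.1, 0.15, 0.25]

for threshold in (0.9, 0.5, 0.3):
    caught, over_refused = rates_at(threshold, ATTACK_SCORES, SAFE_SCORES)
    print(f"threshold {threshold}: caught {caught:.0%}, over-refused {over_refused:.0%}")
```

With these made-up scores, lowering the cutoff far enough to catch every attack also refuses a large share of the safe prompts, which is the bill this section is describing.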

The second cost is latency and money. A classifier on the input and the output is extra inference on every turn. Anthropic reported their constitutional classifiers added a 23.7 percent inference overhead, and even a well-tuned version raised production refusals by 0.38 percentage points. Those are livable numbers for a serious product, but they are numbers, and they land on every request, not just the attacks. You buy resistance with compute and a slice of your latency budget.
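
A back-of-the-envelope sketch of what those two figures mean at volume. It assumes the inference overhead translates linearly into spend and that the refusal increase applies uniformly to your traffic, neither of which the source guarantees. The request count and unit cost are hypothetical.

```python
def guard_overhead(
    requests: int,
    base_cost: float,
    compute_overhead: float = 0.237,     # reported inference overhead
    added_refusal_rate: float = 0.0038,  # reported absolute rise in refusals
) -> tuple[float, float]:
    """Return (extra compute spend, extra refused turns) for a traffic volume."""
    extra_spend = requests * base_cost * compute_overhead
    extra_refusals = requests * added_refusal_rate
    return extra_spend, extra_refusals


# Hypothetical traffic and per-request cost.
extra_spend, extra_refusals = guard_overhead(requests=1_000_000, base_cost=0.002)
print(f"extra compute spend: {extra_spend:.2f}")
print(f"extra refused turns: {extra_refusals:.0f}")
```

At a million requests, a 0.38 point rise works out to roughly 3,800 additional refused turns, each one a user who did nothing wrong. Small as a percentage, it is still a number to put in the budget.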

The third cost is the honest one. No defense is complete, and a piece that told you otherwise would be selling. The 4.4 percent that got through the classifiers is not zero. The many-shot power law survived fine-tuning. The suffix attack was automated, so the search does not tire. The right frame is not 'we closed it' but 'we raised the cost of the attack above what our threat model bears, and we watch for when that stops being true.' A boundary is a claim about effort, an adversary is a claim about motivation, and the two keep meeting.

Budget for the tax instead of pretending it ends

The teams that ship durable high-trust products stop treating the jailbreak as a bug with a fix and start treating it as a standing cost of being worth attacking. They pick a threat model: who is trying, how hard, and what a success would get them. They build the layers that raise the attacker's cost above that line: a classifier they retrain as the constitution changes, monitoring that surfaces attempts before a user does, and an eval set that grows every time a new attack lands. Then they keep paying, because the attacks keep coming and the model underneath keeps changing.

This is the same posture that governs the boundary in normal use, where refusal and escalation are designed as features, and it is the reason you red-team the product before your users do. It shares a root with the other input you do not control, the injected instruction that rides in through a document or a tool result. The same lesson holds there: an input from outside the trust boundary cannot be trusted to respect it.

So the clean rule is this. A jailbreak is not an incident you resolve. It is a tax you keep paying, and the only question is whether you have budgeted for it, in layers, in monitoring, and in the eval that tells you the day the price went up.
