A guidance product spends most of its life on easy turns. Then a user types the one sentence that actually matters. They mention they have not eaten in three days. They ask the coach whether they should stop taking the medication their doctor prescribed. They disclose something a mandated reporter would have to act on. There are two ways to fail this turn, and plenty of products manage both at once. Answer with a confidence the product has no business having. Or freeze, paste a disclaimer, and leave the person alone with the hard thing. The boundary is the part of the product that handles those turns, and it is usually the part nobody designed.
Three moves, not one switch
Refusal and escalation are features, the same as any other behavior the product ships. It helps to pin the words. A refusal is the product declining to answer. A safe completion is a bounded answer that stays inside the line, the high-level response a user can act on without the dangerous specifics. An escalation is the handoff, routing the turn to a person or a crisis resource the product is not allowed to replace. A real safety design decides which of the three a given moment gets, and how it is worded. A disclaimer decides nothing.
Why the switch fails both ways
The instinct is a switch. Spot a risky topic, block the answer. It fails in both directions at once. Block too little and the model answers what it should not, because a model does not hold a boundary on its own. Block too much and you break the product for the people it was built for. That second failure is measurable. Over-refusal, declining a safe request because it pattern matches a dangerous one, is what the XSTest suite was built to catch. Across its 250 safe prompts, the kind a calibrated model should answer, Llama 2 flatly refused 38 percent and partly refused another 21.6 percent. 'Where can I buy a knife for cutting bread' reads like a weapons question to a nervous classifier.
The deeper issue is that 'is this input dangerous' is the wrong question. OpenAI rebuilt its refusal training around that point. Its safe-completions work scores the safety of the model's output rather than sorting the user's intent into safe or unsafe, so the model answers as helpfully as the policy allows instead of choosing between comply and refuse. The Model Spec now names refusal and safe completion as two separate sanctioned moves, which is the admission that a hard block is not the only safe answer.
Build the boundary like a behavior
First, write the boundary as small, checkable rules, not one paragraph that says be safe. DeepMind built Sparrow with 23 specific rules, informed by experts, and graded them one at a time with a separate model that flags which rule a turn broke. Small units are why you can say where the boundary failed instead of rereading the whole prompt. Even after that work, testers still got Sparrow to break a rule 8 percent of the time, which is the honest number a boundary should keep about itself.
Second, prefer a safe completion to a hard refusal wherever the policy allows one. A user asking about a drug interaction can get the high-level "that combination can be dangerous, here is who to ask" instead of silence. Staying helpful inside the line is the win. This is the boundary work the method briefing names as a feature, taken down to the mechanism.
Third, design the handoff for the cases the product must not hold. In high-stakes domains this is the standard, not a courtesy. The APA's 2025 health advisory tells builders of mental-health tools to ship tested crisis-escalation pathways that route a user in crisis to human-led services, and it names the 988 Suicide and Crisis Lifeline, the US line that launched in July 2022. NIST files the same idea under a named risk it calls human-AI configuration in its Generative AI Profile, a person in the loop at the high-stakes step, treated as a control you design rather than a nicety. A handoff is a feature with a trigger, a destination, and a path you have actually tested.
Every dial cuts both ways
Pretending the settings are free is how you ship the brittle version. Turn safety up and you pay the over-refusal tax, that 38 percent, a product that flinches at 'kill time' and loses the user it was for. Wire a crisis flag that fires on a song lyric or a movie plot and people learn to route around it, and a boundary people route around is worse than none. None of these dials is set once, and a better model does not settle them. Models still cross the line under pressure, jailbreaks still land, and a model swap that fixes one boundary can quietly reopen another, which is why the boundary lives in the eval suite that survives the swap, not in a prompt you hope holds.
Decide who takes the hard turn
The boundary is a behavior with three settings, refuse, complete safely, or escalate. Spec it, test it on every change the way you test the rest, and watch the turns where it fires. Instrument the boundary request on its own, the way you would any trust moment, because the rare turn that decides trust barely moves the average even when it fails every time. The teams that get this right decide who handles the hardest turn before a user brings it, and they can show the handoff works. The ones that skip it find out from the transcript.
Sources and further reading
Work with Hunter Green