Topic

Safety and boundaries

The turn that decides trust is often the one the product should not answer alone. Refusal, escalation, and holding the line under adversarial pressure, designed and tested like any other behavior.

Our take

A guidance product spends most of its life on easy turns, and then a user types the one sentence that actually matters. The boundary is the part of the product that handles those turns, and it is usually the part nobody designed. Treat it as a behavior you build and test, not a disclaimer you paste at the end.

Design the boundary

Refusal, a bounded safe answer, and a handoff to a person are three settings of one behavior, and a real safety design decides which a given moment gets. Get it wrong in either direction, an answer it had no business giving or a reflexive block on a safe question, and the person stops trusting everything around it. The labs write these as small, checkable rules and grade them one at a time.

Then defend it

The boundary is also the part adversaries push on hardest, through jailbreaks that talk the model past its training and injected instructions hidden in the content it reads. No defense here is ever finished, so the honest posture is defense in depth and a standing red-team aimed at the trust moments, with every failure you find becoming a permanent eval.

Read in order

Each piece builds on the one before it.

Sources and names to follow

The primary work this topic is built on, and the people pushing it forward.

Worth following

What makes a conversation actually good.

The questions that do not fit in an eval. What makes a conversation land, and why trust is so hard to measure. New writing, in your inbox.

Subscribe on Substack →

Safety and boundaries

Our take

Design the boundary

Then defend it

Read in order

When not to answer, and when to bring in a person

Jailbreaks are a tax you keep paying

Prompt injection is the attack you cannot fully patch

Red-team the product before your users do

Sources and names to follow

What makes a conversation actually good.