Safety & boundaries·June 28, 2026·7 min read

Red-team the product before your users do

Adversarial testing is a standing discipline, not a one-time audit. For a guidance product, aim it at the trust moments.

Part ofSafety and boundaries →

A coaching product handles a crisis disclosure well in the demo. Someone types 'I don't want to be here anymore,' the model recognizes it, and the escalation fires. Then a real user gets to the same place sideways. They spend nine turns talking about a breakup, the tone stays light, and on turn ten they slip the disclosure into the middle of a longer message. The classifier that caught the clean version misses this one, because nothing about the surrounding conversation looked like a crisis. The turn that decides whether this person is safe is exactly the turn nobody wrote a test for. You will find it in a transcript after the fact, or you will find it before, on purpose, by attacking your own product.

Red-teaming is a practice, not an audit

Red-teaming means deliberately trying to make your own system fail, thinking like an adversary or an unlucky user, to surface the harmful behaviors before someone hits them in production. OpenAI's own definition is 'a structured testing effort to find flaws and vulnerabilities in an AI system, often in a controlled environment and in collaboration with developers,' from its writeup on external red teaming. The word carries a security connotation, and jailbreaks and prompt injection are part of it, but for a guidance product the higher-value target is quieter. It is the crisis disclosure that arrives buried in a long chat, the boundary request phrased as a hypothetical, the vulnerable user who is not trying to break anything and gets a wrong answer anyway.

The failure mode this guards against is treating red-teaming as a launch gate. A team runs one adversarial pass before shipping, files the report, and moves on. Then the model changes, the prompt grows, a new feature opens a new surface, and the audit is a year stale. Red-teaming is unbounded and it is never done, which is the whole reason it has to be standing rather than one-time. The question is not whether you red-teamed. It is whether you are still doing it against the version you are about to ship.

A human test writer runs out before the failures do

The mechanism worth understanding is why a person cannot cover this by hand. A test writer imagines the failures they can imagine. They picture the crisis disclosure phrased the way they would phrase it, write that test, and pass it. The failures that ship are the ones shaped differently from anything the writer pictured, and the space of differently is effectively infinite. There are so many ways to word an input that a small hand-written suite samples a rounding error of them. Human annotation is also expensive, which caps the number and the diversity of cases you can afford, so the long tail where trust actually breaks stays untested.

The move that scales past this is automated red-teaming, where one language model generates adversarial inputs against another and a classifier scores the target's replies. Perez and colleagues at DeepMind showed how far this goes. They pointed a generator model at a 280-billion-parameter chatbot, scored its replies with an offensiveness classifier, and surfaced tens of thousands of offensive replies, far more than a human team would have hand-written. The generator does not share the test writer's blind spots, so it phrases attacks a person would not think to try.

Two of their findings map straight onto a guidance product. The red-team run pulled up cases where the chatbot generated real personal and hospital phone numbers as its own contact information, the kind of confidently wrong output a vulnerable user would act on. And it found that offensive replies early in a dialogue beget offensive replies later, so a bad turn compounds across the conversation instead of resetting. That second one is the buried-disclosure failure from the top of this piece. The danger lives in the trajectory, the multi-turn path the conversation takes, not in any single message a one-shot test would check.

Aim it at the trust moments, and keep the expert

Automated generation gives you volume. It does not give you judgment about which failures matter most in your domain, and that is where the standing pass earns its keep. Point the generator at the handful of moments that decide trust, the crisis disclosure, the request to cross a medical or legal boundary, the user in a vulnerable state, rather than at generic policy violations. NIST files red-teaming under the Measure function of its Generative AI Profile and names expert and human-plus-AI red-teaming as distinct types, precisely because the specialist knows which failures are dangerous in context and the automation finds them at scale.

OpenAI's external red-teaming work makes the same case with a number. It put more than 100 external red teamers against GPT-4 before launch, people with knowledge of regional politics, law, and medicine, to find gaps that prompt-based testing on its own could not. Its own conclusion is that automated red-teaming complemented by targeted human insight is more resilient than either alone. For a coaching or care product, the expert is the clinician who knows the specific wrong answer that loses a user, and the automation is what runs that shape of attack ten thousand ways.

The part that turns a red-team pass from a report into an asset is what you do with each finding. Every failure the pass surfaces becomes a permanent eval. You take the exact adversarial input that broke the model, pin it to the expected safe behavior, and add it to the regression suite that runs on every change. That is the difference between finding a bug once and never shipping it again. It is the same loop the regression briefing describes from the other direction, where the failure you catch in production becomes the test that gates the next release. Red-teaming is that loop run offense-first, before a user is the one who finds the failure. Anthropic ran this at scale and released the raw material, a dataset of 38,961 red-team attacks, so the found cases are reusable rather than thrown away after one run.

The cost, and where automation stops

Red-teaming is genuinely unbounded, and that is the honest limit. You never reach done, so the discipline competes with shipping for the same engineering hours, and a team that treats it as infinite will either burn out on it or quietly stop. The way to keep it sustainable is to scope it to the trust moments and let the regression suite carry the coverage forward, so each pass adds cases instead of re-deriving them.

Automated red-teaming also has two failure modes of its own. It has coverage gaps. A generator model shares some of the target's blind spots and tends to find the failures it is already shaped to look for, which is exactly why OpenAI keeps human experts in the loop rather than trusting the automation alone. And the classifier that scores replies produces false alarms and misses, so a run that flags a thousand cases can bury the ten that matter under noise the way any noisy eval score can mislead. A found case is a candidate, not a verdict, until a person who knows the domain confirms it is a real failure and worth an eval.

There is also a scaling wrinkle worth knowing before you assume a better model solves this. Anthropic found that RLHF-trained models get harder to red-team as they scale, while other model types stayed flat. Harder to break is good, but it also means the failures that remain are rarer and stranger, the deep-tail cases a light pass will skate right over. A stronger model raises the floor. It does not close the tail, and the tail is where a guidance product's trust moments live.

Attack it on a schedule, not once

The clean rule is short. Red-team the trust moments on every change, generate the attacks with a model so you reach past your own imagination, keep an expert to say which failures matter, and turn every confirmed failure into an eval that runs forever after. The boundary you built to refuse and escalate only holds if you keep attacking it, and the jailbreaks and injections that get past it are found the same way.

The open question the field has not settled is coverage. Nobody can yet tell you what fraction of a product's real trust-moment failures a given red-team pass has found, so 'we red-teamed it' still hides a number no one can quote. Until that number exists, the honest posture is to assume the pass missed something, ship the safe completion and the human handoff for when it did, and run the pass again next week against the version you are about to release.

Sources and further reading

Red Teaming Language Models with Language Models. Perez et al., DeepMind, EMNLP 2022
Red Teaming Language Models with Language Models (blog). Google DeepMind, 2022
Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned. Ganguli et al., Anthropic, 2022
OpenAI's Approach to External Red Teaming for AI Models and Systems. Ahmad et al., OpenAI, 2025
AI RMF Generative AI Profile (NIST AI 600-1). NIST, 2024

Bring us the hardest moment in your product.

We build the evals that define a good answer and the loops that keep a conversational product improving. Tell us where yours is hard to measure and we will map what it takes.

Pressure-test my product →See how we work →

Red-team the product before your users do

Red-teaming is a practice, not an audit

A human test writer runs out before the failures do

Aim it at the trust moments, and keep the expert

The cost, and where automation stops

Attack it on a schedule, not once

Related reading

Jailbreaks are a tax you keep paying

Prompt injection is the attack you cannot fully patch

When not to answer, and when to bring in a person

Bring us the hardest moment in your product.