Personality & character·June 26, 2026·8 min read

A prompt imitates a personality. A harness governs one.

A system prompt sets the tone in a demo. Holding that character once real users arrive is an engineering problem.

Part ofSteering the personality of a conversational tool →

A team ships a conversational product, and the system prompt is good. Be warm, be concise, sound human, don't stack questions, don't lecture. In the demo the agent is alive. Then real users arrive, and the same agent is warm in one chat and stiff in the next. It asks three things at once right after someone finally opens up. It ends every turn with a coaching question nobody asked for. The team does the natural thing and adds more rules. The prompt gets longer. The behavior does not get more reliable.

What a personality prompt actually controls

A personality prompt does real work. It moves tone, sentence length, and word choice, and it decides whether the thing reads like a coach or a concierge. Choosing that character on purpose, instead of taking the default the training hands you, is the right first move, and a system prompt is how most teams make it. That is why it feels like control. You change a line, the voice changes, and it shows up on the next turn.

The trouble is what a prompt can't do. It can say "be concise," but it can't decide when concision is the wrong call, like when a user needs room. It can say "sound human," which says nothing about which human behavior this moment needs.

That gap is where the product breaks. A prompt describes the behavior you want. A harness is what makes that behavior happen, turn after turn. The fix people reach for, a longer prompt, treats a system problem as a wording problem. What closes the gap is a personality harness, the examples, the moment rules (how the character should shift at each recurring stage of a conversation), the output checks, and the architecture that make an agent's character observable and testable, so you can improve it on purpose. The shift is from what should it sound like to what makes this behavior repeatable when the situation changes.

Character is turn-level judgment

The reason is structural. A conversation runs as a sequence of turns, and each turn changes the problem. The same user is guarded on turn three and ready to act on turn fifteen. A character that behaves the same way through all of that misses the room.

A supportive agent that always validates stops being supportive. A curious agent that always asks a question turns every exchange into an intake form. The trait the prompt names is real, but it has to come out differently depending on where the conversation sits. That decision is the part a prompt can't hold.

The useful move is to make the decision explicit. Before the agent writes anything, it should know what kind of turn this is. Call it a move, the one thing the response is trying to do, whether that's to reflect, to clarify, to challenge, or to close. A model that picks no move tends to attempt all of them at once, which is where the stacked questions and the bloated coaching answers come from. If the move is reflect, the turn may not need a question at all. If the move is close, it should end, not manufacture another prompt to keep the user talking.

This is not new in dialogue systems. DeepMind shaped Sparrow with rules its raters could grade one at a time, a turn and a rule at a time rather than one global instruction. Behavior gets reliable when it's governed in small, checkable units, not described once at the top.

A longer prompt is the wrong fix

When character slips, the reflex is to add a rule. It helps less than it should, and there's a mechanical reason. Liu and colleagues found that models use long inputs unevenly, leaning on what sits at the very start and the very end of the context and losing the material in the middle, a pattern they named lost in the middle. The rule you bury on line 240 to fix last week's failure is sitting in the worst possible spot. Anthropic's own prompting guidance points the same way, telling builders to put long reference material near the top and the instruction near the end, because position changes how much the model attends to a thing.

There's a second reason piling on identity language disappoints. Telling a model who to be does less than it looks. When researchers tested persona prompts across thousands of factual questions, adding a role to the system prompt didn't reliably improve the answers, and sometimes made them worse. So "act like a world-class therapist" is a label, not the judgment you were trying to ship.

Build the harness from the failures you see

A harness is a small set of parts, each one the home for a kind of failure, so that when something breaks you know where to fix it instead of editing the megaprompt and hoping.

Start by turning taste into tests. "Don't be verbose" is not checkable, so it doesn't hold. "Default to one to three sentences" is. "Don't stack questions" is vague. "At most one question mark, and no questions joined by 'and' or 'or'" is something a script can catch. Most of what a team calls voice can be written as a constraint you can observe, which means you can enforce it.

Then make examples carry the weight words can't. "Warm" means little on its own. Five real exchanges showing warmth in a hard moment mean a lot, and a stack of bad-to-good rewrites of stacked questions teaches the pattern better than any rule about them. This is few-shot prompting, steering the model with input and output examples instead of retraining it, and OpenAI's prompting guide recommends showing a range of cases, not one tidy one. Treat the example set as a living asset, not a one-time appendix to the prompt.

When a specific failure won't die, separate the decision from the wording. A small, fast step can label the turn, the user's state, the move, and whether a question is allowed, before the main model writes the reply. Or a critic pass can check the draft against the hard rules and send it back when it stacks questions or runs long. Reach for this only when a failure proves you need it. Anthropic's guidance on agents is blunt that the implementations that work use simple, composable pieces rather than heavy frameworks, and that every added part has to earn its cost in latency and money.

None of it is real until you can measure it. The same discipline a team already uses for correctness applies to character. A repeatable set of test turns paired with the behavior you expect, run on every prompt and model change, turns "the agent feels off" into "it used the wrong move on turn four." That is how personality stops being a vibe review and becomes a test you can run on every change, the asset that survives a model swap.

When not to build it

The failure mode on the other side is building the whole apparatus before you've earned it. A moment map with fifteen states and a taxonomy of twenty moves, drawn up before anyone has read a transcript, is theater. It looks like rigor and predicts nothing. The honest starting point is small. Map the five or six moments where your transcripts actually break and write the moves those moments need, then grow it only when real failures ask for more.

The objection worth taking seriously is the model one. Why build any of this when the next model will hold a character better on its own? A better model is a re-roll, not a guarantee. A version that fixes your tone problem can quietly reopen one you closed months ago, because changing the model changes everything downstream of it. The harness is the artifact that survives the swap. The prompt, the examples, the moment rules, and the evals are yours. They get re-scored against the new model and tell you in an afternoon whether it held the line. Without them you're back to reading transcripts and hoping.

Voice raises the stakes again. A text harness still helps, but it doesn't reach the parts of voice that carry personality, when the agent starts talking and whether it lets itself be interrupted. A voice agent is a pipeline, speech to text, then the model, then text to speech, with a turn-taking layer deciding who holds the floor. Character lives in that timing as much as in the words, so if you own a voice product, know which of those pieces you can configure before you promise a personality on top of them.

What a harness lets you ask

The real difference shows up in the questions a team can ask when something goes wrong. The team stops asking why it feels off and starts asking a chain it can act on. Which moment was this? Which move should it have made? Which rule was missing, or buried too deep to fire? Which example would have caught it? Can we turn it into an eval so it can't ship twice? And where does the fix belong, the global prompt, the moment rule, the examples, the classifier, the critic, or, for voice, the turn-taking layer?

A prompt can imitate a personality. A harness governs one. Teams stuck at the prompt keep answering every failure with more words. Teams with a harness answer with a fix that has an address.

You don't need much to start. A one-page character spec, five or six moment rules, a short list of moves, thirty or so examples (the bad-to-good rewrites teach the most), a checklist of hard guards, and a small eval set will carry a product a long way. Fine-tuning can come later, once you know what good looks like, and training too early just bakes in the behavior you hadn't finished working out. The aim is the smallest system that holds the character steady, and one prompt was never going to be it.

Sources and further reading

Lost in the Middle: How Language Models Use Long Contexts. Liu et al., TACL, 2024
Long context prompting tips. Anthropic, current
When "A Helpful Assistant" Is Not Really Helpful: Personas in System Prompts Do Not Improve Performances of Large Language Models. Zheng et al., Findings of EMNLP, 2024
Playing Pretend: Expert Personas Don't Improve Factual Accuracy. Wharton Generative AI Labs, 2025
Prompt engineering. OpenAI, current
Building Effective Agents. Anthropic, 2024
Improving alignment of dialogue agents via targeted human judgements (Sparrow). Glaese et al., DeepMind, 2022
Conversational AI overview. ElevenLabs, current
Building voice agents. LiveKit, current

Reflective Surfaces

What makes a conversation actually good.

The questions that do not fit in an eval. What makes a conversation land, and why trust is so hard to measure. New writing, in your inbox.

Subscribe on Substack →