Before you start
Set your upfront decisions
Most teams start personality design with a prompt. It usually reads something like this: "Be warm, concise, helpful, emotionally intelligent, and human. Do not ask too many questions. Do not sound robotic."
That is a reasonable start. It is not a production control system. A personality prompt describes how the agent should sound. A personality harness defines how it should behave, how that behavior shifts by context, how examples steer it, how failures are caught, and how engineering can build and maintain it.
This workshop moves a team from personality as a block of text to personality as an operational artifact. You work through it together and answer in the worksheet as you go, so you finish with something product and engineering can act on.
What you will produce
By the end of the session you will have a marked-up version of your current personality prompt, a one-page voice card, a starter set of conversational moves, a starter moment map, a small set of bad-to-good examples, a guard and check list, a list of open questions, and a first engineering backlog. The aim is not to solve personality in one sitting. It is to make personality observable, testable, and buildable.
Who should be in the room
The session works best with a product owner, an engineer or technical lead, a designer or conversation designer, a domain expert, and a trust or quality owner where it is relevant. Assign three roles before you start. One person facilitates, one person scribes, and one person acts as engineering translator. The translator keeps asking where each decision would live in the system, and whether it is a prompt change, a runtime variable, a retrieved example, a classifier, a critic, an eval, a logging requirement, or an unresolved product decision.
What to bring
Bring your current personality prompt and system prompt if you have them, five to ten real or realistic user turns, a few responses that felt wrong, a few that felt right, and any product or architecture constraints you already know. If you do not have transcripts yet, set the transcripts decision to synthetic above and label your imagined turns clearly. Do not pass them off as evidence.
How the session runs
Set your upfront decisions first. They tell the modules what to assume and what you are deferring, and the guidance adjusts to match. Then work through each module in order. Every discussion should produce one of six things: a prompt rule, a moment rule, an example, a check, a backlog item, or an open decision. If it produces none of those, it is taste. Taste is fine for a few minutes, then it becomes an artifact or the discussion moves on.
What this prevents
The session should not end with "make it warmer" or "sound less robotic." Those are not instructions anyone can build. It should end with statements engineering can act on, such as "limit generation to one question per turn," "add emotional disclosure as a moment state," or "log the selected moment, the selected move, the generated response, and any guard failures." That is the shift from a personality prompt to a personality harness.
Set the frame
Open by naming the goal. You are not here to write the perfect personality prompt. You are here to break personality into pieces product and engineering can use.
Agree on one rule for the session. No adjective is finished until it is translated into observable behavior. "Warm" is not finished, but "reflect before advising in emotional disclosure" is. "Concise" is not finished, but "one to three sentences by default" is. "Curious" is not finished, but "ask one narrow question only when information is missing" is.
Your worksheet
Decompose the current prompt
Paste your current personality prompt below. Tag each instruction as one of these: global voice, surface style, turn-level behavior, moment-specific behavior, safety or policy guard, product goal, user preference, example, architecture requirement, or unclear.
When an instruction could sit in more than one category, do not debate it. Record it in both, mark it unresolved, and move on. "Do not overwhelm the user" might mean keep responses short, ask fewer questions, slow down during emotional disclosure, or hold back multi-step plans unless asked. That ambiguity is the finding, and it needs a definition before anyone can build it.
For each instruction worth keeping, note its likely category, the observable behavior behind it, where it would live, and any open question.
Your worksheet
Identify known failures
Ask where the agent currently feels wrong, and use real examples wherever you have them. Common patterns include too many questions, stacked questions, over-explaining, advice before understanding, a generic tone, a tone that does not shift with the moment, and pushing when the user is done.
If you have no real failures yet, use predicted ones and label them as assumptions. Do not let an assumed failure get treated as proven.
For each failure, record an example user turn, the bad response pattern, why it fails, and the likely cause: prompt, moment, move, example, architecture, or eval. Rank the top five. They drive the rest of the session.
Your worksheet
Build the global voice card
The voice card defines stable identity. Keep it short. It is not the place for every behavior.
Fill in what the agent is and is not, how users should and should not feel, how it usually sounds and what it should avoid sounding like, the roles it should never play, and the phrases it should never use.
If you stall, work from contrast. What would obviously be wrong? What would make a user lose trust? What would sound like every other chatbot? The anti-voice is often easier to name than the voice, and it is just as useful.
Stable identity belongs in the system or developer prompt. A list of banned generic phrases can become a detector or a rewrite rule. Higher-level principles can become eval criteria once you write them as observable behaviors.
Your worksheet
Translate adjectives into rules
Take the voice card and turn its adjectives into checks. For each one, write what it means in behavior, what it does not mean, a good and a bad response pattern, whether it can be checked automatically, and where the check would live.
"Warm" means acknowledge the user's state before moving on. It does not mean adding reassurance to every response. "Concise" means one to three sentences by default. It does not mean clipped. "Curious" means one narrow question only when information is missing. It does not mean ending every turn with a question.
If you cannot finish the sentence "we would know the agent was being X if we saw it do Y in a transcript," the trait is not ready to build yet.
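Two of the translations above are countable, so a script can check them. The sketch below is one possible way to do that, not a reference implementation. The function names are hypothetical, and the sentence counter is a rough heuristic that will miscount abbreviations such as "e.g." and similar edge cases.

```python
import re


def count_sentences(text: str) -> int:
    """Rough count: each run of ., !, or ? followed by whitespace or the end of the text."""
    return len(re.findall(r"[.!?]+(?:\s|$)", text.strip()))


def count_questions(text: str) -> int:
    """Use question marks as a proxy for questions asked."""
    return text.count("?")


def check_concise(text: str, max_sentences: int = 3) -> bool:
    """'Concise' as defined in the session: one to three sentences by default."""
    return 1 <= count_sentences(text) <= max_sentences


def check_curious(text: str, max_questions: int = 1) -> bool:
    """'Curious' as defined in the session: at most one question per turn.

    This cannot tell whether information was actually missing. That part
    needs a classifier, a critic, or a person.
    """
    return count_questions(text) <= max_questions
```

"Warm" has no equivalent here on purpose. Whether the agent acknowledged the user's state before moving on is a judgment call, so it belongs with a critic or human review rather than a script.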
Your worksheet
Define conversational moves
A model should not just respond. It should make a move. Pick five to eight. Common ones are reflect, clarify, challenge, normalize, summarize, explain, invite action, close, redirect, and escalate.
For each move, define its purpose, when to use it and when not to, a default length, a question policy, and a good and bad example. If you stall, return to the failure list. If the agent asked three questions, the move it should have made was probably reflect. If it gave advice too soon, the move was probably normalize. If it kept going, the move was probably close.
Move labels can become classifier labels, the selected move can be injected into the generation prompt, and move fit can become an eval dimension.
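As a minimal sketch of what "injected into the generation prompt" could look like, the code below stores a few moves as data and renders the selected one as an instruction string. Every name, field, and default value here is a hypothetical placeholder. Your own move definitions from the worksheet replace them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Move:
    name: str
    purpose: str
    max_sentences: int      # default length for this move
    question_allowed: bool  # question policy for this move


# Hypothetical starter set; the real values come from the worksheet.
MOVES = {
    "reflect": Move("reflect", "Show the user they were heard before anything else.", 2, False),
    "clarify": Move("clarify", "Get one missing piece of information.", 2, True),
    "close": Move("close", "End the exchange without opening a new thread.", 1, False),
}


def move_instruction(move_name: str) -> str:
    """Render the selected move as text to append to the generation prompt."""
    move = MOVES[move_name]
    question_rule = (
        "Ask at most one question." if move.question_allowed else "Do not ask a question."
    )
    return (
        f"Selected move: {move.name}. Purpose: {move.purpose} "
        f"Use at most {move.max_sentences} sentences. {question_rule}"
    )
```

Keeping moves as data rather than prose means the same labels can be reused by a classifier and by the eval set without retyping them.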
Your worksheet
Build a starter moment map
The same personality should not behave the same way in every moment. Pick five to seven. A workable default is opening, exploration, emotional disclosure, decision or action, and closing, with a safety-sensitive override if you need one.
For each moment, define the user state, the agent goal, the allowed and disallowed moves, the tone and length, the question policy, and what to log or evaluate. Do not overbuild. Five moments you can implement beat ten you will argue about.
Moments can become runtime state, classifier output, or retrieved prompt context. Moment cards can be injected dynamically, and moment fit can be evaluated.
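One way to treat a moment as runtime state is a small card that lists its allowed moves and a fallback. The sketch below assumes a classifier or flow state has already named the moment, and it only checks whether a proposed move is allowed there. The card contents are illustrative assumptions, not recommended settings.

```python
from dataclasses import dataclass
from typing import FrozenSet, Tuple


@dataclass(frozen=True)
class MomentCard:
    name: str
    agent_goal: str
    allowed_moves: FrozenSet[str]
    fallback_move: str
    question_allowed: bool


# Hypothetical cards; fill these in from the moment map worksheet.
MOMENTS = {
    "exploration": MomentCard(
        "exploration", "Understand what the user wants.",
        frozenset({"clarify", "reflect", "summarize", "explain"}), "clarify", True,
    ),
    "emotional_disclosure": MomentCard(
        "emotional_disclosure", "Make the user feel heard.",
        frozenset({"reflect", "normalize"}), "reflect", False,
    ),
    "closing": MomentCard(
        "closing", "End cleanly.", frozenset({"close", "summarize"}), "close", False,
    ),
}


def resolve_move(moment_name: str, proposed_move: str) -> Tuple[str, bool]:
    """Return (move_to_use, was_overridden) for the given moment."""
    card = MOMENTS[moment_name]
    if proposed_move in card.allowed_moves:
        return proposed_move, False
    # The proposed move is disallowed here, so fall back and flag it for logging.
    return card.fallback_move, True
```

The override flag is worth logging. A high override rate in one moment suggests either the move selector or the card is wrong.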
Your worksheet
Write bad-to-good examples
Examples are the most important part of the harness. Use the failures you ranked earlier. For each, write the user turn, the moment, the desired move, a bad response, a better response, why the better one works, and a failure tag.
If you stall, write the bad response first. Teams are usually better at naming what feels wrong than at inventing the ideal from scratch. Then change one thing. Remove the second question, cut to two sentences, reflect before advising, make the question concrete, or drop the generic reassurance.
Use realistic, messy user turns rather than polished ones. Examples become prompt examples, retrieved examples, eval cases, or training data later, and the tags become retrieval metadata.
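To show how tags can act as retrieval metadata, here is a minimal in-memory sketch. The record fields follow the list in this module. The two sample records, the tag names, and the `retrieve` function are invented for illustration and are not evidence from real transcripts.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Example:
    user_turn: str
    moment: str
    move: str
    bad_response: str
    better_response: str
    why_better: str
    failure_tag: str


# Synthetic examples, labeled as such.
LIBRARY = [
    Example(
        user_turn="idk i just feel like im failing at everything lately",
        moment="emotional_disclosure",
        move="reflect",
        bad_response="I'm sorry to hear that. What's been going on? Have you talked to anyone?",
        better_response="That sounds heavy, like it has been piling up for a while.",
        why_better="Reflects first and asks nothing.",
        failure_tag="question_stacking",
    ),
    Example(
        user_turn="ok thanks thats all",
        moment="closing",
        move="close",
        bad_response="Glad to help! Is there anything else you'd like to explore today?",
        better_response="Take care.",
        why_better="Lets the user leave.",
        failure_tag="pushing_when_done",
    ),
]


def retrieve(moment: str, failure_tag: Optional[str] = None, limit: int = 2) -> List[Example]:
    """Return up to `limit` examples for a moment, optionally filtered by failure tag."""
    matches = [
        ex for ex in LIBRARY
        if ex.moment == moment and (failure_tag is None or ex.failure_tag == failure_tag)
    ]
    return matches[:limit]
```

A production system would more likely store these in a database or vector index, but the same fields and tags carry over.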
Your worksheet
Convert failures into checks
Take the failure list and decide how each one should be caught. Use a deterministic check for countable patterns, such as more than one question mark or a banned phrase. Use a classifier for categorization, such as the current moment or whether content is safety-sensitive. Use a model-based critic for judgment, such as advice given too soon or a tone that is too clinical. Use human review for anything subjective or early.
A simple test. Could a script catch it? Use a deterministic check. Could a small model label it? Use a classifier. Does it need judgment? Use a critic or a person. Do you not know yet? Put it in the open questions.
For each failure, record the detection method, the acceptance criterion, an example pass and fail, and where it is logged.
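The routing described in this module can be written down as a small registry. The sketch below is an assumption about shape, not a prescribed design. Deterministic checks carry a function, and everything else is recorded with the method it still needs. The banned phrases and failure names are placeholders.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional

# Placeholder list; use the phrases from your own voice card.
BANNED_PHRASES = ("i'm here for you", "i understand how you feel")


def at_most_one_question(response: str) -> bool:
    return response.count("?") <= 1


def no_banned_phrase(response: str) -> bool:
    lowered = response.lower()
    return not any(phrase in lowered for phrase in BANNED_PHRASES)


@dataclass(frozen=True)
class Check:
    failure: str
    method: str  # "deterministic", "classifier", "critic", or "human"
    run: Optional[Callable[[str], bool]] = None  # returns True on pass


CHECKS = [
    Check("question stacking", "deterministic", at_most_one_question),
    Check("generic support language", "deterministic", no_banned_phrase),
    Check("advice too soon", "critic"),
    Check("pushing when the user is done", "human"),
]


def evaluate(response: str) -> Dict[str, str]:
    """Run every scriptable check and mark the rest as pending their method."""
    results = {}
    for check in CHECKS:
        if check.run is None:
            results[check.failure] = f"needs {check.method}"
        else:
            results[check.failure] = "pass" if check.run(response) else "fail"
    return results
```

The returned dictionary is the kind of thing you would write to the log for each turn, so that failures without an automated check stay visible instead of silently passing.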
Your worksheet
Build the engineering handoff
Do not end on discussion. End on a handoff table. For each artifact, capture the product decision behind it, the system behavior it requires, where it lives in the architecture, the backlog item, the acceptance criterion, the owner, and the open question.
A worked example. The artifact is the emotional disclosure moment card. The product decision is to reflect before advising when a user shares something vulnerable. The system behavior is questions off by default in that moment, with advice gated behind a user request or a prior reflective turn. The architecture location is the moment classifier, dynamic prompt context, generation rules, and eval set. The acceptance criterion is that across twenty emotional disclosure eval cases the agent reflects before advising in at least eighteen and asks no more than one question in all of them.
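The acceptance criterion in the worked example has two parts, a threshold and an all-cases rule, and it helps to see them kept separate. This is a sketch under the assumption that each eval case has already been labeled, by a critic or a person, for whether the agent reflected before advising. The `EvalResult` type and its field names are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class EvalResult:
    case_id: str
    reflected_before_advising: bool  # labeled upstream by a critic or reviewer
    question_count: int


def meets_criterion(
    results: List[EvalResult], min_reflect: int = 18, max_questions: int = 1
) -> bool:
    """True when enough cases reflect first and no case exceeds the question limit."""
    reflect_count = sum(r.reflected_before_advising for r in results)
    questions_ok = all(r.question_count <= max_questions for r in results)
    return reflect_count >= min_reflect and questions_ok
```

With twenty cases, eighteen reflective responses pass, but a single response with two questions fails the whole run, because that half of the criterion applies to every case.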
Your worksheet
Reference and deeper notes
Common failure paths
These are the ways the session tends to go wrong, and how to keep it on track.
The team argues about adjectives
When "warm" or "human" means different things to different people, move straight to transcript behavior. Ask what you would see in a response that proves the trait, and what you would see if it failed. Record the trait, its behavioral definition, a good and bad pattern, any unresolved disagreement, and the backlog implication.
The team has no real transcripts
Use synthetic turns, but label them as assumptions and add a backlog item to replace them with real ones. For each, record the assumption being tested, why the case matters, what real evidence is needed, and how soon it should be replaced.
The team overbuilds the graph
Too many moments, too many branches, no clear path to implement. Collapse to five starter moments. Only split one when repeated transcript failures prove it is too broad. Record the proposed moment, the reason for not adding it yet, the evidence you would need, and a review date.
Engineering cannot implement the artifact
A polished personality document where no one knows where the rules live. Force every rule into an architecture location: system prompt, developer prompt, runtime context, conversation state, retrieved examples, classifier, critic, rewrite pass, eval set, logging, or human review. Record the rule, its location, the data it needs, an owner, and an acceptance criterion.
The team confuses safety with personality
Tone preferences get mixed with safety rules, and personality starts to weaken escalation. Separate the two. Safety overrides personality. Record the safety rule, the personality rule it overrides, the trigger, the required behavior, the escalation path, and an eval case.
The team personalizes too early
Different personalities per user, inferred from a few turns, drifting toward inconsistent or intrusive behavior. Start with explicit preference adaptation only, such as "keep it short" or "be direct." Record the preference, how a user expresses it, the allowed and disallowed adaptation, where it is stored, and how the user can change it.
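Explicit preference adaptation can start as plain pattern matching on what the user actually said. The sketch below handles only a length preference and stores it as default, short, or detailed. The patterns and function names are illustrative assumptions, and a real list would come from how your users phrase these requests.

```python
import re
from typing import Optional

# Hypothetical phrasings; extend from real transcripts.
SHORT_PATTERNS = (r"\bkeep it short\b", r"\bshorter\b", r"\bbe brief\b")
DETAILED_PATTERNS = (r"\bmore detail\b", r"\bgo deeper\b")


def detect_length_preference(user_turn: str) -> Optional[str]:
    """Return 'short' or 'detailed' only when the user states it explicitly."""
    text = user_turn.lower()
    if any(re.search(pattern, text) for pattern in SHORT_PATTERNS):
        return "short"
    if any(re.search(pattern, text) for pattern in DETAILED_PATTERNS):
        return "detailed"
    return None  # nothing explicit, so nothing is inferred


def update_preference(current: str, user_turn: str) -> str:
    """Keep the stored preference unless the user explicitly changes it."""
    detected = detect_length_preference(user_turn)
    return detected if detected is not None else current
```

Returning `None` when nothing explicit is said is the point of the design. The stored value changes only on a statement the user could recognize as their own request.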
The team writes examples that are too polished
Examples that read well in isolation but do not match real turns, with the bad examples missing. Write the bad example first, use messy user turns, and build contrast rather than ideals. Record the messy turn, the likely bad response, the better response, the rule it demonstrates, and the failure tag.
The team cannot agree on good
Separate the decision from the evidence. Some calls are product taste, some are user evidence questions, and some are safety constraints. Record the disagreement, the decision needed, the evidence needed, a temporary default, the owner, and a date to revisit.
The team wants to solve everything with fine-tuning
When prompt failures turn into "we need training," stop. Training comes after you have examples, labels, evals, and a stable definition of quality. Record the behavior prompting could not control, the examples available, the eval coverage, the reason training might be needed later, and the current non-training mitigation.
The team ignores latency and cost
A harness that assumes classifiers, retrieval, critics, and rewrites everywhere will be too slow or expensive. Mark each control as alpha, beta, or later. Use deterministic checks first, and human review before automated layers when volume is low. Record each control layer, its latency and cost impact, whether it is needed for alpha, and the fallback if you do not build it.
The implementation brief
The session should hand engineering a brief with nine parts.
The global prompt changes to add, remove, or rewrite.
The dynamic context to pass into generation, such as the current moment, the selected move, a user preference, whether a question is allowed, a length target, relevant examples, and a recent summary.
The moment states to detect, each with a definition, trigger, allowed moves, question policy, and fallback.
The move labels, each with a purpose, default shape, question policy, and examples.
The example library schema, with fields for the user turn, moment, move, bad and better response, why better, failure tag, tone and length tags, source, and version.
The guard checks, each with a detection method, pass and fail conditions, and an action on fail.
The eval set, each case with an expected moment, move, question count, length, and the failure it tests.
The logging requirements, including the moment, move, prompt version, model, retrieved examples, question and sentence counts, guard failures, any rewrite, the final response, and a human rating.
The open decisions, each with the reason it is unresolved, a temporary assumption, an owner, the evidence needed, and a next review.
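The logging requirements in the brief translate fairly directly into a record type. The sketch below is one possible shape, written as one JSON line per turn. The field names are hypothetical renderings of the items listed in the brief, and the storage format is an assumption.

```python
import json
from dataclasses import asdict, dataclass
from typing import List, Optional


@dataclass
class TurnLog:
    moment: str
    move: str
    prompt_version: str
    model: str
    retrieved_example_ids: List[str]
    question_count: int
    sentence_count: int
    guard_failures: List[str]
    rewritten: bool
    final_response: str
    human_rating: Optional[int] = None  # filled in later, if a person reviews the turn


def to_json_line(log: TurnLog) -> str:
    """Serialize one turn as a single JSON line with stable key order."""
    return json.dumps(asdict(log), sort_keys=True)
```

A usage example, with made-up values:

```python
entry = TurnLog(
    moment="emotional_disclosure", move="reflect", prompt_version="v3",
    model="example-model", retrieved_example_ids=["ex-012"],
    question_count=0, sentence_count=2, guard_failures=[],
    rewritten=False, final_response="That sounds heavy.",
)
print(to_json_line(entry))
```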
Backlog translation examples
Question stacking becomes a question_count check, a compound_question check, a question_allowed flag in the generation context, a rewrite pass when the policy is violated, and eval cases for the failure. Acceptance criterion: no more than one question and no compound questions in at least ninety-five percent of the question discipline eval set.
Advice too soon becomes an emotional disclosure moment card with moment-specific allowed moves, a logged selected move, and eval cases for advice overreach. Acceptance criterion: the first response reflects or normalizes before advising in at least ninety percent of cases.
Generic support language becomes a banned phrase list, a phrase detector, a rewrite instruction for generic openings, and canonical grounded examples. Acceptance criterion: banned phrases appear in under two percent of test responses.
A request for shorter answers becomes a response_length_preference field, detected from explicit statements, stored as default, short, or detailed, injected into generation, with short-mode examples and adherence evals. Acceptance criterion: when the preference is short, responses stay within the length target in at least ninety percent of cases.
A tone that never shifts becomes defined moment labels, a lightweight moment classifier or explicit flow state, an injected moment card, logged moment selection, and moment-fit evals. Acceptance criterion: the selected moment matches the expected moment in at least eighty-five percent of cases.
Closing note
A personality prompt is easy to write and hard to operate. A personality harness is harder to write and easier to improve. The point of the session is not to make the system more complex. It is to make the team more precise. When personality is a prompt, teams argue about tone. When personality is a harness, they can inspect behavior, assign ownership, write tickets, test regressions, and improve the system over time.