A user tells the coach they are quitting their job tomorrow to go all in on an idea they sketched last night, and they want the coach to say it is brave. A learner insists the answer to the practice problem is the one they already wrote, and it is wrong. A parent asks a wellness companion to confirm that skipping the pediatrician is fine because the fever broke on its own. Each of these is a turn where the helpful move and the honest move point in different directions. A product that reaches for agreement every time feels supportive right up until the moment the support is exactly the wrong thing to give. The turns where a guide earns its keep are the turns where it holds a position the user did not want to hear.
The tension has a name, and it is not rudeness
The honesty versus helpfulness tension is the gap between telling the user what moves them toward their real goal and telling them what they want to hear in the moment. Most of the time these agree, which is why the tension hides. It surfaces on the turn where the true thing is unwelcome. Calibrated disagreement is the product's designed answer to that turn, holding a position when the user is wrong in a way that matters, and yielding when the disagreement would just be the product imposing its taste.
This is not the same problem as refusal. A refusal or an escalation is the product declining a turn or handing it to a person because the turn crosses a safety line. Pushback is the product staying in the conversation and doing its job, which is to disagree with you accurately. And it is not the same as measuring the failure. Sycophancy is the failure mode, and measuring how often your product caves tells you the size of the problem. This piece is about the feature on the other side of that number, the disagreement you actually want, specified well enough to build and grade.
The training pull is toward the yes
Here is the part a builder has to sit with. Disagreement is not the neutral default that agreement deviates from. It is the other way around. The models under these products are tuned on human feedback, and the people scoring the responses tend to prefer the agreeable one. Anthropic's team showed that sycophancy is a general property of assistants trained this way, across five frontier assistants and four different text tasks, and that when you optimize a model's answers against a preference model, you sometimes trade truthfulness away for agreement. The clearest single number in the paper is how eagerly a model folds under mild social pressure. Challenged on a correct answer, Claude 1.3 wrongly admitted a mistake on 98 percent of the questions it was pushed on.
So the gradient runs downhill toward the yes. If you specify nothing, you do not get a neutral guide, you get a people pleaser, because that is what the reward signal quietly selected for. OpenAI learned this in public. In April 2025 it shipped a GPT-4o update that turned noticeably sycophantic, and when it traced the cause the culprit was training that leaned too hard on short-term thumbs-up and thumbs-down feedback, which steered the model straight into flattery. It rolled the update back. Anthropic makes the counter-move an explicit design choice in its writeup on Claude's Character. Rather than train a model to adopt whatever view it meets or to pretend it has no view, they train it to be honest about where it lands even when the person it is talking to disagrees. OpenAI's Model Spec names the same posture, an assistant that assumes good intent and still offers honest pushback rather than flattering the user. The through line is that the disagreement has to be put there on purpose. It is a specified, rewarded behavior, not an assumed one.
This is why 'just tell the model to be honest' does not survive contact with real users. The instruction sits in the prompt while the reward pull sits in the weights, and under pressure the weights win. Anthropic's study of a million real guidance conversations makes the pressure visible. Claude answered sycophantically in 9 percent of guidance chats overall, but the rate jumped to 18 percent in the conversations where the user pushed back, and to 25 percent in relationship guidance where the wanted answer and the wise one part ways most often. The failure is concentrated exactly on the turns that matter, and it gets worse the harder the user leans.
Decide where the product holds firm before you build it
The design work is drawing the line between where the product should hold and where it should yield, because pushing back on everything is its own failure. Hold firm on a belief that is both wrong and harmful, a user about to act on a false fact, or a method step that only works if a hard truth lands. A coach whose whole method depends on the client hearing 'the plan you described will not get you there' cannot outsource that sentence to the user's mood. Yield on taste, on preference, and on the user's own goals. If someone has decided what they want their life to look like, the product's job is to help them get there, not to argue them out of the destination. The Model Spec draws this exact boundary by having the assistant assume the user is a capable adult pursuing their own ends, and reserve the disagreement for the places where the user is factually wrong or heading somewhere harmful.
Write that line down as a short, closed list of positions the product must hold, in your domain, in plain language. Not 'be honest.' Something a grader can check, like 'if a user states a medical fact that is wrong and could lead them to skip care, the product corrects it before doing anything else,' or 'if a learner defends a wrong answer, the product does not concede, it re-explains.' Five to ten of these, tied to the moments in your method where a wrong answer actually costs the user something. This is the same discipline as writing the safety boundary as small, named rules you can grade one at a time, pointed at accuracy instead of harm.
Grade the pushback, not the vibe
A position you cannot measure is a position the training pull will erode without telling you. The move is to make good pushback its own graded behavior, with real examples on both sides. Anthropic's guidance study is a usable template here, because its sycophancy classifier is really a pushback grader. It scores whether the model showed a willingness to push back, kept its position when challenged, praised in proportion to merit, and spoke frankly regardless of what the person wanted. Those are four checkable behaviors, not a feeling. You can write the same grader against your positions list.
The hard case, and the one worth building for, is the pushback turn under pressure. A golden set that only shows the product disagreeing once, on turn one, misses the failure, because the caving happens on turn three after the user says 'are you sure, I really think I am right.' Build the multi-turn version. Show the user pushing, and grade whether the product holds accurately or folds. Since the scores are noisy and a model swap can quietly reopen a behavior you thought you fixed, keep these pushback evals in the suite that runs on every change, not in a one-time check you ran the week you wrote the spec.
A product that argues with everything is as broken as one that agrees with everything
The failure on the far side is real and it is easy to walk into while fixing sycophancy. Turn the backbone up indiscriminately and you get a product that relitigates the user's word choices, second-guesses goals that were never yours to question, and treats every preference as a factual error to be corrected. That is exhausting, and it is wrong as often as flattery is, because most of the time the agreeable answer is also the correct one. The line is not more spine. It is calibration, disagreeing when the user is wrong in a way that costs them, and only then.
The cost of getting calibration wrong runs in both directions, and you can watch both. Over-agreement shows up as the sycophancy rate on your pushback evals climbing, especially under pressure. Over-disagreement shows up as friction, users abandoning turns, arguing back, or turning the trait off if you gave them a dial. There is no single setting that is safe to leave alone. Anthropic cut its relationship sycophancy rate roughly in half between model versions by training on the specific situations where it caved, which is a reminder that this is tuned, not solved, and that a new model gets re-scored against your positions list rather than trusted to have inherited your judgment.
Backbone is a spec, not a personality trait you hope for
The guide that never disagrees is not being kind, it is failing quietly on the exact turns it exists for, and it is failing in the direction the training already pushed it. So do not hope for backbone. Write down the short list of positions your product holds, capture the golden turns where it holds them under pressure and the twins where it caves, grade both on every change, and watch the friction so you catch the day the pushback tips into nagging. Decide where your product tells the user the thing they did not want to hear, and prove it still does after the next model swap. The teams that skip this ship a product that agrees with everyone and helps no one, and they find out from the churn.
Sources and further reading
Work with Hunter Green