A team has a text coach that behaves. It picks the right move, holds its tone, keeps its questions to one at a time. They add a voice and expect the same product with sound. Then the calls come in. The agent talks over a user who paused to think. It sits on a silence that has clearly ended and answers a beat too late, so the person starts repeating themselves into the gap. When the user tries to cut in, the agent reads its paragraph to the end anyway. None of that is in the words. The transcript reads fine. The call does not.
A voice agent is a pipeline, not a mode
Adding voice does not add a setting to your agent. It adds two machines the text product never had. The first is a real-time pipeline. The second is a layer that decides who is holding the floor. Both are places character now lives, and both break in ways your text evals cannot see.
Start with the pipeline. In the common architecture a spoken turn runs through three stages in order. Speech to text (STT) transcribes the audio into words. The model reads those words and generates a reply. Text to speech (TTS) turns that reply back into audio. LiveKit's writeup on voice agent architecture lays out this STT to LLM to TTS chain as the default shape, and the ElevenLabs conversational AI docs describe the same loop in their hosted stack. Your text harness governs the middle box. The two boxes on either side are new, and so is everything between spoken turns.
That in-between is the second machine. Something has to decide when the user has finished a turn, when the agent is allowed to start, and what happens when they collide. Call that turn-taking. Two terms name the hard parts. Endpointing is the judgment that a user's turn is over, so the agent can respond. Barge-in is the user interrupting the agent mid-sentence and the agent yielding. Neither exists in a text product, where a turn ends when someone hits send.
Latency is the constraint you cannot design around
Text has no clock. A user waits two seconds or ten for a reply and reads the same words. Voice has a clock, and it is tight. Stivers and colleagues measured turn transitions across ten languages and found the same shape everywhere: a single peak, with most responses landing within about 200 milliseconds of the end of the question and an overall mode of 0 ms. That timing is not decoration. Templeton and colleagues found that faster response times signal social connection, and that replies under about 250 ms happen too fast for conscious control, which is why a listener reads them as genuine. A person hears a long gap as the agent being slow, confused, or not listening.
So the whole pipeline shares one small budget per turn, and every stage spends from it. LiveKit's latency breakdown puts numbers on the stages. Endpointing and speech recognition (ASR, the STT stage) run roughly 150 to 300 ms, the model's first token is usually the slowest part at 300 to 800 ms, and TTS needs another 100 to 200 ms before the first audio comes out. Run the stages in strict sequence, each waiting for the one before it to finish, and LiveKit reports that the pipeline often lands at two to four seconds of delay, which feels unnatural. The fix is to overlap them. STT streams partial words to the model, the model streams tokens to TTS, and TTS starts speaking before the model has finished generating the reply. With streaming across all three, LiveKit reports end-to-end latency under a second is reachable. The budget is why streaming is not an optimization you add later. It is the architecture.
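To make the budget concrete, here is a small sketch that adds up the stage ranges quoted above. The stage names and the helper are illustrative only, and the figures are the ones the text attributes to LiveKit, not measurements of any particular stack. The sum assumes each stage hands off as soon as it has a first output, which is the streaming case.

```python
# Time-to-first-output ranges per stage in milliseconds, as quoted above.
# Stage names are illustrative labels, not identifiers from any SDK.
STAGE_RANGES_MS = {
    "endpointing_and_asr": (150, 300),
    "llm_first_token": (300, 800),
    "tts_first_audio": (100, 200),
}


def first_audio_range_ms(stages):
    """Sum the best and worst case across stages.

    Assumes each stage starts as soon as the previous one emits its
    first output, so only time-to-first-output counts.
    """
    best = sum(low for low, _ in stages.values())
    worst = sum(high for _, high in stages.values())
    return best, worst


best, worst = first_audio_range_ms(STAGE_RANGES_MS)
print(f"time to first audio: {best} to {worst} ms")
```

On these figures the range is 550 to 1300 ms, so a sub-second turn depends on most stages landing near the low end of their range. The model's first token is the widest range and the obvious place to look first.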
This is also where character sneaks in through the back door. How long the agent waits before it starts talking is a personality decision now, not just a latency number. A coach that jumps in at 200 ms feels eager, maybe pushy. One that waits 900 ms feels calm, or slow. The trait a text harness names in words, warm or patient or direct, is partly set here by a timeout you tune.
Endpointing errors are the ones users feel
The turn-taking layer is where the two worst voice failures live, and they are opposites. Endpoint too early and the agent decides a mid-thought pause is the end of the turn, so it interrupts. Endpoint too late and it sits on a silence the user has clearly finished, so it feels dead and the person repeats themselves into the gap. LiveKit walks through the strategies for this. The simplest is a voice activity detector plus a silence timer: wait for speech to stop, then wait out a fixed silence window. It is easy to build and it is exactly the thing that fails both ways. Set the window long enough to survive a thinking pause and every reply carries that delay before the pipeline even starts. Set it short and the agent talks over anyone who breathes mid-sentence.
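A minimal sketch of that silence-timer endpointer shows both failures on the same audio. It assumes a VAD that emits one speech or no-speech decision per 20 ms frame. The class name, the helper, and the window values are hypothetical and not taken from any SDK.

```python
from dataclasses import dataclass, field


@dataclass
class SilenceEndpointer:
    """Declares end of turn after a fixed run of non-speech frames."""

    silence_window_ms: int = 700  # hypothetical default, tune per product
    frame_ms: int = 20            # duration of one VAD frame
    _heard_speech: bool = field(default=False, init=False)
    _silence_ms: int = field(default=0, init=False)

    def push(self, is_speech: bool) -> bool:
        """Feed one VAD decision; return True when the turn is judged over."""
        if is_speech:
            self._heard_speech = True
            self._silence_ms = 0
            return False
        if not self._heard_speech:
            return False  # silence before the user has said anything
        self._silence_ms += self.frame_ms
        if self._silence_ms >= self.silence_window_ms:
            self._heard_speech = False
            self._silence_ms = 0
            return True
        return False


def endpoint_times_ms(frames, window_ms, frame_ms=20):
    """Return the timestamps (end of frame) at which an endpoint fires."""
    ep = SilenceEndpointer(silence_window_ms=window_ms, frame_ms=frame_ms)
    return [(i + 1) * frame_ms for i, f in enumerate(frames) if ep.push(f)]


# 200 ms of speech, a 500 ms thinking pause, 200 ms more speech, then silence.
frames = [True] * 10 + [False] * 25 + [True] * 10 + [False] * 50

print(endpoint_times_ms(frames, window_ms=300))
print(endpoint_times_ms(frames, window_ms=700))
```

With the 300 ms window the endpointer fires inside the thinking pause, which is the interruption. With the 700 ms window it survives the pause, but the real end of turn is not declared until 700 ms after the user stops, and that delay lands before any pipeline stage has started.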
The move that helps is to stop treating silence as the only signal. A model-based turn detector reads the partial transcript and predicts whether the utterance is semantically complete, so 'I think I want to, um' holds the floor while 'I think I want to switch plans' releases it, even though both have a pause. LiveKit reports that with a confident end-of-turn signal the session can commit sooner and run shorter silence windows, buying back latency without buying interruptions. ElevenLabs ships a proprietary turn-taking model in its 2.0 stack for the same reason, reading cues like filler words to decide whether to wait or speak. This is the lesson the personality harness teaches for text, that reliable behavior comes from an explicit decision in a checkable unit, not one global instruction. Here the unit is the turn boundary, and getting it wrong is audible.
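The idea can be sketched as a silence window that depends on how complete the partial transcript looks. The scoring function below is a toy stand-in that checks the last word against a small list. It is not a real turn-detection model and not any vendor's API, and the threshold and window values are assumptions chosen for illustration.

```python
# Words that suggest the speaker is still mid-thought. Toy list, illustration only.
TRAILING_HOLD_WORDS = {"um", "uh", "and", "but", "so", "to", "the", "a", "because"}


def toy_eot_probability(partial_transcript: str) -> float:
    """Toy stand-in for a learned end-of-turn detector; NOT a real model."""
    words = partial_transcript.lower().replace(",", " ").split()
    if not words:
        return 0.0
    last = words[-1].strip(".?!")
    return 0.1 if last in TRAILING_HOLD_WORDS else 0.9


def silence_window_ms(eot_probability: float,
                      short_ms: int = 250,
                      long_ms: int = 1500,
                      threshold: float = 0.7) -> int:
    """Pick a short window when the turn looks complete, a long one otherwise."""
    return short_ms if eot_probability >= threshold else long_ms


for text in ("I think I want to, um", "I think I want to switch plans"):
    p = toy_eot_probability(text)
    print(f"{text!r}: p={p}, wait {silence_window_ms(p)} ms")
```

In a real system the probability would come from a trained model reading the streaming transcript. The point of the sketch is only the shape of the decision: the transcript sets the window, and the silence timer then runs against it.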
Barge-in is the other half. When the user starts talking mid-sentence, a well-built agent keeps turn detection running during playback, cancels the audio it was speaking, rolls back the interrupted model turn, and listens. Whether the agent yields at all, and how fast, is a character choice. An agent that plows through its scripted paragraph while a distressed user tries to cut in has picked a personality, and it is the wrong one.
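The yield-and-roll-back step can be sketched as a small piece of session state. This version assumes the reply is played sentence by sentence and interprets "rolling back" as keeping only the sentences the user actually heard in the conversation history. The class and method names are hypothetical, and real stacks track playback at a finer grain than whole sentences.

```python
from dataclasses import dataclass, field


@dataclass
class VoiceSession:
    """Tracks what the agent planned to say versus what was actually played."""

    history: list = field(default_factory=list)  # (role, text, interrupted)
    speaking: bool = False
    _planned: list = field(default_factory=list, init=False)
    _spoken: int = field(default=0, init=False)

    def start_reply(self, sentences):
        """Queue a reply for playback."""
        self._planned = list(sentences)
        self._spoken = 0
        self.speaking = True

    def on_sentence_played(self):
        """Called when playback of one sentence completes."""
        self._spoken += 1
        if self._spoken >= len(self._planned):
            self._finish(interrupted=False)

    def on_user_speech(self):
        """Barge-in: stop playback and keep only what the user heard."""
        if self.speaking:
            self._finish(interrupted=True)

    def _finish(self, interrupted: bool):
        heard = " ".join(self._planned[: self._spoken])
        if heard:
            self.history.append(("assistant", heard, interrupted))
        self._planned, self._spoken, self.speaking = [], 0, False


session = VoiceSession()
session.start_reply(["Here is the plan.", "First, sleep.", "Second, hydrate."])
session.on_sentence_played()  # user hears the first sentence
session.on_user_speech()      # user cuts in during the second
print(session.history)
```

Recording only the played text matters for the next model turn. If the full planned reply stayed in the history, the model would assume the user heard advice that was never spoken.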
Why your text evals miss all of this
Everything above is invisible to a text eval, because a text eval reads the words and the failures are not in the words. A transcript cannot show a 1.5 second gap, an interruption, or a barge-in the agent ignored. It shows a clean exchange that felt broken to the person on the call. Grade on transcripts and you can pass every case and ship an agent nobody wants to talk to.
The size of the gap is measurable, and it is large. Sierra built tau-voice, a benchmark that runs the same grounded customer-service tasks over a full-duplex voice channel with realistic audio, so you can compare a voice agent against the text ceiling directly. The result is stark. Where GPT-5 with reasoning scores about 85 percent on the text version, voice agents reach only 31 to 51 percent under clean audio and 26 to 38 percent once you add background noise and telephony compression, so in realistic conditions they hold on to just 30 to 45 percent of the text capability. Same tasks, same class of model, most of the competence lost in the move to voice and the turn-taking it forces.
The practice follows. Own the turn-taking policy on purpose, the way the text harness owns its moves. Decide the endpointing behavior, the interruption behavior, and the wait times, write them down, and tune them per product rather than taking the platform defaults. Budget latency per stage, STT, model, TTS, and the turn boundary, so you know which stage blew the second when a call feels slow. Then evaluate on voice tasks, not just text. Run the eval through the real pipeline with real audio so timing, interruptions, and transcription errors are in scope, because those are the failures a text suite cannot catch.
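One way to put timing in the grade is to score the call from timestamped turns rather than from the transcript. The sketch below assumes the pipeline can log a start and end time for every stretch of user and agent speech. The `Turn` record, the issue labels, and both thresholds are hypothetical and would need tuning per product.

```python
from dataclasses import dataclass


@dataclass
class Turn:
    speaker: str   # "user" or "agent"
    start_ms: int
    end_ms: int


def timing_report(turns, max_gap_ms=1000, yield_within_ms=300):
    """Flag timing failures from turns sorted by start time.

    Returns (label, timestamp_ms, magnitude_ms) tuples.
    """
    issues = []
    for prev, cur in zip(turns, turns[1:]):
        if prev.speaker == "user" and cur.speaker == "agent":
            gap = cur.start_ms - prev.end_ms
            if gap < 0:
                # Agent started while the user was still talking.
                issues.append(("agent_interrupted_user", cur.start_ms, -gap))
            elif gap > max_gap_ms:
                issues.append(("slow_response", cur.start_ms, gap))
        elif prev.speaker == "agent" and cur.speaker == "user":
            # How long the agent kept talking after the user cut in.
            overrun = prev.end_ms - cur.start_ms
            if overrun > yield_within_ms:
                issues.append(("ignored_barge_in", cur.start_ms, overrun))
    return issues


call = [
    Turn("user", 0, 2000),
    Turn("agent", 3500, 9000),    # starts 1.5 s after the user stopped
    Turn("user", 5000, 6000),     # user cuts in, agent keeps reading
    Turn("user", 9500, 11000),
    Turn("agent", 10800, 12000),  # starts before the user has finished
]

for issue in timing_report(call):
    print(issue)
```

A transcript of this call would read as a clean alternation of turns. The report surfaces a slow response, an ignored barge-in, and an agent interruption, which are the three failures described above.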
What voice costs you
The tradeoffs are real and they pull against each other, so name them before you build. Streaming and low latency fight accuracy. The faster you commit to a turn boundary, the more often you cut a user off mid-thought, and the earlier you start the model on a partial transcript, the more you risk answering the wrong thing. Interruption handling adds real complexity. Barge-in means keeping turn detection live during playback, canceling audio cleanly, and rolling back a turn the model already started, all of it state a text agent never manages. And on a hosted voice stack you may not control every stage. If the platform owns the endpointing model or the TTS voice, some of the timing that carries your character is set by a vendor, and your job shifts to picking a stack whose defaults you can live with and configuring what it exposes.
One honest caveat cuts the other way. Sierra tracked the top voice score climbing from about 30 percent in August 2025 to 67 percent by April 2026, crossing the non-reasoning text line. Some of what looks like an architecture problem today is a model that is not good enough yet, and it will get better. But the pipeline and the clock are structural. Even a perfect model still has to decide when the user is done and whether to yield the floor, inside a budget measured in hundreds of milliseconds. That constraint is not going away, so the harness that governs it is worth owning now.
Design the turn, not just the transcript
A text harness governs what the agent says. A voice harness also governs when it says it, whether it lets itself be interrupted, and how long a pause it will sit through, and a lot of the personality a user feels lives in those three things. If you are shipping voice, treat turn-taking as a behavior you spec, tune, and test, the same as any move in the conversation design. Write the endpointing and barge-in policy down. Budget the latency per stage and watch each one. Evaluate through the real pipeline, with real audio, so the timing failures are in the grade and not just on the call. The words were never the whole product. In voice, half the character is in the timing, and the timing is the part your text evals never saw.