Evals & quality · July 1, 2026 · 7 min read

Make the answer checkable

In a guidance product, a fluent answer that sounds right is not the same as one you can trace to a source. Close that gap on purpose.

Ask a large model to write a short biography of a real but not famous person and watch what comes back. It is fluent, confident, and partly made up. When researchers built FActScore, a way to break a generation into its individual factual claims and count the share a reliable source supports, they measured exactly this. On biographies, InstructGPT scored 42 percent supported and ChatGPT 58 percent, while a retrieval-augmented system that looked things up first scored 71 percent (Min et al., EMNLP 2023). The prose was equally fluent in all three; the share of it that was true was not. In a coaching or care product, the 42 percent version does not read as broken. It reads as sure of itself, which is worse.

The problem is not that the model is dumb. Sounding right and being right come from two different processes, and only one of them is on by default.

Fluent and grounded are not the same thing

A base language model is trained to continue text plausibly. It has no separate step that checks a claim against a source, so when it does not know something it fills the gap with the most plausible-looking words rather than flagging that it is unsure. Lilian Weng, surveying the research, splits this into two failures worth naming (Extrinsic Hallucinations in LLMs, 2024). In-context hallucination is when the answer contradicts the source material you handed the model. Extrinsic hallucination is when the answer is not grounded in anything the model was trained on, so there is no source in front of you to check it against at all. Guidance products hit both, and the second is the dangerous one, because there is nothing on the page to argue with.

So the target is not a smarter model. It is an answer a reader, or a grader, can trace back to a source. The formal name for that property is attribution. A claim is attributable when it can be verified against an independent, provided source. That is the framework a Google team laid out under the name Attributable to Identified Sources (Rashkin et al., Computational Linguistics 2023). Make the answer checkable and you convert a vibe into something you can measure. Leave it fluent and unchecked and you are shipping confidence with no way to know when it is earned.

Retrieval grounds the claim, a faithfulness check scores it

Grounding an answer is two moves, and it helps to keep them separate. First you give the model the source. Retrieval pulls the passages that bear on the question and puts them in front of the model, so the answer is written from real text instead of from whatever the weights half-remember. That is why the retrieval-augmented system in the FActScore study scored 71 percent against ChatGPT's 58 percent. The model was working from a page it could read, not from memory.

Retrieval improves the odds. It does not guarantee the model actually used the passage, and a fluent model will happily write a claim the source never made. So the second move is to score whether the answer is supported by the retrieved context, and this is where the word faithfulness earns its place. Faithfulness measures how much of the answer the source actually backs. The RAGAS framework makes it a number you can watch (Es et al., 2023). A grader, usually another model, breaks the answer into its atomic claims, checks each one against the retrieved passages, and reports the fraction the context supports. All claims supported is a 1.0. Half of them floating free of the source is a 0.5. This is a different question from whether the answer is correct in the abstract. An answer can be true and still unfaithful, right by luck rather than grounded in the source you gave it.
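The arithmetic behind that number is small enough to sketch. The Python below is a minimal illustration, not the RAGAS implementation. It assumes the answer has already been split into atomic claims, and it takes the grader as a plain function. The names faithfulness, is_supported, and toy_judge are hypothetical, and toy_judge is a substring match standing in for the model grader a real pipeline would call.

```python
from typing import Callable, Sequence


def faithfulness(
    claims: Sequence[str],
    passages: Sequence[str],
    is_supported: Callable[[str, Sequence[str]], bool],
) -> float:
    """Return the fraction of claims that the retrieved passages support."""
    if not claims:
        raise ValueError("no claims to score")
    supported = sum(1 for claim in claims if is_supported(claim, passages))
    return supported / len(claims)


def toy_judge(claim: str, passages: Sequence[str]) -> bool:
    """Illustrative stand-in for a model grader: case-insensitive substring match."""
    return any(claim.lower() in passage.lower() for passage in passages)


# Hypothetical retrieved context and two claims extracted from an answer.
passages = ["The clinic opens at 9 am on weekdays. Appointments last 30 minutes."]
claims = [
    "the clinic opens at 9 am on weekdays",  # backed by the passage
    "walk-ins are always accepted",          # floating free of the source
]

score = faithfulness(claims, passages, toy_judge)
print(score)  # 0.5
```

Keeping the grader as a parameter is deliberate. The score is only as good as the judgment behind is_supported, so that is the piece you swap and validate, while the counting stays the same.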

When you have no source to retrieve against, there is a second family of checks that works on the model's own uncertainty. Semantic entropy samples several answers to the same question, clusters them by meaning rather than wording, and measures how much the meanings disagree (Farquhar et al., Nature 2024). A model that fabricates tends to give answers that scatter across incompatible meanings, and high entropy across those clusters is the tell. It flags the confident guess that has nothing under it, which is exactly the extrinsic case retrieval cannot reach.
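A rough sketch of the idea, under loud assumptions. This is the simple count-based variant, where each cluster's probability is its share of the sampled answers. It is not the authors' code. The same_meaning function is a placeholder for whatever meaning-equivalence check you use, and toy_same_meaning below only compares the four-digit years it finds, which is enough for the made-up birth-year samples here and nothing else.

```python
import math
import re
from typing import Callable, List, Sequence


def cluster_by_meaning(
    answers: Sequence[str], same_meaning: Callable[[str, str], bool]
) -> List[List[str]]:
    """Greedy clustering: an answer joins the first cluster whose first member matches it."""
    clusters: List[List[str]] = []
    for answer in answers:
        for cluster in clusters:
            if same_meaning(cluster[0], answer):
                cluster.append(answer)
                break
        else:
            clusters.append([answer])
    return clusters


def semantic_entropy(
    answers: Sequence[str], same_meaning: Callable[[str, str], bool]
) -> float:
    """Entropy (in nats) over meaning clusters, using cluster frequency as probability."""
    if not answers:
        raise ValueError("need at least one sampled answer")
    clusters = cluster_by_meaning(answers, same_meaning)
    total = len(answers)
    return sum((len(c) / total) * math.log(total / len(c)) for c in clusters)


def toy_same_meaning(a: str, b: str) -> bool:
    """Illustrative stand-in for an equivalence model: same set of four-digit years."""
    return set(re.findall(r"\d{4}", a)) == set(re.findall(r"\d{4}", b))


# Hypothetical samples for the same question, drawn several times.
consistent = ["Born in 1962.", "1962", "She was born in 1962."]
scattered = ["Born in 1962.", "She was born in 1971.", "1958", "1962"]

low = semantic_entropy(consistent, toy_same_meaning)
high = semantic_entropy(scattered, toy_same_meaning)
print(low < high)  # True
```

The consistent samples collapse into one cluster and score zero. The scattered ones split three ways and score higher, which is the signal you would threshold on.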

Ground the generation, then grade the grounding

The concrete practice is a pair, not a single knob. Ground the generation, then run a faithfulness eval on the result, and treat the second as seriously as any other check that gates a release.

On the generation side, retrieve first and make the model cite what it used. You can build this by hand, or use a citation feature that does the plumbing. Anthropic's Citations chunks the source documents into sentences, passes them with the query, and returns the answer with pointers to the exact sentences each claim rests on (Anthropic, 2025). Because the pointers are extracted from the provided text rather than written by the model, they cannot point at a document that was never supplied. In Anthropic's internal testing that built-in approach raised recall accuracy by up to 15 percent over asking the model to cite sources through the prompt. A cited answer is a checkable answer, both for the reader and for the grader that comes next.
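If you build the citation step by hand, the minimum check is that every quoted span really appears in a document you supplied. The sketch below assumes a made-up shape for a cited answer. Citation, CitedClaim, and unverifiable_claims are hypothetical names for illustration, not any vendor's response schema.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class Citation:
    document_index: int  # which supplied document the quote claims to come from
    quote: str           # the exact span the model says it relied on


@dataclass(frozen=True)
class CitedClaim:
    text: str
    citations: Sequence[Citation]


def unverifiable_claims(
    claims: Sequence[CitedClaim], documents: Sequence[str]
) -> List[str]:
    """Return the text of claims with no citation, or with a quote not found verbatim."""
    flagged: List[str] = []
    for claim in claims:
        traceable = bool(claim.citations) and all(
            0 <= c.document_index < len(documents)
            and c.quote in documents[c.document_index]
            for c in claim.citations
        )
        if not traceable:
            flagged.append(claim.text)
    return flagged


# Hypothetical source document and a three-claim answer.
documents = ["Refunds are available within 30 days of purchase with a receipt."]
answer = [
    CitedClaim("You can get a refund for 30 days.",
               [Citation(0, "within 30 days of purchase")]),
    CitedClaim("No receipt is needed.", []),                       # uncited
    CitedClaim("Refunds take a week.", [Citation(0, "within 7 days")]),  # quote absent
]

flagged = unverifiable_claims(answer, documents)
print(flagged)  # ['No receipt is needed.', 'Refunds take a week.']
```

This only proves the quote exists in the source. Whether the quote actually supports the claim is the faithfulness question, and it still needs a grader.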

On the eval side, make faithfulness a metric you regress on, the same way you already gate on the rest of the suite. Score every answer for the share of its claims the retrieved context supports, watch that number release over release, and block a change that drops it. This is a different measurement from the ones the what-to-measure briefing covers (process adherence and the trust moments), so run it alongside them rather than folding it in. It also leans on a model to do the grading, so validate that grader against human judgment before you trust its faithfulness scores. That is the discipline the validate-the-judge briefing is built around.
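A release gate on that number can be very plain. The sketch below assumes you already have per-answer faithfulness scores for the current baseline and for the candidate change, both on the same eval set. The function name and the 0.02 tolerance are illustrative assumptions, not a recommended threshold.

```python
from statistics import mean
from typing import Sequence


def passes_faithfulness_gate(
    baseline_scores: Sequence[float],
    candidate_scores: Sequence[float],
    max_drop: float = 0.02,  # assumed tolerance; tune it to your eval set's noise
) -> bool:
    """Return True when mean faithfulness has not dropped by more than max_drop."""
    if not baseline_scores or not candidate_scores:
        raise ValueError("both runs need at least one scored answer")
    return mean(candidate_scores) >= mean(baseline_scores) - max_drop


# Hypothetical per-answer scores from three eval runs.
baseline = [1.0, 0.8, 0.9]
regressed = [0.9, 0.7, 0.8]
improved = [1.0, 0.8, 0.95]

print(passes_faithfulness_gate(baseline, regressed))  # False
print(passes_faithfulness_gate(baseline, improved))   # True
```

A mean hides which answers got worse. In practice you would likely also inspect the individual answers whose score fell, but the blocking rule itself can stay this simple.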

Grounding buys trust, and it costs you

Retrieval is not free and it is not a fix on its own. Every answer now waits on a search before the model can start, which adds latency and a second system that can fail. Worse, retrieval can fetch the wrong passage. If the top results are topically on point but factually thin, or if two retrieved documents disagree, the model grounds its answer in the wrong thing and sounds just as sure as before. Long context makes this sharper, because a model attends unevenly to what you hand it, leaning on the start and the end and losing the passage buried in the middle (Liu et al., TACL 2024). The right source can be in the context window and still get ignored.

Grounding also cannot rescue a bad source. Feed the model an outdated policy or a wrong document and a faithful answer will faithfully repeat the error, scoring a clean 1.0 on faithfulness while being flatly wrong for the user. Faithfulness measures whether the answer matches the source, not whether the source is any good, and it is worth keeping those two apart in your head.

The hard evidence that grounding is not a guarantee comes from a domain where the vendors promised it was. A Stanford study of RAG-based legal research tools, some of them marketed as avoiding hallucination, found hallucination rates ranging from 17 percent for Lexis+ AI to 33 percent for Westlaw's assistant (Magesh, Surani et al., Journal of Empirical Legal Studies 2025). Retrieval helped: general-purpose GPT-4 hallucinated 43 percent of the time in the same test. But grounding cut the rate rather than closing it.

Ship the trace alongside the answer

The rule is short. In a high-trust product, an answer the user cannot trace to a source is a liability wearing the costume of an answer. So retrieve the source, make the model cite the span it used, and score every answer for how much of it the source actually backs. That gives the reader something to check and gives you a number, faithfulness, to move release over release. Add it to the suite and it compounds like the rest of your durable eval advantage, one more check a competitor with the same model does not have.

Where does this stop being enough? Faithfulness tells you the answer matches the retrieved source. It says nothing about whether the source was right, and nobody has a clean, cheap way to grade the quality of the source at answer time. That is the open edge. The best products treat the traceable answer as the floor, not the finish, and keep a human in the loop on the turns where a wrong source would cost the most.

Sources and further reading

Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. Lewis et al., Facebook AI Research, NeurIPS 2020
Measuring Attribution in Natural Language Generation Models. Rashkin et al., Google, Computational Linguistics 2023
FActScore: Fine-grained Atomic Evaluation of Factual Precision in Long Form Text Generation. Min et al., EMNLP 2023
RAGAS: Automated Evaluation of Retrieval Augmented Generation. Es et al., 2023
Detecting hallucinations in large language models using semantic entropy. Farquhar, Kossen, Kuhn, Gal, University of Oxford, Nature 2024
Extrinsic Hallucinations in LLMs. Lilian Weng, 2024
Introducing Citations on the Anthropic API. Anthropic, 2025
Lost in the Middle: How Language Models Use Long Contexts. Liu et al., Stanford, TACL 2024
Hallucination-Free? Assessing the Reliability of Leading AI Legal Research Tools. Magesh, Surani et al., Stanford RegLab, Journal of Empirical Legal Studies 2025

