Ask a large model to write a short biography of a real but not famous person and watch what comes back. It is fluent, confident, and partly made up. When researchers built FActScore, a way to break a generation into its individual factual claims and count the share a reliable source supports, they measured exactly this. On biographies, InstructGPT scored 42 percent supported and ChatGPT 58 percent, while a retrieval-augmented system that looked things up first scored 71 percent (Min et al., EMNLP 2023). The same fluent prose, a very different amount of it true. In a coaching or care product, the 42 percent version does not read as broken. It reads as sure of itself, which is worse.
The problem is not that the model is dumb. Sounding right and being right come from two different processes, and only one of them is on by default.
Fluent and grounded are not the same thing
A base language model is trained to continue text plausibly. It has no separate step that checks a claim against a source, so when it does not know something it fills the gap with the most likely-looking words rather than a flag that says it is unsure. Lilian Weng, surveying the research, splits this into two failures worth naming (Extrinsic Hallucinations in LLMs, 2024). In-context hallucination is when the answer contradicts the source material you handed the model. Extrinsic hallucination is when the answer is not grounded in anything the model was trained on and there is no source to check it against at all. Guidance products hit both, and the second is the dangerous one, because there is nothing on the page to argue with.
So the target is not a smarter model. It is an answer a reader, or a grader, can trace back to a source. The formal name for that property is attribution. A claim is attributable when it can be verified against an independent, provided source, the framework Google's team laid out and called Attributable to Identified Sources (Rashkin et al., Computational Linguistics 2023). Make the answer checkable and you convert a vibe into something you can measure. Leave it fluent and unchecked and you are shipping confidence with no way to know when it is earned.
Retrieval grounds the claim, a faithfulness check scores it
Grounding an answer is two moves, and it helps to keep them separate. First you give the model the source. Retrieval pulls the passages that bear on the question and puts them in front of the model, so the answer is written from real text instead of from whatever the weights half-remember. That is why the retrieval-augmented system in the FActScore study scored 71 percent against ChatGPT's 58 percent. The model was working from a page it could read, not from memory.
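A minimal sketch of that first move, assuming a toy word-overlap ranker in place of a real search index. The names `retrieve` and `build_prompt` and the prompt wording are illustrative, not taken from any particular library.

```python
import re
from typing import List, Sequence


def tokens(text: str) -> set:
    """Lowercased word set, used only for the toy overlap score below."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(question: str, passages: Sequence[str], k: int = 2) -> List[str]:
    """Return the k passages sharing the most words with the question.

    Assumption: word overlap stands in for a real retriever (BM25, embeddings).
    """
    q = tokens(question)
    ranked = sorted(passages, key=lambda p: len(q & tokens(p)), reverse=True)
    return list(ranked[:k])


def build_prompt(question: str, passages: Sequence[str]) -> str:
    """Put the retrieved passages in front of the model, numbered for citing."""
    numbered = "\n".join(f"[{i}] {p}" for i, p in enumerate(passages, start=1))
    return (
        "Answer using only the sources below and cite the source number "
        "for each claim. If the sources do not contain the answer, say so.\n\n"
        f"Sources:\n{numbered}\n\nQuestion: {question}"
    )


corpus = [
    "The clinic opens at 9am on weekdays.",
    "Parking is free after 6pm.",
    "The clinic is closed on public holidays.",
]
question = "When does the clinic open?"
top = retrieve(question, corpus, k=2)
prompt = build_prompt(question, top)
print(prompt)
```

The prompt only narrows what the model reads. Whether the answer actually stays inside those passages is a separate question.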
Retrieval improves the odds. It does not guarantee the model actually used the passage, and a fluent model will happily write a claim the source never made. So the second move is to score whether the answer is supported by the retrieved context, and this is where the word faithfulness earns its place. Faithfulness measures how much of the answer the source actually backs. The RAGAS framework makes it a number you can watch (Es et al., 2023). A grader, usually another model, breaks the answer into its atomic claims, checks each one against the retrieved passages, and reports the fraction the context supports. All claims supported is a 1.0. Half of them floating free of the source is a 0.5. This is a different question from whether the answer is correct in the abstract. An answer can be true and still unfaithful, right by luck rather than grounded in the source you gave it.
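The arithmetic is simple enough to sketch. This is an illustration of the supported-fraction idea, not the RAGAS implementation: the `faithfulness` function and the `verbatim_grader` stand-in are hypothetical, and in practice the grader would be a model call that has already split the answer into atomic claims.

```python
from typing import Callable, Sequence


def faithfulness(
    claims: Sequence[str],
    passages: Sequence[str],
    is_supported: Callable[[str, Sequence[str]], bool],
) -> float:
    """Fraction of the answer's claims that the retrieved passages support."""
    if not claims:
        raise ValueError("no claims to score")
    supported = sum(1 for claim in claims if is_supported(claim, passages))
    return supported / len(claims)


def verbatim_grader(claim: str, passages: Sequence[str]) -> bool:
    """Toy grader (assumption): supported only if the claim appears verbatim."""
    return any(claim.lower() in p.lower() for p in passages)


passages = ["The clinic opens at 9am. Appointments last 30 minutes."]
claims = ["The clinic opens at 9am", "Appointments last 45 minutes"]

score = faithfulness(claims, passages, verbatim_grader)
print(score)  # 0.5: one claim is backed by the passage, one is not
```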
When you have no source to retrieve against, there is a second family of checks that works on the model's own uncertainty. Semantic entropy samples several answers to the same question, clusters them by meaning rather than wording, and measures how much the meanings disagree (Farquhar et al., Nature 2024). A model that fabricates tends to give answers that scatter across incompatible meanings, and high entropy across those clusters is the tell. It flags the confident guess that has nothing under it, which is exactly the extrinsic case retrieval cannot reach.
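A rough sketch of the idea, under two stated assumptions. First, the `same_meaning` function is supplied by the caller; the toy `same_year` below just compares four-digit years, where the paper uses an entailment check between answers. Second, cluster probabilities are estimated from simple counts of sampled answers rather than from model token probabilities.

```python
import math
import re
from typing import Callable, List, Sequence


def cluster_by_meaning(
    answers: Sequence[str], same_meaning: Callable[[str, str], bool]
) -> List[List[str]]:
    """Greedily group answers; each joins the first cluster it agrees with."""
    clusters: List[List[str]] = []
    for answer in answers:
        for cluster in clusters:
            if same_meaning(answer, cluster[0]):
                cluster.append(answer)
                break
        else:
            clusters.append([answer])
    return clusters


def semantic_entropy(
    answers: Sequence[str], same_meaning: Callable[[str, str], bool]
) -> float:
    """Entropy (in nats) over meaning clusters, using cluster frequencies."""
    clusters = cluster_by_meaning(answers, same_meaning)
    total = len(answers)
    return sum((len(c) / total) * math.log(total / len(c)) for c in clusters)


def same_year(a: str, b: str) -> bool:
    """Toy equivalence check (assumption): same four-digit years mentioned."""
    return re.findall(r"\d{4}", a) == re.findall(r"\d{4}", b)


agreeing = ["Born in 1962.", "She was born in 1962", "1962"]
scattered = ["Born in 1962.", "She was born in 1958", "1971"]

low = semantic_entropy(agreeing, same_year)
high = semantic_entropy(scattered, same_year)
print(low, high)
```

Different wording with the same meaning lands in one cluster and scores zero. Three incompatible answers land in three clusters and score the maximum for three samples, log 3.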
Ground the generation, then grade the grounding
The concrete practice is a pair, not a single knob. Ground the generation, then run a faithfulness eval on the result, and treat the second as seriously as any other check that gates a release.
On the generation side, retrieve first and make the model cite what it used. You can build this by hand, or use a citation feature that does the plumbing. Anthropic's Citations chunks the source documents into sentences, passes them with the query, and returns the answer with pointers to the exact sentences each claim rests on (Anthropic, 2025). Because the pointers are extracted from the provided text rather than written by the model, they cannot point at a document that was never supplied. In Anthropic's internal testing that built-in approach raised recall accuracy by up to 15 percent over asking the model to cite sources through the prompt. A cited answer is a checkable answer, both for the reader and for the grader that comes next.
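If you build citing by hand, the model writes the pointers itself, so they need checking. A minimal sketch, assuming a hypothetical `Citation` record with a document index and a quoted span; this is not the shape of any vendor's API response.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Citation:
    doc_index: int  # which supplied document the claim points at
    quote: str      # the exact span the claim is said to rest on


def invalid_citations(
    citations: Sequence[Citation], documents: Sequence[str]
) -> List[Citation]:
    """Return citations that point outside the supplied documents
    or quote text the cited document does not contain."""
    bad: List[Citation] = []
    for c in citations:
        in_range = 0 <= c.doc_index < len(documents)
        if not in_range or not c.quote or c.quote not in documents[c.doc_index]:
            bad.append(c)
    return bad


documents = [
    "Refunds are issued within 14 days.",
    "Support is open on weekdays.",
]
citations = [
    Citation(0, "within 14 days"),    # present in document 0
    Citation(1, "open on weekends"),  # quote not in document 1
    Citation(2, "anything"),          # document 2 was never supplied
]

bad = invalid_citations(citations, documents)
print([c.quote for c in bad])  # ['open on weekends', 'anything']
```

A verbatim match is a strict test. It catches invented quotes and invented documents, but it does not tell you whether the quoted span really supports the claim, which is what the faithfulness eval is for.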
On the eval side, make faithfulness a metric you regression-test, the same way you already gate on the rest of the suite. Score every answer for the share of its claims the retrieved context supports, watch that number release over release, and block a change that drops it. This is a different measurement from the ones the what-to-measure briefing covers, process adherence and the trust moments, so run it alongside them rather than folding it in. And it leans on a model to do the grading, so validate that grader against human judgment before you trust its faithfulness scores, the discipline the validate-the-judge briefing is built around.
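One way the release gate could look, as a sketch. The function name `gate_on_faithfulness` and the `max_drop` tolerance of 0.02 are assumptions for illustration; the right tolerance depends on how noisy your grader is.

```python
from statistics import fmean
from typing import Sequence, Tuple


def gate_on_faithfulness(
    baseline_scores: Sequence[float],
    candidate_scores: Sequence[float],
    max_drop: float = 0.02,  # assumed tolerance for grader noise
) -> Tuple[bool, float, float]:
    """Pass only if mean faithfulness has not dropped by more than max_drop."""
    baseline = fmean(baseline_scores)
    candidate = fmean(candidate_scores)
    passed = candidate >= baseline - max_drop
    return passed, baseline, candidate


# Per-answer faithfulness on the same eval set, before and after a change.
released = [1.0, 0.75, 1.0, 0.75]
proposed = [1.0, 0.5, 0.75, 0.75]

passed, baseline, candidate = gate_on_faithfulness(released, proposed)
print(passed, baseline, candidate)  # False 0.875 0.75
```

Scoring both versions on the same eval set keeps the comparison about the change, not about which questions happened to be asked.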
Grounding buys trust, and it costs you
Retrieval is not free and it is not a fix on its own. Every answer now waits on a search before the model can start, which adds latency and a second system that can fail. Worse, retrieval can fetch the wrong passage. If the top results are topically on point but factually thin, or if two retrieved documents disagree, the model grounds its answer in the wrong thing and sounds just as sure as before. Long context makes this sharper, because a model attends unevenly to what you hand it, leaning on the start and the end and losing the passage buried in the middle (Liu et al., TACL 2024). The right source can be in the context window and still get ignored.
Grounding also cannot rescue a bad source. Feed the model an outdated policy or a wrong document and a faithful answer will faithfully repeat the error, scoring a clean 1.0 on faithfulness while being flatly wrong for the user. Faithfulness measures whether the answer matches the source, not whether the source is any good, and it is worth keeping those two apart in your head.
The hard evidence that grounding is not a guarantee comes from a domain where the vendors promised it was. A Stanford study of legal research tools that use RAG, marketed in some cases as avoiding hallucination, measured them hallucinating between 17 percent of the time for Lexis+ AI and 33 percent for Westlaw's assistant (Magesh, Surani et al., Journal of Empirical Legal Studies 2025). Retrieval helped: general-purpose GPT-4 hallucinated 43 percent of the time in the same test. But grounding cut the rate rather than eliminating it.
Ship the trace alongside the answer
The rule is short. In a high-trust product, an answer the user cannot trace to a source is a liability wearing the costume of an answer. So retrieve the source, make the model cite the span it used, and score every answer for how much of it the source actually backs. That gives the reader something to check and gives you a number, faithfulness, to move release over release. Add it to the suite and it compounds like the rest of your durable eval advantage, one more check a competitor with the same model does not have.
Where does this stop being enough? Faithfulness tells you the answer matches the retrieved source. It says nothing about whether the source was right, and nobody has a clean, cheap way to grade the quality of the source at answer time. That is the open edge. The best products treat the traceable answer as the floor, not the finish, and keep a human in the loop on the turns where a wrong source would cost the most.