Why Your RAG System Hallucinates — and the PM Fixes

RAG was supposed to stop hallucinations. Often it doesn't. Here are the five reasons retrieval-augmented systems still make things up, and the product-level fixes for each — most of which aren't model problems at all.

The pitch for retrieval-augmented generation is that grounding a model in your documents stops it from making things up. Then you ship one and it still hallucinates. Having built RAG systems for enterprise knowledge and research-paper Q&A, I can tell you the surprising part: most RAG hallucinations are not model problems. They are retrieval and product problems the PM can fix.

Reason 1: retrieval missed, the model guessed

The most common cause. The retriever fails to surface the relevant chunk, so the model answers from its parametric memory instead of your documents — and that memory is exactly the unreliable source RAG was meant to replace. The model is doing what you asked; the retrieval failed silently.

Fix: measure retrieval separately from generation. For each test question, track whether the right chunk appeared in the top-k results at all. If retrieval recall is the problem, no prompt tweak will save you; fix chunking, embeddings, or reranking instead.
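A minimal sketch of that measurement, assuming you have a small eval set of (question, ID of the chunk that should be retrieved) pairs. The `fake_retrieve` function, the chunk IDs, and the questions are hypothetical stand-ins for your own retriever and data.

```python
from typing import Callable, List, Sequence, Tuple


def recall_at_k(
    eval_set: Sequence[Tuple[str, str]],
    retrieve: Callable[[str], List[str]],
    k: int = 5,
) -> float:
    """Fraction of questions whose gold chunk ID is in the top-k retrieved IDs."""
    if not eval_set:
        return 0.0
    hits = sum(1 for question, gold_id in eval_set if gold_id in retrieve(question)[:k])
    return hits / len(eval_set)


# Hypothetical retriever output: ranked chunk IDs per question.
_FAKE_RESULTS = {
    "What is the refund window?": ["policy-7", "policy-2", "faq-1"],
    "Who approves travel?": ["faq-4", "faq-9", "hr-3"],
}


def fake_retrieve(question: str) -> List[str]:
    return _FAKE_RESULTS.get(question, [])


# Each entry pairs a question with the chunk a human judged to hold the answer.
EVAL_SET = [
    ("What is the refund window?", "policy-2"),  # gold chunk is ranked second
    ("Who approves travel?", "hr-1"),            # gold chunk never retrieved
]

score = recall_at_k(EVAL_SET, fake_retrieve, k=3)
print(f"recall@3 = {score:.2f}")
```

If this number is low, the generation step never had a chance, and that is where to spend effort first.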

Reason 2: the chunks are wrong-sized

Chunk too small and you sever the context a passage needs to make sense. Chunk too large and the relevant sentence drowns in surrounding text, because a single embedding averages over the whole chunk. Either way, retrieval quality suffers.

Fix: chunk on semantic boundaries, not fixed character counts. Test chunk strategy as a variable with a real retrieval eval set, the same way you would test any product hypothesis.
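One simple way to respect semantic boundaries is to split on paragraphs and pack whole paragraphs into chunks up to a size budget. The sketch below assumes paragraphs are separated by blank lines; the function name and the `max_chars` default are illustrative choices, not recommended values.

```python
import re
from typing import List


def chunk_by_paragraph(text: str, max_chars: int = 800) -> List[str]:
    """Pack whole paragraphs into chunks of at most max_chars characters.

    A paragraph is never split, so a single paragraph longer than
    max_chars becomes its own oversized chunk.
    """
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    chunks: List[str] = []
    current = ""
    for para in paragraphs:
        candidate = f"{current}\n\n{para}" if current else para
        if current and len(candidate) > max_chars:
            # Adding this paragraph would exceed the budget: close the chunk.
            chunks.append(current)
            current = para
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


doc = "Alpha one.\n\nBeta two.\n\nGamma three."
chunks = chunk_by_paragraph(doc, max_chars=22)
print(chunks)
```

Treat `max_chars` and the boundary rule as variables to test: run the same retrieval eval set against each chunking strategy and compare.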

Reason 3: no reranking

Vector similarity is a blunt instrument. The top embedding match is often not the most relevant passage; it is the most superficially similar one. On a decade of Coca-Cola 10-K filings, naive top-k retrieval constantly pulled passages from the wrong year until a reranker fixed the ordering.

Fix: add a reranking step that re-scores the retrieved candidates for actual relevance before they reach the model. It is one of the highest-return additions to a RAG pipeline.
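The shape of a reranking step can be sketched as below. In a real pipeline the scoring function would typically be a cross-encoder or a hosted reranking model; here `overlap_score` is a deliberately crude stand-in so the example runs on its own, and the candidate passages are invented for illustration.

```python
from typing import Callable, List


def rerank(
    query: str,
    candidates: List[str],
    score: Callable[[str, str], float],
    top_n: int = 3,
) -> List[str]:
    """Re-order retrieved candidates by a relevance score and keep the best top_n."""
    ranked = sorted(candidates, key=lambda passage: score(query, passage), reverse=True)
    return ranked[:top_n]


def overlap_score(query: str, passage: str) -> float:
    """Toy scorer: share of query tokens that appear in the passage."""
    query_tokens = set(query.lower().split())
    passage_tokens = set(passage.lower().split())
    if not query_tokens:
        return 0.0
    return len(query_tokens & passage_tokens) / len(query_tokens)


# Invented candidates, listed in the order a vector search might return them.
candidates = [
    "Net revenue in 2015 was flat",
    "Risk factors for 2019 include currency exposure",
    "Net revenue in 2019 rose on pricing",
]

top = rerank("net revenue 2019", candidates, overlap_score, top_n=2)
print(top)
```

Because the scorer is passed in as a function, swapping the toy for a real model does not change the rest of the pipeline, which makes the reranker easy to A/B on a retrieval eval set.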

Reason 4: the question was too big for one retrieval

A question like "how did revenue and risk factors change across these years" is really several questions. One retrieval cannot serve all of them, so the model fills gaps by inventing.

Fix: decompose. A sub-question engine breaks a complex query into parts, retrieves for each, and composes the answer. This single pattern eliminated a large share of hallucinations in my document Q&A builds.
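The control flow of that pattern is small enough to sketch. Everything prefixed `toy_` below is a hypothetical stand-in: in practice the decomposition, per-question answering, and composition steps would each be a model call, and retrieval would hit your index.

```python
from typing import Callable, Dict, List, Tuple


def answer_with_decomposition(
    question: str,
    decompose: Callable[[str], List[str]],
    retrieve: Callable[[str], List[str]],
    answer_one: Callable[[str, List[str]], str],
    compose: Callable[[str, List[Tuple[str, str]]], str],
) -> str:
    """Split a question, retrieve and answer per sub-question, then compose."""
    partials: List[Tuple[str, str]] = []
    for sub_question in decompose(question):
        context = retrieve(sub_question)  # one retrieval per sub-question
        partials.append((sub_question, answer_one(sub_question, context)))
    return compose(question, partials)


# Toy index: placeholder chunks keyed by sub-question.
TOY_INDEX: Dict[str, List[str]] = {
    "How did revenue change?": ["[chunk about revenue]"],
    "How did risk factors change?": ["[chunk about risk factors]"],
}


def toy_decompose(question: str) -> List[str]:
    return list(TOY_INDEX)  # a real system would ask a model to split the question


def toy_retrieve(sub_question: str) -> List[str]:
    return TOY_INDEX.get(sub_question, [])


def toy_answer(sub_question: str, context: List[str]) -> str:
    if not context:
        return "not found in documents"
    return f"answer from {len(context)} chunk(s)"


def toy_compose(question: str, partials: List[Tuple[str, str]]) -> str:
    return "\n".join(f"{sq} -> {ans}" for sq, ans in partials)


result = answer_with_decomposition(
    "How did revenue and risk factors change across these years?",
    toy_decompose, toy_retrieve, toy_answer, toy_compose,
)
print(result)
```

Keeping the per-sub-question answers visible, rather than only the composed result, also makes it easier to see which part of a complex answer had no supporting context.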

Reason 5: the prompt never told the model to abstain

If you never instruct the model that "I don't know" is an acceptable answer, it will not give one. Models default to helpfulness, and helpfulness without grounding is hallucination.

Fix: instruct the model to answer only from the retrieved context and to say it cannot find the answer when the context does not contain it. Then — critically — measure how often it correctly abstains. An abstention the model never makes is a setting you only think you have.
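Both halves of that fix can be sketched: a prompt with an explicit abstention path, and a metric over labeled test questions. The prompt wording, the `ABSTAIN` string, and the metric names are assumptions for illustration, and the sample outputs are invented rather than real model responses.

```python
from typing import Dict, List, Tuple

ABSTAIN = "I cannot find that in the provided documents."


def build_prompt(question: str, chunks: List[str]) -> str:
    """Prompt that restricts the model to the context and names the abstention reply."""
    context = "\n\n".join(chunks) if chunks else "(no context retrieved)"
    return (
        "Answer using only the context below. "
        f"If the context does not contain the answer, reply exactly: {ABSTAIN}\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )


def abstention_metrics(results: List[Tuple[bool, str]]) -> Dict[str, float]:
    """results holds (is_answerable_from_docs, model_output) pairs."""
    unanswerable = [out for answerable, out in results if not answerable]
    answerable = [out for is_answerable, out in results if is_answerable]
    correct = sum(ABSTAIN in out for out in unanswerable)
    wrong = sum(ABSTAIN in out for out in answerable)
    return {
        # How often the model abstains when it should.
        "abstention_recall": correct / len(unanswerable) if unanswerable else 0.0,
        # How often it abstains when the answer was available.
        "false_abstention_rate": wrong / len(answerable) if answerable else 0.0,
    }


# Invented outputs for four labeled test questions.
sample = [
    (False, ABSTAIN),                         # correctly abstained
    (False, "The policy was updated."),       # should have abstained
    (True, "Refunds follow the stated policy."),
    (True, ABSTAIN),                          # abstained despite available answer
]

metrics = abstention_metrics(sample)
print(metrics)
```

Tracking both numbers matters: a model that abstains on everything scores perfectly on the first metric and is useless on the second.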

The PM takeaway

When a RAG system hallucinates, the instinct is "the model is bad" or "let's fine-tune." Usually the fix is upstream and cheaper: better retrieval, smarter chunking, a reranker, query decomposition, and an explicit abstention path — each measurable on a retrieval eval set you own. RAG quality is a pipeline you instrument, not a model you pray to.