---
title: "Why 90% of RAG Systems Hallucinate"
slug: "rag-hallucinations"
description: "Most RAG failures come from three predictable places. Here's what they are and how to fix each one."
date: "2026-04-10"
tags: ["RAG", "LLM", "Production"]
readingTime: "6 min"
draft: false
---

# Why 90% of RAG Systems Hallucinate
Most developers blame hallucinations on the model. The model is usually the last thing at fault. In 90% of the RAG systems I've reviewed or built, the failure lives in one of three places — and none of them are the LLM.
## Failure 1 — Chunking by token count, not by meaning
The default advice is to chunk your documents into 512-token windows with 50-token overlap. It's a reasonable starting point and a terrible production strategy.
Token-based chunking splits mid-sentence, mid-table, and mid-argument. When the retriever pulls a chunk that starts at "...therefore the liability cap applies" with no context for what "liability cap" refers to, the LLM fills that gap — with something plausible. That's hallucination by design.
Fix: chunk by semantic boundary. For structured documents, chunk by section heading. For contracts, chunk by clause. For research papers, chunk by paragraph. The rule is: every chunk should be independently answerable — if it needs the chunk before or after it to make sense, the split is in the wrong place.
```python
from langchain_text_splitters import CharacterTextSplitter, MarkdownHeaderTextSplitter

# Bad: arbitrary token window
text_splitter = CharacterTextSplitter(chunk_size=512, chunk_overlap=50)

# Better: split by meaningful boundary
text_splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=[("##", "Section"), ("###", "Subsection")]
)
```
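If you'd rather see the heading-boundary idea without a splitter library, it's small enough to sketch by hand. This is a hypothetical helper, not the LangChain implementation — but it makes the rule concrete: every chunk keeps its heading, so it stays independently answerable.

```python
import re

def chunk_by_heading(markdown: str, level: str = "##") -> list[dict]:
    """Split a markdown document at a given heading level.

    Each chunk carries its own heading, so "the liability cap" never
    arrives in the prompt without the clause that defines it.
    """
    # Zero-width lookahead: split *before* each heading, keeping it in the chunk
    pattern = rf"(?m)^(?={re.escape(level)} )"
    sections = [s for s in re.split(pattern, markdown) if s.strip()]
    chunks = []
    for section in sections:
        heading = section.splitlines()[0].lstrip("# ").strip()
        chunks.append({"section": heading, "text": section.strip()})
    return chunks

doc = """## Liability
The liability cap applies to direct damages only.

## Termination
Either party may terminate with 30 days notice.
"""
for chunk in chunk_by_heading(doc):
    print(chunk["section"])
```

Each resulting chunk opens with its heading, so it survives retrieval out of context.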
## Failure 2 — No reranking step
Embedding models optimize for approximate similarity — they're fast and good enough for retrieval, not for ranking. The top-k chunks returned by cosine similarity are not necessarily the most relevant chunks for your specific query. They're the closest in vector space, which is a different thing.
Without a reranker, you're sending the 5 most geometrically similar chunks into the prompt. With a reranker, you're sending the 5 most contextually relevant ones.
The fix is a two-stage pipeline: retrieve 20 candidates with the embedding model, rerank with a cross-encoder, pass the top 5 to the LLM.
```python
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import CrossEncoderReranker
from langchain_community.cross_encoders import HuggingFaceCrossEncoder

model = HuggingFaceCrossEncoder(model_name="BAAI/bge-reranker-base")
compressor = CrossEncoderReranker(model=model, top_n=5)

# vector_store: whatever LangChain vector store you've already built
retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=vector_store.as_retriever(search_kwargs={"k": 20}),
)
```
This single change typically improves answer faithfulness more than switching embedding models.
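If the shape of the pipeline is the unfamiliar part, here it is with toy scoring functions. Everything below is a placeholder — word-overlap for the embedding stage, phrase containment for the cross-encoder stage — not real models, just the two-stage structure: cheap scoring over the whole corpus, expensive scoring over the shortlist only.

```python
def embed_score(query: str, chunk: str) -> float:
    """Toy stand-in for cosine similarity: fast, approximate."""
    q, c = set(query.lower().split()), set(chunk.lower().split())
    return len(q & c) / max(len(q | c), 1)

def rerank_score(query: str, chunk: str) -> float:
    """Toy stand-in for a cross-encoder: slower, sees both texts together."""
    # A real cross-encoder attends over the query/chunk pair; here we
    # just reward exact phrase containment to mimic contextual relevance.
    return 1.0 if query.lower() in chunk.lower() else embed_score(query, chunk)

def two_stage_retrieve(query, chunks, k_retrieve=20, k_final=5):
    # Stage 1: cheap scoring over everything
    candidates = sorted(chunks, key=lambda c: embed_score(query, c),
                        reverse=True)[:k_retrieve]
    # Stage 2: expensive scoring over the shortlist only
    return sorted(candidates, key=lambda c: rerank_score(query, c),
                  reverse=True)[:k_final]
```

The asymmetry is the point: the expensive scorer only ever sees `k_retrieve` candidates, never the whole corpus.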
## Failure 3 — The prompt doesn't instruct refusal
If your system prompt doesn't explicitly tell the model what to do when the retrieved context doesn't contain the answer, it will guess. Every time.
LLMs default to helpfulness. Without a refusal instruction, the model interprets "I don't have this in context" as "I should synthesize an answer from my training data." That's not RAG — that's a hallucination pipeline with extra retrieval steps.
```python
SYSTEM_PROMPT = """
You are a research assistant. Answer the user's question using ONLY the context
provided below. If the context does not contain sufficient information to answer
the question, respond with: "I don't have enough information in the provided
documents to answer this."

Do not use prior knowledge. Do not speculate. Cite the source for every claim.
"""
```
That's it. Three fixes. They're not glamorous, but skipping them is the reason tutorial RAG systems pass demos and production RAG systems fail.