---
title: "Why 90% of RAG Systems Hallucinate"
slug: "rag-hallucinations"
description: "Most RAG failures come from three predictable places. Here's what they are and how to fix each one."
date: "2026-04-10"
tags: ["RAG", "LLM", "Production"]
readingTime: "6 min"
draft: false
---

# Why 90% of RAG Systems Hallucinate

Most developers blame hallucinations on the model. The model is usually the last thing at fault. In 90% of the RAG systems I've reviewed or built, the failure lives in one of three places — and none of them are the LLM.

## Failure 1 — Chunking by token count, not by meaning

The default advice is to chunk your documents into 512-token windows with 50-token overlap. It's a reasonable starting point and a terrible production strategy.

Token-based chunking splits mid-sentence, mid-table, and mid-argument. When the retriever pulls a chunk that starts at "...therefore the liability cap applies" with no context for what "liability cap" refers to, the LLM fills that gap — with something plausible. That's hallucination by design.
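The effect is easy to reproduce. A quick plain-Python illustration, using character windows as a stand-in for token windows (the function and sample text are mine):

```python
def fixed_window_chunks(text, size=512, overlap=50):
    """Split text into fixed-size windows, ignoring sentence boundaries."""
    step = size - overlap
    return [text[start:start + size] for start in range(0, len(text), step)]

doc = ("The indemnity clause is void if notice is late. "
       "Therefore the liability cap applies only to direct damages, "
       "not to consequential losses of any kind.")

chunks = fixed_window_chunks(doc, size=60, overlap=0)
for c in chunks:
    print(repr(c))
```

The second chunk starts mid-word with "liability cap" and no trace of the indemnity clause it refers to — exactly the dangling-antecedent chunk described above.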

Fix: chunk by semantic boundary. For structured documents, chunk by section heading. For contracts, chunk by clause. For research papers, chunk by paragraph. The rule is: every chunk should be independently answerable — if it needs the chunk before or after it to make sense, the split is in the wrong place.

```python
from langchain_text_splitters import (
    CharacterTextSplitter,
    MarkdownHeaderTextSplitter,
)

# Bad: arbitrary token window
text_splitter = CharacterTextSplitter(chunk_size=512, chunk_overlap=50)

# Better: split by meaningful boundary
text_splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=[("##", "Section"), ("###", "Subsection")]
)
```
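If you'd rather avoid the langchain dependency, the same heading-boundary idea is a few lines of stdlib Python. A minimal sketch (assumes markdown input with `##` headings; the function name is mine):

```python
import re

def split_by_heading(markdown: str, level: int = 2):
    """Split a markdown document at headings of the given level,
    keeping each heading attached to the text that follows it."""
    # Zero-width lookahead: split *before* each line starting with "## "
    pattern = rf"^(?={'#' * level} )"
    parts = re.split(pattern, markdown, flags=re.MULTILINE)
    return [p.strip() for p in parts if p.strip()]

doc = """## Termination
Either party may terminate with 30 days notice.

## Liability
The liability cap applies only to direct damages.
"""

chunks = split_by_heading(doc)
```

Each chunk carries its own heading, so it remains independently answerable when retrieved on its own.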

## Failure 2 — No reranking step

Embedding models optimize for approximate similarity — they're fast and good enough for retrieval, not for ranking. The top-k chunks returned by cosine similarity are not necessarily the most relevant chunks for your specific query. They're the closest in vector space, which is a different thing.

Without a reranker, you're sending the 5 most geometrically similar chunks into the prompt. With a reranker, you're sending the 5 most contextually relevant ones.

The fix is a two-stage pipeline: retrieve 20 candidates with the embedding model, rerank with a cross-encoder, pass the top 5 to the LLM.

```python
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import CrossEncoderReranker
from langchain_community.cross_encoders import HuggingFaceCrossEncoder

model = HuggingFaceCrossEncoder(model_name="BAAI/bge-reranker-base")
compressor = CrossEncoderReranker(model=model, top_n=5)

retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=vector_store.as_retriever(search_kwargs={"k": 20})
)
```

This single change typically improves answer faithfulness more than switching embedding models.
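The shape of the two-stage pipeline is simple to reason about even without the framework. A dependency-free sketch (both scorers here are the same trivial word-overlap stand-in, just to keep it runnable; in a real pipeline the cheap score is embedding cosine similarity and the expensive one a cross-encoder call):

```python
def retrieve_then_rerank(query, docs, cheap_score, expensive_score,
                         fetch_k=20, top_n=5):
    """Stage 1: cheap score over all docs, keep fetch_k candidates.
    Stage 2: expensive score over candidates only, keep top_n."""
    candidates = sorted(docs, key=lambda d: cheap_score(query, d),
                        reverse=True)[:fetch_k]
    return sorted(candidates, key=lambda d: expensive_score(query, d),
                  reverse=True)[:top_n]

def overlap(q, d):
    """Stand-in scorer: shared-word count between query and document."""
    return len(set(q.lower().split()) & set(d.lower().split()))

docs = [
    "the cap on liability is two million dollars",
    "liability insurance must be maintained",
    "the parties cap attendance at two hundred",
]

top = retrieve_then_rerank("what is the liability cap", docs,
                           cheap_score=overlap, expensive_score=overlap,
                           fetch_k=3, top_n=1)
```

The point of the structure: the expensive scorer only ever sees `fetch_k` documents, which is what makes a slow cross-encoder affordable at query time.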

## Failure 3 — The prompt doesn't instruct refusal

If your system prompt doesn't explicitly tell the model what to do when the retrieved context doesn't contain the answer, it will guess. Every time.

LLMs default to helpfulness. Without a refusal instruction, the model interprets "I don't have this in context" as "I should synthesize an answer from my training data." That's not RAG — that's a hallucination pipeline with extra retrieval steps.

SYSTEM_PROMPT = """
You are a research assistant. Answer the user's question using ONLY the context
provided below. If the context does not contain sufficient information to answer
the question, respond with: "I don't have enough information in the provided
documents to answer this."

Do not use prior knowledge. Do not speculate. Cite the source for every claim.
"""

That's it. Three fixes. None of them is glamorous, but skipping them is why production RAG systems fail while tutorial RAG systems pass their demos.

Moizz K

Full-stack AI engineer — RAG, Agents, LLM products