---
title: "The 5-Step Pre-Flight Check for Production LLM Apps"
slug: "production-llm-checklist"
description: "Before you ship an LLM feature, run these five checks. Skip one and you'll fix it at 2am after a client reports it."
date: "2026-04-20"
tags: ["Production", "LLM", "Engineering"]
readingTime: "8 min"
draft: false
---
Shipping an LLM feature is not like shipping a REST endpoint. A REST endpoint either returns the right data or it throws an error. An LLM feature can return wrong data confidently, with good formatting and a professional tone.
This is the checklist I run before every production deployment.
## Step 1 — Retrieval eval
Before testing the LLM output, test the retrieval. The question to answer: for a set of real user queries, does the correct chunk come back in the top-k results?
Build a small evaluation set — 20 to 50 query-document pairs where you know the right answer and which chunk contains it. Run retrieval against this set and measure hit rate (did the right chunk appear in top-5?) and MRR (how high did it rank?).
```python
def evaluate_retrieval(eval_set, retriever, k=5):
    hits = 0
    for item in eval_set:
        results = retriever.invoke(item["query"])
        retrieved_ids = [doc.metadata["chunk_id"] for doc in results[:k]]
        if item["expected_chunk_id"] in retrieved_ids:
            hits += 1
    return hits / len(eval_set)
```
If hit rate is below 0.80, fix retrieval before touching the prompt. No prompt engineering compensates for missing context.
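Hit rate tells you whether the right chunk showed up at all; MRR tells you where it ranked. A sketch of the MRR half of the measurement, assuming the same eval-set shape and the same `chunk_id` metadata field as the hit-rate function above:

```python
def evaluate_mrr(eval_set, retriever, k=5):
    """Mean reciprocal rank: 1/position of the correct chunk, 0 if it's absent."""
    reciprocal_ranks = []
    for item in eval_set:
        results = retriever.invoke(item["query"])
        retrieved_ids = [doc.metadata["chunk_id"] for doc in results[:k]]
        if item["expected_chunk_id"] in retrieved_ids:
            rank = retrieved_ids.index(item["expected_chunk_id"]) + 1
            reciprocal_ranks.append(1.0 / rank)
        else:
            reciprocal_ranks.append(0.0)
    return sum(reciprocal_ranks) / len(eval_set)
```

An MRR of 1.0 means the right chunk is always first; a high hit rate with a low MRR means the right chunk is retrieved but buried, which matters if your prompt only uses the top few chunks.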
## Step 2 — Faithfulness check
Faithfulness measures whether the model's answer is grounded in the retrieved context — or whether it's drifting into its training data.
The manual version: take 20 answers your system generated, read the source chunks, and mark each claim as "supported", "unsupported", or "contradicted". Unsupported plus contradicted should be under 5% before you ship.
The automated version: use an LLM-as-judge prompt.
```python
FAITHFULNESS_PROMPT = """
Context: {context}
Answer: {answer}
Is every factual claim in the answer supported by the context?
Respond with JSON: {{"faithful": true/false, "unsupported_claims": [...]}}
"""
```
This is not a replacement for manual review, but it catches regressions quickly in CI.
## Step 3 — Latency profiling
Measure where time is spent. The three usual bottlenecks: embedding generation, vector search, and LLM completion. Profile each independently.
```python
import time

t0 = time.perf_counter()
query_embedding = embedder.embed_query(query)
t1 = time.perf_counter()
docs = vector_store.similarity_search_by_vector(query_embedding, k=5)
t2 = time.perf_counter()
answer = chain.invoke({"docs": docs, "query": query})
t3 = time.perf_counter()
print(f"Embed: {t1-t0:.3f}s | Retrieve: {t2-t1:.3f}s | Generate: {t3-t2:.3f}s")
```
Target: embed + retrieve under 300ms. If LLM completion is slow, switch to streaming — the total time is the same but perceived latency drops significantly.
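When you switch to streaming, the metric to watch becomes time-to-first-token (TTFT) rather than total completion time. A sketch of measuring it, assuming a LangChain-style runnable that exposes `.stream()`:

```python
import time

def stream_with_ttft(chain, inputs):
    """Stream the answer while recording time-to-first-token,
    which is the latency users actually perceive."""
    start = time.perf_counter()
    first_token_at = None
    chunks = []
    for chunk in chain.stream(inputs):
        if first_token_at is None:
            first_token_at = time.perf_counter() - start
        chunks.append(chunk)
    total = time.perf_counter() - start
    print(f"TTFT: {first_token_at:.3f}s | Total: {total:.3f}s")
    return "".join(str(c) for c in chunks)
```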
## Step 4 — Fallback handling
What does your system do when retrieval returns nothing relevant? What happens when the LLM API is down? What happens when the user's document is 200 pages and exceeds context limits?
Test each failure mode deliberately before users hit it accidentally.
Minimum fallback set:
- Empty retrieval → return a clear "I couldn't find relevant information" message
- API timeout → retry once with exponential backoff, then surface an error
- Context overflow → truncate by relevance score, not by order
```python
import asyncio

from fastapi import HTTPException

async def safe_generate(query: str, docs: list) -> str:
    # Empty retrieval: say so honestly instead of letting the model guess.
    if not docs:
        return "I couldn't find relevant information in the provided documents."
    try:
        return await chain.ainvoke({"docs": docs, "query": query})
    except asyncio.TimeoutError:
        # Surface a gateway timeout rather than a generic 500.
        raise HTTPException(status_code=504, detail="Generation timed out.")
```
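The "retry once with exponential backoff" rule from the list above can be sketched as a small wrapper. This is one reasonable shape, not a prescribed API; the 30-second timeout and the set of retryable exceptions are assumptions you should tune for your provider:

```python
import asyncio

async def generate_with_retry(chain, inputs, attempts=2, base_delay=1.0):
    """Retry transient failures with exponential backoff; re-raise on the last attempt."""
    for attempt in range(attempts):
        try:
            return await asyncio.wait_for(chain.ainvoke(inputs), timeout=30)
        except (asyncio.TimeoutError, ConnectionError):
            if attempt == attempts - 1:
                raise  # out of retries: let the caller surface an error
            await asyncio.sleep(base_delay * 2 ** attempt)  # 1s, then 2s, ...
```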
## Step 5 — Observability
You cannot fix what you cannot see. Before launch, wire up tracing so every request has a full trace: query in, chunks retrieved, prompt sent, response out, latency per step.
LangSmith is the fastest way to get this for LangChain-based systems.
```python
import os

os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = "your-key"
os.environ["LANGCHAIN_PROJECT"] = "your-project-name"
```
That's it — every chain and graph invocation is now traced. After launch, your first job is to look at the traces for failed or low-quality responses and identify the pattern. It's almost always one of the first four steps on this list.
## The order matters
Run these in order. Retrieval eval first, then faithfulness, then latency, then fallbacks, then observability. The reason: if retrieval is broken, steps 2–4 are testing a broken foundation. Fix the foundation first.
An LLM app that passes all five checks before launch is not a perfect system. It's a debuggable one.