---
title: "The 5-Step Pre-Flight Check for Production LLM Apps"
slug: "production-llm-checklist"
description: "Before you ship an LLM feature, run these five checks. Skip one and you'll fix it at 2am after a client reports it."
date: "2026-04-20"
tags: ["Production", "LLM", "Engineering"]
readingTime: "8 min"
draft: false
---
Shipping an LLM feature is not like shipping a REST endpoint. A REST endpoint either returns the right data or it throws an error. An LLM feature can return wrong data confidently, with good formatting and a professional tone.
This is the checklist I run before every production deployment.
## Step 1 — Retrieval eval
Before testing the LLM output, test the retrieval. The question to answer: for a set of real user queries, does the correct chunk come back in the top-k results?
Build a small evaluation set — 20 to 50 query-document pairs where you know the right answer and which chunk contains it. Run retrieval against this set and measure hit rate (did the right chunk appear in top-5?) and MRR (how high did it rank?).
```python
def evaluate_retrieval(eval_set, retriever, k=5):
    hits = 0
    for item in eval_set:
        results = retriever.invoke(item["query"])
        retrieved_ids = [doc.metadata["chunk_id"] for doc in results[:k]]
        if item["expected_chunk_id"] in retrieved_ids:
            hits += 1
    return hits / len(eval_set)
```
If hit rate is below 0.80, fix retrieval before touching the prompt. No prompt engineering compensates for missing context.
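Hit rate tells you whether the right chunk showed up at all; MRR tells you where it ranked. A sketch of the MRR half of the measurement, assuming the same eval-set shape and the same `chunk_id` metadata field as the hit-rate function above:

```python
def evaluate_mrr(eval_set, retriever, k=5):
    """Mean reciprocal rank: 1/position of the correct chunk, 0 if it's absent."""
    reciprocal_ranks = []
    for item in eval_set:
        results = retriever.invoke(item["query"])
        retrieved_ids = [doc.metadata["chunk_id"] for doc in results[:k]]
        if item["expected_chunk_id"] in retrieved_ids:
            rank = retrieved_ids.index(item["expected_chunk_id"]) + 1
            reciprocal_ranks.append(1.0 / rank)
        else:
            reciprocal_ranks.append(0.0)
    return sum(reciprocal_ranks) / len(eval_set)
```

An MRR of 1.0 means the right chunk is always first; a high hit rate with a low MRR means the right chunk is retrieved but buried, which matters if your prompt only uses the top few chunks.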
## Step 2 — Faithfulness check
Faithfulness measures whether the model's answer is grounded in the retrieved context — or whether it's drifting into its training data.
The manual version: take 20 answers your system generated, read the source chunks, and mark each claim as "supported", "unsupported", or "contradicted". Unsupported plus contradicted should be under 5% before you ship.
The automated version: use an LLM-as-judge prompt.
```python
FAITHFULNESS_PROMPT = """
Context: {context}
Answer: {answer}
Is every factual claim in the answer supported by the context?
Respond with JSON: {{"faithful": true/false, "unsupported_claims": [...]}}
"""
```
This is not a replacement for manual review, but it catches regressions quickly in CI.
## Step 3 — Latency profiling
Measure where time is spent. The three usual bottlenecks: embedding generation, vector search, and LLM completion. Profile each independently.
```python
import time

t0 = time.perf_counter()
query_embedding = embedder.embed_query(query)
t1 = time.perf_counter()
docs = vector_store.similarity_search_by_vector(query_embedding, k=5)
t2 = time.perf_counter()
answer = chain.invoke({"docs": docs, "query": query})
t3 = time.perf_counter()
print(f"Embed: {t1-t0:.3f}s | Retrieve: {t2-t1:.3f}s | Generate: {t3-t2:.3f}s")
```
Target: embed + retrieve under 300ms. If LLM completion is slow, switch to streaming — the total time is the same but perceived latency drops significantly.
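When you switch to streaming, the metric to watch becomes time-to-first-token (TTFT) rather than total completion time. A sketch of measuring it, assuming a LangChain-style runnable that exposes `.stream()`:

```python
import time

def stream_with_ttft(chain, inputs):
    """Stream the answer while recording time-to-first-token,
    which is the latency users actually perceive."""
    start = time.perf_counter()
    first_token_at = None
    chunks = []
    for chunk in chain.stream(inputs):
        if first_token_at is None:
            first_token_at = time.perf_counter() - start
        chunks.append(chunk)
    total = time.perf_counter() - start
    print(f"TTFT: {first_token_at:.3f}s | Total: {total:.3f}s")
    return "".join(str(c) for c in chunks)
```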
## Step 4 — Fallback handling
What does your system do when retrieval returns nothing relevant? What happens when the LLM API is down? What happens when the user's document is 200 pages and exceeds context limits?
Test each failure mode deliberately before users hit it accidentally.
Minimum fallback set:
- Empty retrieval → return a clear "I couldn't find relevant information" message
- API timeout → retry once with exponential backoff, then surface an error
- Context overflow → truncate by relevance score, not by order
```python
import asyncio

from fastapi import HTTPException

async def safe_generate(query: str, docs: list) -> str:
    # Empty retrieval: say so honestly instead of letting the model guess.
    if not docs:
        return "I couldn't find relevant information in the provided documents."
    try:
        return await chain.ainvoke({"docs": docs, "query": query})
    except asyncio.TimeoutError:
        # Surface a gateway timeout rather than a generic 500.
        raise HTTPException(status_code=504, detail="Generation timed out.")
```
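The "retry once with exponential backoff" rule from the list above can be sketched as a small wrapper. This is one reasonable shape, not a prescribed API; the 30-second timeout and the set of retryable exceptions are assumptions you should tune for your provider:

```python
import asyncio

async def generate_with_retry(chain, inputs, attempts=2, base_delay=1.0):
    """Retry transient failures with exponential backoff; re-raise on the last attempt."""
    for attempt in range(attempts):
        try:
            return await asyncio.wait_for(chain.ainvoke(inputs), timeout=30)
        except (asyncio.TimeoutError, ConnectionError):
            if attempt == attempts - 1:
                raise  # out of retries: let the caller surface an error
            await asyncio.sleep(base_delay * 2 ** attempt)  # 1s, then 2s, ...
```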
## Step 5 — Observability
You cannot fix what you cannot see. Before launch, wire up tracing so every request has a full trace: query in, chunks retrieved, prompt sent, response out, latency per step.
LangSmith is the fastest way to get this for LangChain-based systems.
```python
import os

os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = "your-key"
os.environ["LANGCHAIN_PROJECT"] = "your-project-name"
```
That's it — every chain and graph invocation is now traced. After launch, your first job is to look at the traces for failed or low-quality responses and identify the pattern. It's almost always one of the first four steps on this list.
## The order matters
Run these in order. Retrieval eval first, then faithfulness, then latency, then fallbacks, then observability. The reason: if retrieval is broken, steps 2–4 are testing a broken foundation. Fix the foundation first.
An LLM app that passes all five checks before launch is not a perfect system. It's a debuggable one.