Retrieval-augmented generation is now the default pattern for grounding LLM answers in your own data. But moving a quick demo to production exposes a long tail of issues that tutorials rarely cover. This post is the checklist we use on client engagements to take RAG from notebook to production.
Why naive RAG breaks in production
The first demo is easy: embed documents, store them in a vector database, retrieve top-k chunks, stuff them into a prompt. Real users immediately surface the gaps — stale answers, hallucinated citations, mismatched topics, and unbearable latency on long documents.
The fix is not a single big change. It is a series of small, compounding improvements across ingestion, retrieval, generation, and evaluation.
Ingestion: chunking is the lever
Most RAG quality problems originate at ingestion. Chunk size and overlap dominate retrieval quality more than the embedding model choice does.
- Prefer semantic chunking (by heading, section, or sentence clusters) over fixed token windows.
- Preserve structural metadata — title, section path, source URL — alongside each chunk.
- Keep tables and code blocks as single chunks; splitting a table mid-row or a function mid-body destroys their meaning.
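The first bullet can be sketched in a few lines. This is a minimal heading-based chunker for markdown sources that keeps the section title and source URL as metadata; the function name and chunk schema are illustrative, and a real pipeline would also merge tiny sections and cap chunk length.

```python
import re

def chunk_by_heading(markdown_text, source_url):
    """Split a markdown document on headings, keeping the section
    title and source URL as metadata alongside each chunk."""
    chunks = []
    current_heading = "Introduction"
    buffer = []

    def flush():
        text = "\n".join(buffer).strip()
        if text:
            chunks.append({
                "text": text,
                "section": current_heading,
                "source_url": source_url,
            })

    for line in markdown_text.splitlines():
        m = re.match(r"^(#{1,6})\s+(.*)", line)
        if m:
            flush()               # close out the previous section
            current_heading = m.group(2)
            buffer = []
        else:
            buffer.append(line)
    flush()                       # last section has no trailing heading
    return chunks
```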
Retrieval: hybrid almost always wins
Pure vector search misses exact keyword matches (product codes, SKUs, error IDs). Pure keyword search misses paraphrased queries. Combine both.
Rule of thumb: hybrid search + a reranker beats any single embedding model tweak by a wide margin.
Our default stack: Postgres with pgvector for dense retrieval, BM25 for keyword, and a cross-encoder reranker on the top 50 merged results.
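Merging the dense and keyword result lists before reranking is usually done with reciprocal rank fusion. A minimal sketch, assuming each retriever returns a list of document ids ordered best-first; the merged top results would then go to the cross-encoder:

```python
def reciprocal_rank_fusion(ranked_lists, k=60):
    """Merge several ranked result lists (e.g. dense vector hits and
    BM25 hits) into one ranking. A document scores 1/(k + rank) in
    each list it appears in; k=60 is the conventional smoothing
    constant from the original RRF paper."""
    scores = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Best fused score first.
    return sorted(scores, key=scores.get, reverse=True)
```

Documents ranked well by both retrievers float to the top, while a strong showing in either single list is still enough to survive into the reranker's candidate set.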
Generation: constrain the model
Prompt engineering matters, but structural choices matter more:
- Always include source URLs in context and require the model to cite them.
- Add a refusal clause: "If the context does not contain the answer, say so."
- Use JSON-mode or tool calling when downstream code consumes the output.
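The first two bullets come down to how the prompt is assembled. A sketch of a prompt builder that numbers each chunk, exposes its source URL, and includes the refusal clause; the function name and chunk keys are illustrative and assume the metadata preserved at ingestion:

```python
def build_grounded_prompt(question, chunks):
    """Assemble a generation prompt that forces citations and permits
    refusal. `chunks` are dicts with "text" and "source_url" keys."""
    context = "\n\n".join(
        f"[{i}] (source: {c['source_url']})\n{c['text']}"
        for i, c in enumerate(chunks, start=1)
    )
    return (
        "Answer using ONLY the context below. Cite sources as [n] "
        "with their URLs. If the context does not contain the answer, "
        "say so instead of guessing.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )
```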
Evaluation: automate it from day one
You cannot improve what you cannot measure. Build an eval set of 50–200 real user questions with expected answers before launch. Re-run it on every retrieval or prompt change.
We use Ragas for retrieval and generation metrics (faithfulness, context precision, answer relevance). Pair it with a handful of end-to-end "golden path" tests in CI.
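A minimal harness for the golden-path tests might look like the following. It is a crude keyword-containment proxy for the LLM-based metrics a tool like Ragas computes, but it is deterministic and cheap enough to run in CI on every change; the case schema and function names are assumptions.

```python
def run_eval(eval_set, answer_fn):
    """Run a golden-path eval set through the RAG pipeline and report
    the pass rate. Each case lists phrases the answer must contain."""
    failures = []
    for case in eval_set:
        answer = answer_fn(case["question"]).lower()
        missing = [p for p in case["must_contain"] if p.lower() not in answer]
        if missing:
            failures.append({"question": case["question"], "missing": missing})
    passed = len(eval_set) - len(failures)
    return {"pass_rate": passed / len(eval_set), "failures": failures}
```

Wire this into CI so a retrieval or prompt change that drops the pass rate below a threshold fails the build.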
Latency: the boring optimizations
- Cache embeddings and reranker outputs aggressively.
- Stream responses to the UI — perceived latency drops dramatically.
- Parallelize retrieval and reranking where possible.
- Pre-compute common query embeddings nightly.
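The first bullet is mostly a memoization wrapper around the embedding call. A sketch keyed by a content hash, so repeated queries and re-ingested chunks never hit the embedding API twice; in production the dict would be Redis or a database table, and the class name here is illustrative:

```python
import hashlib

class EmbeddingCache:
    """Memoize an embedding function by a SHA-256 hash of the input
    text. Hit/miss counters make cache effectiveness observable."""

    def __init__(self, embed_fn):
        self._embed_fn = embed_fn
        self._store = {}
        self.hits = 0
        self.misses = 0

    def embed(self, text):
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if key in self._store:
            self.hits += 1
        else:
            self.misses += 1
            self._store[key] = self._embed_fn(text)
        return self._store[key]
```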
What we wish we knew earlier
RAG is 70% data engineering, 20% evaluation, and 10% prompt design. Teams that invert those ratios usually end up rewriting the system six months in. Invest in your ingestion pipeline and eval harness first.