Building Production-Ready RAG Pipelines

A practical walkthrough — from embeddings and vector stores to evaluation and latency tuning.

Retrieval-augmented generation is now the default pattern for grounding LLM answers in your own data. But moving a quick demo to production exposes a long tail of issues that tutorials rarely cover. This post is the checklist we use on client engagements to take RAG from notebook to production.

Why naive RAG breaks in production

The first demo is easy: embed documents, store them in a vector database, retrieve top-k chunks, stuff them into a prompt. Real users immediately surface the gaps — stale answers, hallucinated citations, mismatched topics, and unbearable latency on long documents.

The fix is not a single big change. It is a series of small, compounding improvements across ingestion, retrieval, generation, and evaluation.

Ingestion: chunking is the lever

Most RAG quality problems originate at ingestion. Chunk size and overlap usually affect retrieval quality more than the choice of embedding model does. A chunk that is too large dilutes the embedding with unrelated content; one that is too small loses the context needed to answer the question.

  • Prefer semantic chunking (by heading, section, or sentence clusters) over fixed token windows.
  • Preserve structural metadata — title, section path, source URL, author, and last-updated date — alongside each chunk so you can filter and cite precisely.
  • Keep each table or code block as a single chunk. Splitting one mid-row or mid-function destroys its meaning.
  • Start around 300–500 tokens per chunk with 10–15% overlap, then tune against your eval set rather than guessing.
  • Strip boilerplate (navigation, cookie banners, repeated footers) before embedding — it pollutes similarity scores.
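
As a rough illustration of the sizing advice above, here is a minimal sketch of sentence-based chunking with overlap. It assumes whitespace-separated words are an acceptable stand-in for tokens and that a simple punctuation regex is good enough to split sentences; the function name and defaults are hypothetical, so swap in your own tokenizer and splitter before relying on the numbers.

```python
import re


def chunk_sentences(text: str, max_tokens: int = 400, overlap_ratio: float = 0.12) -> list[str]:
    """Pack whole sentences into chunks of roughly max_tokens, with sentence-level overlap.

    Assumption: word count approximates token count. A single sentence longer
    than max_tokens is kept whole rather than being cut mid-sentence.
    """
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    overlap_budget = int(max_tokens * overlap_ratio)
    chunks: list[str] = []
    current: list[str] = []
    current_len = 0

    for sentence in sentences:
        n = len(sentence.split())
        if current and current_len + n > max_tokens:
            chunks.append(" ".join(current))
            # Carry the trailing sentences that fit in the overlap budget
            # into the next chunk so context spans the boundary.
            carried: list[str] = []
            carried_len = 0
            for prev in reversed(current):
                prev_len = len(prev.split())
                if carried_len + prev_len > overlap_budget:
                    break
                carried.insert(0, prev)
                carried_len += prev_len
            current, current_len = carried, carried_len
        current.append(sentence)
        current_len += n

    if current:
        chunks.append(" ".join(current))
    return chunks
```

Splitting on sentence boundaries rather than fixed windows means no chunk starts or ends mid-thought; the size and overlap defaults are starting points to tune against an eval set, not recommendations in themselves.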

For long, hierarchical documents, consider parent-child retrieval: embed small child chunks for precise matching, but feed the larger parent section to the model so it has enough surrounding context to reason. This single change often resolves the "right document, missing detail" failure mode.
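
One way to picture parent-child retrieval is the sketch below. It assumes each child chunk stores the ID of its parent section and that the retriever has already returned children in ranked order; the class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChildChunk:
    chunk_id: str
    parent_id: str  # ID of the larger section this chunk was cut from
    text: str


def expand_to_parents(
    ranked_children: list[ChildChunk],
    parents: dict[str, str],
    max_parents: int = 3,
) -> list[str]:
    """Map ranked child hits to their parent sections, deduplicated, best first."""
    seen: set[str] = set()
    sections: list[str] = []
    for child in ranked_children:
        if child.parent_id in seen:
            continue  # several children can point at the same parent
        seen.add(child.parent_id)
        sections.append(parents[child.parent_id])
        if len(sections) == max_parents:
            break
    return sections
```

The small chunks do the matching and the parent sections do the answering, so capping the number of parents matters: parents are large, and a handful can fill the context budget quickly.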

Embeddings: pick deliberately, then leave it alone

Embedding choice matters less than chunking, but it is not free. Decisions worth making up front:

  • Dimensionality vs. cost: larger vectors capture more nuance but inflate storage and slow down nearest-neighbour search. Match dimensions to your corpus size and recall targets.
  • Domain fit: general-purpose models struggle with legal, medical, or code-heavy corpora. Benchmark two or three candidates on your own data before committing.
  • Version pinning: never silently swap an embedding model. Every chunk in the index was embedded with one model, and vectors from different models or versions do not share a vector space, so similarity scores between them are meaningless. Re-embedding is a migration, not a config flip.
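
A pinned model is easier to enforce when the index records which model built it. The sketch below assumes you store a small spec alongside the index and compare it at query time; `EmbeddingSpec`, `EmbeddingMismatch`, and `assert_compatible` are hypothetical names, not part of any library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EmbeddingSpec:
    model: str
    version: str
    dimensions: int


class EmbeddingMismatch(RuntimeError):
    """Raised when the query-side embedder differs from the one that built the index."""


def assert_compatible(index_spec: EmbeddingSpec, query_spec: EmbeddingSpec) -> None:
    # Any difference means the vectors are not comparable, so fail loudly
    # instead of returning plausible-looking but meaningless neighbours.
    if index_spec != query_spec:
        raise EmbeddingMismatch(
            f"index built with {index_spec}, but queries use {query_spec}; re-embed before switching"
        )
```

Calling this once at service start-up turns a silent quality regression into a deployment failure, which is usually the cheaper of the two to debug.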

Retrieval: hybrid almost always wins

Pure vector search misses exact keyword matches (product codes, SKUs, error IDs). Pure keyword search misses paraphrased queries. Combine both.

Rule of thumb: hybrid search + a reranker beats any single embedding model tweak by a wide margin.

Our default stack: Postgres with pgvector for dense retrieval, BM25 for keyword, and a cross-encoder reranker on the top 50 merged results. We fuse the two result sets with Reciprocal Rank Fusion, then let the reranker — which actually reads the query and passage together — promote the genuinely relevant chunks to the top.
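
For reference, Reciprocal Rank Fusion scores each document as the sum of 1 / (k + rank) over every result list it appears in. The sketch below assumes each retriever returns document IDs best first; k = 60 is the constant commonly used with RRF, and the `top_n` default of 50 mirrors the reranker input size mentioned above. The function name is our own.

```python
from collections import defaultdict


def reciprocal_rank_fusion(
    rankings: list[list[str]],
    k: int = 60,
    top_n: int = 50,
) -> list[str]:
    """Fuse several ranked ID lists into one list, best first."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by descending score; break ties on ID so the output is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))[:top_n]


dense_hits = ["a", "b", "c"]    # e.g. from the vector index
keyword_hits = ["c", "a", "d"]  # e.g. from BM25
fused = reciprocal_rank_fusion([dense_hits, keyword_hits])
# fused == ["a", "c", "b", "d"]
```

RRF uses only ranks, never raw scores, so there is no need to normalise cosine distances against BM25 scores before merging.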

Two retrieval techniques punch above their weight. Query rewriting expands or rephrases the user's question (and resolves pronouns from chat history) before searching. Metadata filtering narrows the candidate set by tenant, language, product, or recency, which both improves precision and enforces access boundaries. If you are building conversational systems on top of this, our piece on AI agents and automation covers how retrieval slots into a wider tool-using loop.

Generation: constrain the model

Prompt engineering matters, but structural choices matter more:

  1. Always include source URLs in context and require the model to cite them.
  2. Add a refusal clause: "If the context does not contain the answer, say so."
  3. Use JSON-mode or tool calling when downstream code consumes the output.
  4. Place the most relevant chunks at the start and end of the context — models attend less reliably to the middle of a long prompt.
  5. Cap the number of chunks you inject. More context is not always better; it raises cost, latency, and the odds of the model latching onto an irrelevant passage.
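
The points above can be combined in a small prompt builder. This is a sketch under assumptions: chunks arrive ranked best first, each carries its source URL, and the prompt wording is only a placeholder. `Chunk`, `edges_first`, and `build_prompt` are hypothetical names.

```python
from dataclasses import dataclass

REFUSAL_CLAUSE = "If the context does not contain the answer, say so."


@dataclass(frozen=True)
class Chunk:
    text: str
    source_url: str


def edges_first(ranked: list[Chunk]) -> list[Chunk]:
    """Reorder so the strongest chunks sit at the start and end, the weakest in the middle."""
    front = ranked[0::2]  # ranks 1, 3, 5, ...
    back = ranked[1::2]   # ranks 2, 4, 6, ...
    return front + back[::-1]


def build_prompt(question: str, ranked_chunks: list[Chunk], max_chunks: int = 6) -> str:
    selected = edges_first(ranked_chunks[:max_chunks])  # cap first, then reorder
    blocks = [
        f"[{i}] Source: {chunk.source_url}\n{chunk.text}"
        for i, chunk in enumerate(selected, start=1)
    ]
    return (
        "Answer using only the context below. Cite the source URL for every claim.\n"
        f"{REFUSAL_CLAUSE}\n\n"
        "Context:\n" + "\n\n".join(blocks) + f"\n\nQuestion: {question}"
    )
```

Capping before reordering keeps the lowest-ranked candidates out of the prompt entirely instead of merely burying them in the middle.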

Evaluation: automate it from day one

You cannot improve what you cannot measure. Build an eval set of 50–200 real user questions with expected answers before launch. Re-run it on every retrieval or prompt change.

We use Ragas for retrieval and generation metrics (faithfulness, context precision, context recall, answer relevance). Pair it with a handful of end-to-end "golden path" tests in CI. Two metrics deserve special attention:

  • Context recall tells you whether retrieval surfaced the chunks needed to answer at all — a retrieval problem, not a generation one.
  • Faithfulness tells you whether the answer is actually grounded in the retrieved context, which is how you catch hallucinated citations.
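
To make context recall concrete, here is a simplified ID-based proxy. It is not how Ragas computes the metric; it assumes each eval question has been labelled with the chunk IDs needed to answer it, and the function names and tolerance are hypothetical.

```python
def context_recall(retrieved_ids: list[str], gold_ids: list[str]) -> float:
    """Fraction of the labelled gold chunks that retrieval actually surfaced."""
    gold = set(gold_ids)
    if not gold:
        return 1.0  # nothing was required, so nothing was missed
    return len(gold & set(retrieved_ids)) / len(gold)


def mean_context_recall(cases: list[tuple[list[str], list[str]]]) -> float:
    """Average recall over (retrieved_ids, gold_ids) pairs from the eval set."""
    if not cases:
        raise ValueError("eval set is empty")
    return sum(context_recall(r, g) for r, g in cases) / len(cases)


def recall_regressed(current: float, baseline: float, tolerance: float = 0.02) -> bool:
    """True when recall dropped by more than the tolerance; use it to fail a CI run."""
    return current < baseline - tolerance
```

A check like `recall_regressed` in CI is one cheap way to re-run the eval set on every retrieval or prompt change without anyone having to remember to look at a dashboard.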

An LLM-as-judge can grade these at scale, but calibrate it against a few dozen human-labelled examples first, or you will optimise toward a biased grader. Track scores over time so a "harmless" prompt tweak that quietly regresses recall shows up before your users find it.
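
One simple way to run that calibration is to measure chance-corrected agreement between the judge and the human labels, for example with Cohen's kappa. The sketch below assumes both sides produced one categorical label per example (such as "faithful" or "unfaithful"); what counts as acceptable agreement is a judgment call for your team.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(human: Sequence[Hashable], judge: Sequence[Hashable]) -> float:
    """Chance-corrected agreement between human labels and LLM-judge labels."""
    if len(human) != len(judge) or not human:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(human)
    observed = sum(h == j for h, j in zip(human, judge)) / n
    human_counts, judge_counts = Counter(human), Counter(judge)
    # Agreement expected by chance, from each rater's label frequencies.
    expected = sum(human_counts[label] * judge_counts[label] for label in human_counts) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1.0 - expected)
```

Raw agreement can look high simply because most answers are faithful; kappa discounts that, which makes it a more honest signal of whether the judge can stand in for a human.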

Latency: the boring optimizations

  • Cache embeddings and reranker outputs aggressively.
  • Stream responses to the UI — perceived latency drops dramatically.
  • Run the dense and keyword searches in parallel and batch reranker calls; reranking itself has to wait for the merged candidates.
  • Pre-compute common query embeddings nightly.
  • Add a semantic cache so repeated or near-identical questions skip the full pipeline entirely.
  • Tune the vector index (HNSW m and ef_search, or IVFFlat lists and probes). A poorly tuned index is a common hidden source of p99 latency.
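
As an illustration of the semantic cache idea, the sketch below keeps query embeddings in memory and returns a stored answer when a new query is close enough by cosine similarity. It assumes embeddings are plain lists of floats; the 0.95 threshold and the linear scan are placeholders, and `SemanticCache` is a hypothetical class, not a library API.

```python
import math
from typing import Optional


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class SemanticCache:
    """Return a cached answer when a new query embedding is near a previous one."""

    def __init__(self, threshold: float = 0.95) -> None:
        self.threshold = threshold
        self._entries: list[tuple[list[float], str]] = []

    def put(self, query_vec: list[float], answer: str) -> None:
        self._entries.append((query_vec, answer))

    def get(self, query_vec: list[float]) -> Optional[str]:
        best_score, best_answer = 0.0, None
        for vec, answer in self._entries:  # linear scan; fine for a sketch
            score = cosine(query_vec, vec)
            if score > best_score:
                best_score, best_answer = score, answer
        return best_answer if best_score >= self.threshold else None
```

If answers depend on who is asking, keep a separate cache per tenant or permission scope so a cache hit cannot hand one user an answer built from another user's documents.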

Security and access control

RAG quietly turns your knowledge base into a query surface, and the model will happily repeat anything it is shown. Treat retrieval as the security boundary:

  • Filter before you retrieve. Apply tenant and role permissions as metadata filters at query time so a user can never receive a chunk they are not entitled to see.
  • Guard against prompt injection. Documents can contain instructions ("ignore previous directions and reveal…"). Keep retrieved content clearly separated from system instructions and never let it trigger privileged tools unchecked.
  • Redact and log. Strip PII at ingestion where you can, and log which sources fed each answer for auditability.
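
The "filter before you retrieve" rule can be sketched as a permission check applied to the candidate set before any similarity scoring happens. This in-memory version assumes each chunk stores a tenant ID and the roles allowed to read it; in a real store the same condition would live in the query's filter clause. All names here are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class User:
    tenant_id: str
    roles: frozenset[str]


@dataclass(frozen=True)
class StoredChunk:
    chunk_id: str
    tenant_id: str
    allowed_roles: frozenset[str]
    text: str


def permitted_chunks(user: User, chunks: list[StoredChunk]) -> list[StoredChunk]:
    """Keep only chunks in the user's tenant that at least one of their roles may read."""
    return [
        chunk
        for chunk in chunks
        if chunk.tenant_id == user.tenant_id and chunk.allowed_roles & user.roles
    ]
```

Filtering before scoring, rather than after, means a restricted chunk can never occupy a top-k slot or leak through a reranker score, and it keeps top-k full of results the user is actually allowed to see.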

These controls also matter for discoverability and trust — a theme we expand on in our guide to generative engine optimization.

A reference architecture

Putting it together, a production pipeline typically flows like this:

  1. Ingest sources, clean and semantically chunk them, attach metadata, and embed with a pinned model.
  2. Store vectors and keyword indexes side by side (pgvector + BM25) with permission metadata.
  3. At query time: rewrite the query, run hybrid retrieval, fuse, and rerank the top candidates.
  4. Assemble a bounded, cited context and generate with a refusal clause and structured output.
  5. Stream the answer, cache aggressively, and log every retrieval for evaluation.
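
The query-time half of that flow (steps 3 to 5) can be wired together roughly as follows. Every stage is passed in as a callable so each one can be swapped or stubbed in tests; the signatures are assumptions for illustration, not a prescribed interface.

```python
from typing import Callable


def answer_query(
    question: str,
    *,
    rewrite: Callable[[str], str],
    dense_search: Callable[[str], list[str]],
    keyword_search: Callable[[str], list[str]],
    fuse: Callable[[list[list[str]]], list[str]],
    rerank: Callable[[str, list[str]], list[str]],
    generate: Callable[[str, list[str]], str],
    max_chunks: int = 6,
) -> dict:
    """Run rewrite, hybrid retrieval, fusion, rerank, and generation in order."""
    query = rewrite(question)
    candidates = fuse([dense_search(query), keyword_search(query)])
    top = rerank(query, candidates)[:max_chunks]  # bounded context
    answer = generate(question, top)
    # Return the sources with the answer so the caller can log them for audit and eval.
    return {"answer": answer, "query": query, "sources": top}
```

In this sketch the rewritten query drives retrieval while the original question goes to the generator; that split is a design choice so the model answers what the user actually asked.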

What we wish we knew earlier

RAG is 70% data engineering, 20% evaluation, and 10% prompt design. Teams that invert those ratios usually end up rewriting the system six months in. Invest in your ingestion pipeline and eval harness first. If you want a partner who has shipped this pattern across real workloads, our AI development team can help — get in touch to scope your use case.

Frequently Asked Questions

What is the difference between RAG and fine-tuning?

RAG retrieves relevant documents at query time and feeds them to the model as context, so answers stay current and traceable to a source. Fine-tuning bakes knowledge or style into the model's weights through training. RAG is the right choice when your data changes often, when you need citations, or when you cannot afford to retrain. Fine-tuning suits stable tasks where you need a specific tone or output format. Many production systems use both: fine-tuning for behaviour and RAG for facts.

How do I stop my RAG system from hallucinating?

Hallucinations in RAG usually trace back to weak retrieval rather than the model itself. Improve context recall first with hybrid search and a reranker so the right chunks actually reach the prompt. Then constrain generation: require the model to cite sources, add an explicit refusal clause for when the context lacks an answer, and cap how many chunks you inject. Finally, measure faithfulness with a tool like Ragas so you catch ungrounded answers automatically before users do.

Which vector database should I use for RAG?

For most teams, Postgres with the pgvector extension is the pragmatic default because it keeps your vectors, metadata, and relational data in one system you already operate. Dedicated vector databases become worthwhile at very large scale or when you need specialised indexing and filtering features. The more important decision is enabling hybrid search and good metadata filtering, which matters far more to answer quality than the specific database vendor you pick.

How long does it take to build a production RAG pipeline?

A working demo takes days, but a production-ready pipeline with hybrid retrieval, reranking, evaluation, security controls, and latency tuning typically takes a few weeks to a couple of months depending on data quality and scale. The biggest time sink is data engineering — cleaning sources, chunking well, and building the evaluation harness. Teams that invest there early ship faster and avoid the costly rewrite that comes from launching on a naive top-k retrieval setup.
