RAG Architecture for Real Applications: Beyond the Tutorial

TL;DR

RAG (Retrieval-Augmented Generation) tutorials show the happy path — embed docs, retrieve top-k, stuff into prompt. Production is harder. Chunking strategy matters more than model choice. Use hybrid search (dense + sparse BM25). Re-rank retrieved docs before passing to the LLM. Build a golden eval dataset. Handle retrieval failures gracefully.

Most RAG tutorials show you the happy path: embed some documents, store them in a vector database, retrieve the top-k results, stuff them into the prompt. It works well enough in a demo. Here's what you actually need to think about for a production RAG system where accuracy matters.

Chunking Strategy Matters More Than Model Choice

The most common RAG failure we see is bad chunking. Splitting documents at fixed character counts — the default in most tutorials — is a recipe for retrieval that misses context. A chunk that cuts mid-sentence, or mid-table, or mid-code-block will retrieve poorly and confuse the model.

Use semantic chunking where possible: split at paragraph boundaries, section headings, or semantic units. For structured content like documentation or legal text, chunk by section. For conversational content, chunk by topic shift. The extra implementation effort pays back immediately in retrieval quality.
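As a minimal sketch of splitting at paragraph boundaries, the function below packs whole paragraphs into chunks up to a size budget. The name `chunk_by_paragraph`, the `max_chars` parameter, and the blank-line paragraph convention are assumptions for illustration; real documents may need heading-aware or table-aware splitting on top of this.

```python
import re


def chunk_by_paragraph(text: str, max_chars: int = 1000) -> list[str]:
    """Split text at blank lines, then pack whole paragraphs into chunks."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    chunks: list[str] = []
    current: list[str] = []
    current_len = 0
    for para in paragraphs:
        # Account for the "\n\n" separator when the chunk already has content.
        added = len(para) + (2 if current else 0)
        # Close the current chunk if this paragraph would exceed the budget.
        if current and current_len + added > max_chars:
            chunks.append("\n\n".join(current))
            current, current_len = [], 0
            added = len(para)
        # An oversized paragraph is kept whole rather than cut mid-sentence.
        current.append(para)
        current_len += added
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

The design choice here is that a chunk may exceed `max_chars` when a single paragraph is longer than the budget. That trades a strict size limit for chunks that never break inside a paragraph.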

Retrieval Quality is the Whole Game

As a rough illustration rather than a measured benchmark: upgrading your LLM from GPT-3.5 to GPT-4 might improve output quality by 20%, while improving your retrieval from mediocre to good can improve it by 200%. The model can only work with what you give it. If the retrieved chunks don't contain the answer, no model can reliably produce it; at best it guesses.

Measure retrieval quality separately from end-to-end answer quality. If retrieval is failing, fix retrieval first. Adding a better model on top of broken retrieval is expensive and doesn't fix the root cause.
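One way to score retrieval on its own is with recall@k and reciprocal rank over chunk IDs. The sketch below assumes each test question comes with a set of chunk IDs known to contain the answer; the function names and ID format are hypothetical.

```python
def recall_at_k(retrieved_ids: list[str], relevant_ids: set[str], k: int) -> float:
    """Fraction of the relevant chunks that appear in the top-k results."""
    if not relevant_ids:
        return 0.0
    hits = len(set(retrieved_ids[:k]) & relevant_ids)
    return hits / len(relevant_ids)


def reciprocal_rank(retrieved_ids: list[str], relevant_ids: set[str]) -> float:
    """1 / rank of the first relevant chunk, or 0.0 if none was retrieved."""
    for rank, chunk_id in enumerate(retrieved_ids, start=1):
        if chunk_id in relevant_ids:
            return 1.0 / rank
    return 0.0
```

Averaging these over a question set gives a retrieval-only number that does not move when the prompt or the LLM changes, which makes it easier to tell where a failure originates.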

Hybrid Search: Dense + Sparse

Pure vector (dense) search is great for semantic similarity but bad for exact matches. If a user asks about a specific product code, a clause number, or a proper noun, vector search may return semantically similar but wrong results.

Hybrid search — combining dense vector retrieval with sparse BM25 keyword retrieval — handles both cases well. Most production RAG systems we've built use hybrid search with a weighted combination of both scores. It adds implementation complexity but eliminates a whole class of retrieval failures.
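A weighted combination of the two scores could look like the sketch below. It assumes you already have per-document scores from a dense retriever and from BM25, keyed by document ID. Because the two score scales differ, each set is min-max normalized first. The `alpha` weight and both function names are illustrative assumptions, and other fusion schemes (such as rank-based fusion) are equally valid.

```python
def min_max_normalize(scores: dict[str, float]) -> dict[str, float]:
    """Rescale scores to [0, 1] so dense and sparse scores are comparable."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        # All scores equal: treat every document as equally relevant.
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}


def hybrid_scores(
    dense: dict[str, float], sparse: dict[str, float], alpha: float = 0.5
) -> list[tuple[str, float]]:
    """Fuse dense and sparse scores; alpha weights the dense side."""
    dense_n = min_max_normalize(dense)
    sparse_n = min_max_normalize(sparse)
    fused = {}
    # A document missing from one retriever contributes 0.0 from that side.
    for doc_id in set(dense_n) | set(sparse_n):
        fused[doc_id] = (
            alpha * dense_n.get(doc_id, 0.0)
            + (1 - alpha) * sparse_n.get(doc_id, 0.0)
        )
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)
```

The right value for `alpha` depends on the corpus and query mix, so it is worth tuning against an eval set rather than fixing it by intuition.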

Re-ranking

Retrieve more documents than you need, then re-rank them before passing to the LLM. A cross-encoder re-ranker (or a dedicated re-ranking model) considers the query and each document together, rather than in isolation, and produces much better relevance scores than the initial retrieval step.

The pattern: retrieve top-20 with vector search, re-rank to top-5, pass top-5 to the model. This pipeline consistently outperforms retrieving the top-5 directly.
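The retrieve-then-re-rank pattern can be sketched as a function that takes the first-stage candidates and a pair-scoring function. In a real system `score_pair` would wrap a cross-encoder model; here `token_overlap` is a deliberately crude stand-in so the example runs without any model, and both names are hypothetical.

```python
from typing import Callable


def retrieve_then_rerank(
    query: str,
    candidates: list[str],
    score_pair: Callable[[str, str], float],
    top_n: int = 5,
) -> list[str]:
    """Score each (query, document) pair jointly and keep the best top_n."""
    scored = [(score_pair(query, doc), doc) for doc in candidates]
    # Sort by score only; the stable sort keeps first-stage order on ties.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:top_n]]


def token_overlap(query: str, doc: str) -> float:
    """Toy scorer (NOT a cross-encoder): share of query tokens in the doc."""
    q_tokens = set(query.lower().split())
    d_tokens = set(doc.lower().split())
    return len(q_tokens & d_tokens) / len(q_tokens) if q_tokens else 0.0
```

Keeping the scorer as a parameter means the first-stage retriever and the re-ranker can be swapped or evaluated independently.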

Evaluation is Non-Negotiable

A RAG system without evals is a RAG system you can't improve safely. Build a golden dataset of question-answer pairs from your actual document corpus. Measure retrieval recall (did the right chunk get retrieved?), answer correctness, and faithfulness (did the answer stay grounded in the retrieved content?).

Run evals on every change to your chunking strategy, embedding model, retrieval parameters, or prompt. What feels like an improvement in ad-hoc testing is often a regression somewhere else in the dataset.
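To catch regressions that an aggregate score hides, it helps to compare per-question results between two eval runs. The sketch below assumes each run is recorded as a mapping from question ID to pass/fail; the function name and that result format are assumptions for illustration.

```python
def find_regressions(
    baseline: dict[str, bool], candidate: dict[str, bool]
) -> dict[str, list[str]]:
    """Compare two eval runs question by question."""
    # Only questions present in both runs can be compared.
    shared = baseline.keys() & candidate.keys()
    return {
        "regressed": sorted(q for q in shared if baseline[q] and not candidate[q]),
        "fixed": sorted(q for q in shared if not baseline[q] and candidate[q]),
    }


# Both runs pass 2 of 3 questions, yet one question regressed.
baseline_run = {"q1": True, "q2": False, "q3": True}
candidate_run = {"q1": True, "q2": True, "q3": False}
diff = find_regressions(baseline_run, candidate_run)
```

In this example the overall pass rate is unchanged, so only the per-question diff reveals that `q3` broke while `q2` was fixed.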

Handling Failures Gracefully

What happens when retrieval returns nothing useful? Your RAG system needs a graceful degradation path — acknowledge uncertainty rather than hallucinate, offer to escalate to a human, or surface the raw search results for the user to browse.

The systems that erode user trust fastest are the ones that confidently answer questions they don't have the context to answer correctly.
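A degradation path can be as simple as a relevance gate before the LLM call. The sketch below assumes retrieval scores are normalized to [0, 1] with higher meaning more relevant; the `Retrieved` type, the `min_score` threshold, and the response fields are all hypothetical and would need calibrating against your own eval data.

```python
from dataclasses import dataclass


@dataclass
class Retrieved:
    text: str
    score: float  # assumed: higher is more relevant, in [0, 1]


def build_response_plan(results: list[Retrieved], min_score: float = 0.5) -> dict:
    """Decide whether to answer from context or to degrade gracefully."""
    usable = [r for r in results if r.score >= min_score]
    if usable:
        return {"action": "answer", "context": [r.text for r in usable]}
    # Nothing cleared the bar: admit uncertainty and surface the raw results
    # so the user can browse them or escalate to a human.
    return {
        "action": "decline",
        "message": "I couldn't find a reliable source for that.",
        "raw_results": [r.text for r in results],
    }
```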

Key Takeaways

1. Chunking strategy has more impact on RAG quality than model choice: use semantic chunking, not fixed character splits.
2. Measure retrieval quality separately from end-to-end answer quality; fix retrieval before upgrading your LLM.
3. Hybrid search (dense vector + sparse BM25) handles both semantic and exact-match queries.
4. Retrieve top-20, re-rank to top-5: cross-encoder re-ranking consistently outperforms direct top-k retrieval.
5. Build a golden eval dataset (question-answer pairs from your corpus) and run it on every significant change.
