Production RAG Pipeline: Architecture & Retrieval Guide
Most RAG tutorials show you how to get something working in a Jupyter notebook in 20 lines of LangChain. Production RAG is a fundamentally different engineering problem. The difference between a demo that impresses in a meeting and a system that serves accurate answers at scale comes down to architecture decisions made before you write a single line of retrieval code.
This guide covers the architecture decisions that actually matter for production RAG pipelines — based on 20+ systems built and deployed, including CompliBot (38 API routes, full production RAG pipeline serving compliance teams) and systems processing thousands of documents daily.
What Makes RAG Hard in Production
Production RAG fails in three ways that tutorials never surface:
Retrieval precision collapses at scale
When your document corpus grows beyond a few hundred chunks, naive cosine similarity retrieval starts returning irrelevant results. The model hallucinates because its context window is full of near-misses.
Chunking strategy destroys context
Fixed-size character chunking splits sentences, tables, and code blocks mid-thought. The retrieved chunks are syntactically valid but semantically broken.
Latency is unacceptable without optimization
Embedding + retrieval + reranking + LLM generation chains can take 8-15 seconds in naive implementations. Production needs sub-2-second responses or users abandon.
The Production RAG Architecture
A production RAG pipeline has two separate concerns: the ingestion pipeline (runs once per document, offline) and the query pipeline (runs on every user request, must be fast). Never conflate them.
Ingestion Pipeline
Step 1: Document Processing
Parse each document type correctly. PDFs need layout-aware parsing (not just text extraction) — tables, headers, and footnotes matter. Use PyMuPDF or pdfplumber for PDFs. For Word/HTML, strip boilerplate headers/footers before chunking.
Real implementation: pre-process every document into a normalized markdown format before chunking. This lets you use the same chunking logic regardless of source format.
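The dispatch for that normalization step can be sketched as follows. The parser functions are hypothetical hooks: in a real pipeline `parse_pdf` would wrap a layout-aware library such as PyMuPDF or pdfplumber, and the plain-text path would include boilerplate stripping.

```python
from pathlib import Path

def parse_pdf(path: Path) -> str:
    # Hypothetical hook: plug in a layout-aware parser (PyMuPDF/pdfplumber)
    # that preserves tables, headers, and footnotes as markdown.
    raise NotImplementedError("use a layout-aware PDF parser here")

def parse_plain(path: Path) -> str:
    # Text and markdown pass through; real code would also strip boilerplate.
    return path.read_text()

PARSERS = {".pdf": parse_pdf, ".md": parse_plain, ".txt": parse_plain}

def to_markdown(path: str) -> str:
    """Normalize any supported source file into markdown text."""
    p = Path(path)
    try:
        return PARSERS[p.suffix.lower()](p)
    except KeyError:
        raise ValueError(f"unsupported format: {p.suffix}")
```

Because every format funnels into markdown, the chunking step downstream only ever sees one input shape.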
Step 2: Chunking Strategy
Do not use fixed-size character chunking. Use semantic chunking based on document structure: paragraph boundaries, section headers, or sentence groupings. For legal and compliance documents, chunk by section/clause — these are the natural retrieval units.
# Good: semantic chunking on document structure
chunks = split_by_headers(doc, max_tokens=512)

# Bad: fixed-size character chunking (destroys context)
chunks = [doc[i:i+512] for i in range(0, len(doc), 512)]
Overlap between chunks (10-15% recommended) prevents information loss at chunk boundaries. Store the overlap in metadata, not content, to avoid duplicate retrieval.
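One way to implement this, sketched under the assumption that the document has already been split into paragraphs: the boundary overlap lives only in each chunk's metadata, so the same text is never retrieved twice as content.

```python
def chunk_with_overlap(paragraphs, max_chars=2000, overlap_chars=200):
    """Group paragraphs into chunks; keep boundary overlap in metadata only."""
    chunks, current, size = [], [], 0
    for para in paragraphs:
        if size + len(para) > max_chars and current:
            chunks.append({"content": "\n\n".join(current), "metadata": {}})
            current, size = [], 0
        current.append(para)
        size += len(para)
    if current:
        chunks.append({"content": "\n\n".join(current), "metadata": {}})
    # ~10-15% overlap, stored as metadata rather than duplicated content
    for prev, nxt in zip(chunks, chunks[1:]):
        nxt["metadata"]["overlap_before"] = prev["content"][-overlap_chars:]
    return chunks
```

At prompt-construction time the `overlap_before` text can be prepended to a retrieved chunk for context without ever matching a query on its own.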
Step 3: Embedding + Storage
Use pgvector for production. It handles millions of vectors efficiently, integrates with your existing PostgreSQL stack, and supports HNSW indexes for sub-millisecond ANN search. Store chunk content, metadata (source file, page, section, date), and the embedding in the same table.
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE chunks (
  id uuid PRIMARY KEY,
  content text,
  metadata jsonb,
  embedding vector(1536)
);

CREATE INDEX ON chunks USING hnsw (embedding vector_cosine_ops);
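A query against this table can then combine the HNSW index with SQL metadata filters in one statement. A sketch, with `$1` as the query embedding and `$2` as a source filter (`<=>` is pgvector's cosine distance operator, matching the `vector_cosine_ops` index above):

```sql
-- Top-10 nearest chunks for a query embedding, filtered on metadata
SELECT id, content, metadata,
       1 - (embedding <=> $1) AS similarity
FROM chunks
WHERE metadata->>'source' = $2
ORDER BY embedding <=> $1
LIMIT 10;
```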
Query Pipeline
Step 4: Hybrid Retrieval
Combine semantic search (vector similarity) with keyword search (BM25 or PostgreSQL full-text search). Semantic search handles conceptual queries; keyword search handles exact terms, codes, and proper nouns. Neither alone is sufficient.
Merge results using Reciprocal Rank Fusion (RRF): it is simple, nearly parameter-free (its single constant, k, rarely needs tuning), and consistently outperforms weighted score averaging in practice.
def rrf_score(rank, k=60):
    return 1.0 / (k + rank)

combined = rrf_merge(semantic_results, keyword_results)
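`rrf_merge` itself is only a few lines. A sketch, assuming each input is a ranked list of chunk ids with the best match first:

```python
def rrf_merge(semantic_results, keyword_results, k=60):
    """Merge two ranked lists of chunk ids via Reciprocal Rank Fusion:
    each list contributes 1 / (k + rank) to a chunk's combined score."""
    scores = {}
    for results in (semantic_results, keyword_results):
        for rank, chunk_id in enumerate(results, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    # Chunks appearing high in both lists accumulate the largest scores
    return sorted(scores, key=scores.get, reverse=True)
```

Because only ranks matter, RRF never has to reconcile the incompatible score scales of cosine similarity and BM25.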
Step 5: Reranking
After retrieval, rerank the top 20 chunks to select the best 5 for the LLM context. Use a cross-encoder model (e.g., ms-marco-MiniLM) — slow but highly accurate. Run this on a separate async worker to hide latency.
Without reranking, retrieval precision at k=5 is typically 60-70%. With reranking, 85-90%. The improvement in answer quality is significant.
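A minimal reranking wrapper looks like the sketch below, with `score_fn` standing in for a real cross-encoder (for example, the `CrossEncoder` class from the sentence-transformers library, which scores `(query, passage)` pairs):

```python
def rerank(query, chunks, score_fn, top_k=5):
    """Score each (query, chunk) pair with a cross-encoder and keep the best.

    `score_fn` takes a list of (query, text) pairs and returns one relevance
    score per pair -- e.g. a sentence-transformers CrossEncoder's predict().
    """
    scores = score_fn([(query, chunk) for chunk in chunks])
    ranked = sorted(zip(chunks, scores), key=lambda pair: pair[1], reverse=True)
    return [chunk for chunk, _ in ranked[:top_k]]
```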
Step 6: Context Construction + Generation
Build the LLM prompt with retrieved chunks + source citations. Include the source file, section, and page for every chunk. Instruct the model to cite its sources — this is what allows users to verify answers.
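A sketch of the prompt builder, assuming chunks carry the metadata fields from the ingestion step (the `source`, `section`, and `page` key names are illustrative):

```python
def build_prompt(question, chunks):
    """Assemble the generation prompt with a numbered citation per chunk."""
    context = "\n\n".join(
        f"[{i}] (source: {c['metadata']['source']}, "
        f"section: {c['metadata']['section']}, page: {c['metadata']['page']})\n"
        f"{c['content']}"
        for i, c in enumerate(chunks, start=1)
    )
    return (
        "Answer using ONLY the context below. Cite sources as [n]. "
        "If the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
```

The numbered markers let the model cite `[1]`, `[2]`, etc., which the client can map back to file, section, and page for verification.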
Use Ollama for local inference (llama3.2 or mistral for most use cases) or the Claude API for highest accuracy. Local inference eliminates data sovereignty concerns for enterprise customers.
Performance Architecture
Embedding cache
Cache query embeddings in Redis. Identical queries re-use cached embeddings. Reduces embedding latency by 90% for repeat queries.
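A minimal sketch of the cache, keyed on a hash of the normalized query. A plain dict stands in for Redis here; in production you would swap in a Redis client with GET/SET and a TTL:

```python
import hashlib

class EmbeddingCache:
    """Cache query embeddings by a hash of the normalized query text."""

    def __init__(self, embed_fn):
        self.embed_fn = embed_fn  # calls the embedding API on a cache miss
        self.store = {}           # stand-in for Redis

    def embed(self, query):
        key = hashlib.sha256(query.strip().lower().encode()).hexdigest()
        if key not in self.store:
            self.store[key] = self.embed_fn(query)
        return self.store[key]
```

Normalizing before hashing means trivially different queries ("Hi" vs. "hi ") hit the same cache entry.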
Async ingestion queue
Process document uploads via a queue (Redis + worker). Never block the API on embedding generation. Users get immediate upload confirmation.
Streaming responses
Stream LLM output to the client. A time-to-first-token under 500ms eliminates perceived latency even when full generation takes 3-4 seconds.
Confidence scoring
Score retrieval confidence before generation. If the top reranked chunk has low similarity, return "I don't have enough information" rather than hallucinating.
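A sketch of that gate; the `(chunk, score)` input shape and the 0.3 threshold are assumptions to tune per corpus and reranker:

```python
FALLBACK = "I don't have enough information to answer that."

def answer_or_abstain(reranked, generate_fn, min_score=0.3):
    """Gate generation on retrieval confidence.

    `reranked` is a list of (chunk, score) pairs, best first. If even the
    top chunk scores below the threshold, abstain instead of letting the
    LLM hallucinate over weak context.
    """
    if not reranked or reranked[0][1] < min_score:
        return FALLBACK
    return generate_fn([chunk for chunk, _ in reranked])
```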
The Stack That Works in Production
| Component | Tool | Why |
|---|---|---|
| Vector store | pgvector + PostgreSQL | No separate infrastructure, HNSW index, SQL queries on metadata |
| Embeddings | OpenAI text-embedding-3-small | Best cost/quality ratio for most use cases |
| Reranking | ms-marco-MiniLM cross-encoder | Open source, runs locally, strong performance |
| LLM inference | Ollama (local) or Claude API | Local for data sovereignty, Claude API for highest accuracy |
| Queue | Redis + BullMQ | Reliable, well-documented, integrates with Next.js |
| API | Next.js 16 + TypeScript | Same stack for frontend + API routes, server components |
Need a Production RAG System?
Rogue AI builds production RAG pipelines end-to-end — document ingestion, pgvector, hybrid retrieval, Ollama/Claude integration, and full-stack deployment in Docker. Delivered in weeks, not quarters.