Technical Guide

Production RAG Pipeline: Architecture & Retrieval Guide

Rogue AI · 12 min read

Most RAG tutorials show you how to get something working in a Jupyter notebook in 20 lines of LangChain. Production RAG is a fundamentally different engineering problem. The difference between a demo that impresses in a meeting and a system that serves accurate answers at scale comes down to architecture decisions made before you write a single line of retrieval code.

This guide covers the architecture decisions that actually matter for production RAG pipelines — based on 20+ systems built and deployed, including CompliBot (38 API routes, full production RAG pipeline serving compliance teams) and systems processing thousands of documents daily.

What Makes RAG Hard in Production

Production RAG fails in three ways that tutorials never surface:

Retrieval precision collapses at scale

When your document corpus grows beyond a few hundred chunks, naive cosine similarity retrieval starts returning irrelevant results. The model hallucinates because its context window is full of near-misses.

Chunking strategy destroys context

Fixed-size character chunking splits sentences, tables, and code blocks mid-thought. The retrieved chunks are syntactically valid but semantically broken.

Latency is unacceptable without optimization

Embedding + retrieval + reranking + LLM generation chains can take 8-15 seconds in naive implementations. Production needs sub-2-second responses or users abandon.

The Production RAG Architecture

A production RAG pipeline has two separate concerns: the ingestion pipeline (runs once per document, offline) and the query pipeline (runs on every user request, must be fast). Never conflate them.

Ingestion Pipeline

Step 1: Document Processing

Parse each document type correctly. PDFs need layout-aware parsing (not just text extraction) — tables, headers, and footnotes matter. Use PyMuPDF or pdfplumber for PDFs. For Word/HTML, strip boilerplate headers/footers before chunking.

Real implementation: pre-process every document into a normalized markdown format before chunking. This lets you use the same chunking logic regardless of source format.
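The normalization step can be sketched as a dispatch on file type. The parser names here are hypothetical stand-ins; in practice the PDF branch would wrap PyMuPDF or pdfplumber, and the HTML branch would strip boilerplate before conversion:

```python
from pathlib import Path

def to_markdown(path: str) -> str:
    """Normalize any supported source document into markdown.

    The lambdas are placeholders for real parsers (e.g. PyMuPDF for PDFs);
    the point is the dispatch: one normalizer per format, one chunker after.
    """
    handlers = {
        ".pdf": lambda p: f"<pdf parsed: {p}>",    # layout-aware PDF parsing
        ".html": lambda p: f"<html parsed: {p}>",  # strip nav/header/footer first
        ".docx": lambda p: f"<docx parsed: {p}>",
    }
    suffix = Path(path).suffix.lower()
    if suffix not in handlers:
        raise ValueError(f"unsupported format: {suffix}")
    return handlers[suffix](path)
```

Everything downstream (chunking, embedding, retrieval) then operates on one format, regardless of where the document came from.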

Step 2: Chunking Strategy

Do not use fixed-size character chunking. Use semantic chunking based on document structure: paragraph boundaries, section headers, or sentence groupings. For legal and compliance documents, chunk by section/clause — these are the natural retrieval units.

# Good: semantic chunking
chunks = split_by_headers(doc, max_tokens=512)

# Bad: fixed-size (destroys context)
chunks = [doc[i:i+512] for i in range(0, len(doc), 512)]

Overlap between chunks (10-15% recommended) prevents information loss at chunk boundaries. Store the overlap in metadata, not content, to avoid duplicate retrieval.
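A minimal sketch of header-based chunking with overlap kept in metadata. Token counts are approximated with whitespace-split words here; a real pipeline would use the embedding model's tokenizer:

```python
def split_by_headers(markdown: str, overlap_ratio: float = 0.1):
    """Chunk normalized markdown at header boundaries.

    Overlap with the previous chunk is stored in metadata, not content,
    so the same text is never retrieved twice.
    """
    sections, current = [], []
    for line in markdown.splitlines():
        if line.startswith("#") and current:
            sections.append("\n".join(current))
            current = []
        current.append(line)
    if current:
        sections.append("\n".join(current))

    chunks = []
    for i, text in enumerate(sections):
        meta = {"section_index": i}
        if i > 0:
            # carry the tail of the previous section as overlap metadata
            prev_words = sections[i - 1].split()
            n = max(1, int(len(prev_words) * overlap_ratio))
            meta["overlap_prev"] = " ".join(prev_words[-n:])
        chunks.append({"content": text, "metadata": meta})
    return chunks
```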

Step 3: Embedding + Storage

Use pgvector for production. It handles millions of vectors efficiently, integrates with your existing PostgreSQL stack, and supports HNSW indexes for sub-millisecond ANN search. Store chunk content, metadata (source file, page, section, date), and the embedding in the same table.

-- pgvector schema
CREATE TABLE chunks (
  id uuid PRIMARY KEY,
  content text,
  metadata jsonb,
  embedding vector(1536)
);
CREATE INDEX ON chunks USING hnsw (embedding vector_cosine_ops);
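Against that schema, a nearest-neighbor lookup is a plain SQL query using pgvector's cosine-distance operator (`$1` is the query embedding passed as a parameter):

```sql
-- top-k nearest chunks by cosine distance
SELECT id, content, metadata, embedding <=> $1 AS distance
FROM chunks
ORDER BY embedding <=> $1
LIMIT 20;
```

Because metadata lives in the same table, you can add ordinary `WHERE` filters (source, date range) to the same query.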

Query Pipeline

Step 4: Hybrid Retrieval

Combine semantic search (vector similarity) with keyword search (BM25 or PostgreSQL full-text search). Semantic search handles conceptual queries; keyword search handles exact terms, codes, and proper nouns. Neither alone is sufficient.

Merge results using Reciprocal Rank Fusion (RRF) — simple, parameter-free, consistently outperforms weighted averaging in practice.

# RRF fusion
def rrf_score(rank, k=60):
    return 1.0 / (k + rank)

combined = rrf_merge(semantic_results, keyword_results)
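The `rrf_merge` call above is used but not shown; a minimal, parameter-free implementation (inputs are chunk-id lists ordered best-first) looks like this:

```python
def rrf_merge(semantic_results, keyword_results, k=60, top_n=20):
    """Merge two ranked lists of chunk ids with Reciprocal Rank Fusion.

    A chunk's fused score is the sum of 1/(k + rank) over every list it
    appears in, so items ranked highly by both retrievers rise to the top.
    """
    scores = {}
    for results in (semantic_results, keyword_results):
        for rank, chunk_id in enumerate(results, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)[:top_n]
```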

Step 5: Reranking

After retrieval, rerank the top 20 chunks to select the best 5 for the LLM context. Use a cross-encoder model (e.g., ms-marco-MiniLM) — slow but highly accurate. Run this on a separate async worker to hide latency.

Without reranking, retrieval precision at k=5 is typically 60-70%. With reranking, 85-90%. The improvement in answer quality is significant.
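The reranking step reduces to scoring (query, chunk) pairs and keeping the best. A sketch with a pluggable scorer, so the heavy model stays out of the wiring:

```python
def rerank(query, chunks, score_fn, top_k=5):
    """Rerank retrieved chunks with a cross-encoder style scorer.

    `score_fn` takes a list of (query, chunk_text) pairs and returns one
    relevance score per pair. Top `top_k` chunks go into the LLM context.
    """
    pairs = [(query, c["content"]) for c in chunks]
    scores = score_fn(pairs)
    ranked = sorted(zip(chunks, scores), key=lambda cs: cs[1], reverse=True)
    return [c for c, _ in ranked[:top_k]]
```

With sentence-transformers, `score_fn` would be `CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2").predict`; any function with that shape works.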

Step 6: Context Construction + Generation

Build the LLM prompt with retrieved chunks + source citations. Include the source file, section, and page for every chunk. Instruct the model to cite its sources — this is what allows users to verify answers.
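A minimal prompt builder along these lines (the metadata keys `source`, `section`, `page` are assumptions matching the schema described above):

```python
def build_prompt(question, chunks):
    """Assemble the generation prompt with per-chunk source citations."""
    context = "\n\n".join(
        f"[{i}] (source: {c['metadata']['source']}, "
        f"section: {c['metadata']['section']}, page: {c['metadata']['page']})\n"
        f"{c['content']}"
        for i, c in enumerate(chunks, start=1)
    )
    return (
        "Answer using only the context below. Cite sources as [n] after "
        "each claim. If the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
```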

Use Ollama for local inference (llama3.2 or mistral for most use cases) or the Claude API for highest accuracy. Local inference eliminates data sovereignty concerns for enterprise customers.

Performance Architecture

Embedding cache

Cache query embeddings in Redis. Identical queries re-use cached embeddings. Reduces embedding latency by 90% for repeat queries.
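The cache lookup can be sketched as a hash-keyed get-or-compute. `cache` only needs `get`/`set` (a `redis.Redis` client in production; anything dict-like for testing), and keying on the normalized query lets trivial whitespace/case variants share one entry:

```python
import hashlib

def embed_cached(query, cache, embed_fn):
    """Return the embedding for `query`, computing it only on a cache miss."""
    normalized = " ".join(query.lower().split())
    key = "emb:" + hashlib.sha256(normalized.encode()).hexdigest()
    hit = cache.get(key)
    if hit is not None:
        return hit
    vec = embed_fn(query)  # only reached on a miss
    cache.set(key, vec)
    return vec
```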

Async ingestion queue

Process document uploads via a queue (Redis + worker). Never block the API on embedding generation. Users get immediate upload confirmation.

Streaming responses

Stream LLM output to the client. Time-to-first-token <500ms eliminates perceived latency even when full generation takes 3-4 seconds.

Confidence scoring

Score retrieval confidence before generation. If top reranked chunk has low similarity, return "I don't have enough information" rather than hallucinating.
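The gate itself is a few lines. `reranked` is a list of (chunk, score) pairs, best first; the 0.5 threshold is illustrative and should be tuned on relevance-labeled queries from your own corpus:

```python
FALLBACK = "I don't have enough information to answer that."

def answer_or_abstain(reranked, generate_fn, min_score=0.5):
    """Only call the LLM when the top reranked chunk clears the threshold."""
    if not reranked or reranked[0][1] < min_score:
        return FALLBACK  # abstain rather than hallucinate
    chunks = [chunk for chunk, _ in reranked]
    return generate_fn(chunks)
```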

The Stack That Works in Production

| Component | Tool | Why |
| --- | --- | --- |
| Vector store | pgvector + PostgreSQL | No separate infrastructure, HNSW index, SQL queries on metadata |
| Embeddings | OpenAI text-embedding-3-small | Best cost/quality ratio for most use cases |
| Reranking | ms-marco-MiniLM cross-encoder | Open source, runs locally, strong performance |
| LLM inference | Ollama (local) or Claude API | Local for data sovereignty, Claude API for highest accuracy |
| Queue | Redis + BullMQ | Reliable, well-documented, integrates with Next.js |
| API | Next.js 16 + TypeScript | Same stack for frontend + API routes, server components |

Need a Production RAG System?

Rogue AI builds production RAG pipelines end-to-end — document ingestion, pgvector, hybrid retrieval, Ollama/Claude integration, and full-stack deployment in Docker. Delivered in weeks, not quarters.

Get in touch or see all capabilities.

Rogue AI • Production Systems