Production RAG Pipeline: Architecture & Retrieval Guide
Most RAG tutorials show you how to get something working in a Jupyter notebook in 20 lines of LangChain. Production RAG is a fundamentally different engineering problem. The difference between a demo that impresses in a meeting and a system that serves accurate answers at scale comes down to architecture decisions made before you write a single line of retrieval code.
This guide covers the architecture decisions that actually matter for production RAG pipelines — based on 20+ systems built and deployed, including CompliBot (38 API routes, full production RAG pipeline serving compliance teams) and systems processing thousands of documents daily.
What Makes RAG Hard in Production
Production RAG fails in three ways that tutorials never surface:
Retrieval precision collapses at scale
When your document corpus grows beyond a few hundred chunks, naive cosine similarity retrieval starts returning irrelevant results. The model hallucinates because its context window is full of near-misses.
Chunking strategy destroys context
Fixed-size character chunking splits sentences, tables, and code blocks mid-thought. The retrieved chunks are syntactically valid but semantically broken.
Latency is unacceptable without optimization
Embedding + retrieval + reranking + LLM generation chains can take 8-15 seconds in naive implementations. Production needs sub-2-second responses or users abandon.
The Production RAG Architecture
A production RAG pipeline has two separate concerns: the ingestion pipeline (runs once per document, offline) and the query pipeline (runs on every user request, must be fast). Never conflate them.
Ingestion Pipeline
Step 1: Document Processing
Parse each document type correctly. PDFs need layout-aware parsing (not just text extraction) — tables, headers, and footnotes matter. Use PyMuPDF or pdfplumber for PDFs. For Word/HTML, strip boilerplate headers/footers before chunking.
Real implementation: pre-process every document into a normalized markdown format before chunking. This lets you use the same chunking logic regardless of source format.
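The dispatch for that normalization step can be sketched as follows. The parser functions are hypothetical hooks: in a real pipeline `parse_pdf` would wrap a layout-aware library such as PyMuPDF or pdfplumber, and the plain-text path would include boilerplate stripping.

```python
from pathlib import Path

def parse_pdf(path: Path) -> str:
    # Hypothetical hook: plug in a layout-aware parser (PyMuPDF/pdfplumber)
    # that preserves tables, headers, and footnotes as markdown.
    raise NotImplementedError("use a layout-aware PDF parser here")

def parse_plain(path: Path) -> str:
    # Text and markdown pass through; real code would also strip boilerplate.
    return path.read_text()

PARSERS = {".pdf": parse_pdf, ".md": parse_plain, ".txt": parse_plain}

def to_markdown(path: str) -> str:
    """Normalize any supported source file into markdown text."""
    p = Path(path)
    try:
        return PARSERS[p.suffix.lower()](p)
    except KeyError:
        raise ValueError(f"unsupported format: {p.suffix}")
```

Because every format funnels into markdown, the chunking step downstream only ever sees one input shape.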
Step 2: Chunking Strategy
Do not use fixed-size character chunking. Use semantic chunking based on document structure: paragraph boundaries, section headers, or sentence groupings. For legal and compliance documents, chunk by section/clause — these are the natural retrieval units.
# Good: semantic chunking on document structure
chunks = split_by_headers(doc, max_tokens=512)

# Bad: fixed-size character chunking (destroys context)
chunks = [doc[i:i+512] for i in range(0, len(doc), 512)]
Overlap between chunks (10-15% recommended) prevents information loss at chunk boundaries. Store the overlap in metadata, not content, to avoid duplicate retrieval.
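One way to implement this, sketched under the assumption that the document has already been split into paragraphs: the boundary overlap lives only in each chunk's metadata, so the same text is never retrieved twice as content.

```python
def chunk_with_overlap(paragraphs, max_chars=2000, overlap_chars=200):
    """Group paragraphs into chunks; keep boundary overlap in metadata only."""
    chunks, current, size = [], [], 0
    for para in paragraphs:
        if size + len(para) > max_chars and current:
            chunks.append({"content": "\n\n".join(current), "metadata": {}})
            current, size = [], 0
        current.append(para)
        size += len(para)
    if current:
        chunks.append({"content": "\n\n".join(current), "metadata": {}})
    # ~10-15% overlap, stored as metadata rather than duplicated content
    for prev, nxt in zip(chunks, chunks[1:]):
        nxt["metadata"]["overlap_before"] = prev["content"][-overlap_chars:]
    return chunks
```

At prompt-construction time the `overlap_before` text can be prepended to a retrieved chunk for context without ever matching a query on its own.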
Step 3: Embedding + Storage
Use pgvector for production. It handles millions of vectors efficiently, integrates with your existing PostgreSQL stack, and supports HNSW indexes for sub-millisecond ANN search. Store chunk content, metadata (source file, page, section, date), and the embedding in the same table.
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE chunks (
  id uuid PRIMARY KEY,
  content text,
  metadata jsonb,
  embedding vector(1536)
);

CREATE INDEX ON chunks USING hnsw (embedding vector_cosine_ops);
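A query against this table can then combine the HNSW index with SQL metadata filters in one statement. A sketch, with `$1` as the query embedding and `$2` as a source filter (`<=>` is pgvector's cosine distance operator, matching the `vector_cosine_ops` index above):

```sql
-- Top-10 nearest chunks for a query embedding, filtered on metadata
SELECT id, content, metadata,
       1 - (embedding <=> $1) AS similarity
FROM chunks
WHERE metadata->>'source' = $2
ORDER BY embedding <=> $1
LIMIT 10;
```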
Query Pipeline
Step 4: Hybrid Retrieval
Combine semantic search (vector similarity) with keyword search (BM25 or PostgreSQL full-text search). Semantic search handles conceptual queries; keyword search handles exact terms, codes, and proper nouns. Neither alone is sufficient.
Merge results using Reciprocal Rank Fusion (RRF): it is simple, nearly parameter-free (its single constant, k, rarely needs tuning), and consistently outperforms weighted score averaging in practice.
def rrf_score(rank, k=60):
    return 1.0 / (k + rank)

combined = rrf_merge(semantic_results, keyword_results)
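`rrf_merge` itself is only a few lines. A sketch, assuming each input is a ranked list of chunk ids with the best match first:

```python
def rrf_merge(semantic_results, keyword_results, k=60):
    """Merge two ranked lists of chunk ids via Reciprocal Rank Fusion:
    each list contributes 1 / (k + rank) to a chunk's combined score."""
    scores = {}
    for results in (semantic_results, keyword_results):
        for rank, chunk_id in enumerate(results, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    # Chunks appearing high in both lists accumulate the largest scores
    return sorted(scores, key=scores.get, reverse=True)
```

Because only ranks matter, RRF never has to reconcile the incompatible score scales of cosine similarity and BM25.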
Step 5: Reranking
After retrieval, rerank the top 20 chunks to select the best 5 for the LLM context. Use a cross-encoder model (e.g., ms-marco-MiniLM) — slow but highly accurate. Run this on a separate async worker to hide latency.
Without reranking, retrieval precision at k=5 is typically 60-70%. With reranking, 85-90%. The improvement in answer quality is significant.
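A minimal reranking wrapper looks like the sketch below, with `score_fn` standing in for a real cross-encoder (for example, the `CrossEncoder` class from the sentence-transformers library, which scores `(query, passage)` pairs):

```python
def rerank(query, chunks, score_fn, top_k=5):
    """Score each (query, chunk) pair with a cross-encoder and keep the best.

    `score_fn` takes a list of (query, text) pairs and returns one relevance
    score per pair -- e.g. a sentence-transformers CrossEncoder's predict().
    """
    scores = score_fn([(query, chunk) for chunk in chunks])
    ranked = sorted(zip(chunks, scores), key=lambda pair: pair[1], reverse=True)
    return [chunk for chunk, _ in ranked[:top_k]]
```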
Step 6: Context Construction + Generation
Build the LLM prompt with retrieved chunks + source citations. Include the source file, section, and page for every chunk. Instruct the model to cite its sources — this is what allows users to verify answers.
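A sketch of the prompt builder, assuming chunks carry the metadata fields from the ingestion step (the `source`, `section`, and `page` key names are illustrative):

```python
def build_prompt(question, chunks):
    """Assemble the generation prompt with a numbered citation per chunk."""
    context = "\n\n".join(
        f"[{i}] (source: {c['metadata']['source']}, "
        f"section: {c['metadata']['section']}, page: {c['metadata']['page']})\n"
        f"{c['content']}"
        for i, c in enumerate(chunks, start=1)
    )
    return (
        "Answer using ONLY the context below. Cite sources as [n]. "
        "If the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
```

The numbered markers let the model cite `[1]`, `[2]`, etc., which the client can map back to file, section, and page for verification.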
Use Ollama for local inference (llama3.2 or mistral for most use cases) or the Claude API for highest accuracy. Local inference eliminates data sovereignty concerns for enterprise customers.
Performance Architecture
Embedding cache
Cache query embeddings in Redis. Identical queries re-use cached embeddings. Reduces embedding latency by 90% for repeat queries.
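A minimal sketch of the cache, keyed on a hash of the normalized query. A plain dict stands in for Redis here; in production you would swap in a Redis client with GET/SET and a TTL:

```python
import hashlib

class EmbeddingCache:
    """Cache query embeddings by a hash of the normalized query text."""

    def __init__(self, embed_fn):
        self.embed_fn = embed_fn  # calls the embedding API on a cache miss
        self.store = {}           # stand-in for Redis

    def embed(self, query):
        key = hashlib.sha256(query.strip().lower().encode()).hexdigest()
        if key not in self.store:
            self.store[key] = self.embed_fn(query)
        return self.store[key]
```

Normalizing before hashing means trivially different queries ("Hi" vs. "hi ") hit the same cache entry.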
Async ingestion queue
Process document uploads via a queue (Redis + worker). Never block the API on embedding generation. Users get immediate upload confirmation.
Streaming responses
Stream LLM output to the client. A time-to-first-token under 500ms eliminates perceived latency even when full generation takes 3-4 seconds.
Confidence scoring
Score retrieval confidence before generation. If the top reranked chunk has low similarity, return "I don't have enough information" rather than hallucinating.
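A sketch of that gate; the `(chunk, score)` input shape and the 0.3 threshold are assumptions to tune per corpus and reranker:

```python
FALLBACK = "I don't have enough information to answer that."

def answer_or_abstain(reranked, generate_fn, min_score=0.3):
    """Gate generation on retrieval confidence.

    `reranked` is a list of (chunk, score) pairs, best first. If even the
    top chunk scores below the threshold, abstain instead of letting the
    LLM hallucinate over weak context.
    """
    if not reranked or reranked[0][1] < min_score:
        return FALLBACK
    return generate_fn([chunk for chunk, _ in reranked])
```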
The Stack That Works in Production
| Component | Tool | Why |
|---|---|---|
| Vector store | pgvector + PostgreSQL | No separate infrastructure, HNSW index, SQL queries on metadata |
| Embeddings | OpenAI text-embedding-3-small | Best cost/quality ratio for most use cases |
| Reranking | ms-marco-MiniLM cross-encoder | Open source, runs locally, strong performance |
| LLM inference | Ollama (local) or Claude API | Local for data sovereignty, Claude API for highest accuracy |
| Queue | Redis + BullMQ | Reliable, well-documented, integrates with Next.js |
| API | Next.js 16 + TypeScript | Same stack for frontend + API routes, server components |
Need a Production RAG System?
Rogue AI builds production RAG pipelines end-to-end — document ingestion, pgvector, hybrid retrieval, Ollama/Claude integration, and full-stack deployment in Docker. Delivered in weeks, not quarters.