Private-AI Bridge & RAG

One LLM gateway and a shared knowledge store for the whole fleet

Built by Rogue AI · Shared LLM gateway + self-hosted RAG · Internal infrastructure

Built as the backbone every other app plugs into, then hardened over successive iterations as more apps came online: caching, an allowlist, rate limits, shared-secret auth, metrics, and a retrieval layer added as the fleet's needs grew.

The problem

A fleet of twenty-odd apps that each call an LLM means twenty copies of the same provider code, twenty places to rotate a key, and twenty ways to get retrieval subtly wrong. Worse, every app that wants to answer questions over its own documents needs an embedding model, a vector store, chunking, and reranking, none of which belong in the app itself. The job was to give the whole fleet one place to send model calls and one place to store and search knowledge, locally, without every app carrying that weight.

What I built

A single self-hosted gateway every app talks to instead of a model provider directly. It routes a request to local Ollama or to Claude through one switch, caches repeated calls, enforces a per-app allowlist and rate limits behind a shared secret, and exposes metrics. Bolted to it is a shared retrieval layer: a Qdrant vector store reached through the same gateway, with chunk-on-ingest, reranking, and embeddings generated locally. Apps upsert and search their own collections; the gateway owns the embedding model, the vector store, and the retrieval quality so the apps do not have to.

Architecture

Provider-switchable LLM gateway

One HTTP service that fronts both local Ollama and Claude. Apps send a model-agnostic request and a single switch decides the backend, so a key or model change happens in one place and no app holds a provider client of its own.
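A minimal sketch of how such a switch might look. The request and response shapes, the function names, and the idea of reading the switch from a single string value are assumptions for illustration; the source does not show the gateway's actual schema.

```typescript
// Hypothetical shapes; the real gateway's request schema is not shown in the source.
type Provider = "ollama" | "claude";

interface ChatRequest {
  app: string;
  prompt: string;
}

interface ChatResponse {
  provider: Provider;
  text: string;
}

// A backend is anything that can turn a request into generated text.
type Backend = (req: ChatRequest) => Promise<string>;

// Parse the switch value, refusing anything unrecognised instead of guessing.
function parseProvider(value: string | undefined): Provider {
  if (value === "ollama" || value === "claude") return value;
  throw new Error(`Unknown provider: ${String(value)}`);
}

// Look up the backend for the active provider.
function selectBackend(provider: Provider, backends: Record<Provider, Backend>): Backend {
  return backends[provider];
}

// Build the single entry point apps call. Backends are injected, so provider
// clients live in one place and apps never hold one themselves.
function createGateway(switchValue: string | undefined, backends: Record<Provider, Backend>) {
  const provider = parseProvider(switchValue);
  const backend = selectBackend(provider, backends);
  return async (req: ChatRequest): Promise<ChatResponse> => ({
    provider,
    text: await backend(req),
  });
}
```

In this sketch an Express route handler would call the function returned by `createGateway`, and the switch value could come from one configuration entry, so changing the backend touches nothing in the calling apps.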

Shared RAG over Qdrant

A self-hosted Qdrant vector store, reached only through the gateway's retrieval endpoints (upsert, search, batch, delete) with chunk-on-ingest and an optional rerank pass. Each app keeps its own collection while the retrieval logic is written once.
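Chunk-on-ingest can be sketched as a fixed-size window with overlap. This is one common approach, not necessarily the one the gateway uses; the `chunkText` name and the default size and overlap values are assumptions.

```typescript
interface Chunk {
  index: number;
  text: string;
}

// Split text into fixed-size character windows that overlap, so text cut at
// one boundary is repeated at the start of the next chunk.
// The defaults (800 characters, 100 overlap) are illustrative only.
function chunkText(text: string, size = 800, overlap = 100): Chunk[] {
  if (size <= 0) throw new Error("size must be positive");
  if (overlap < 0 || overlap >= size) throw new Error("overlap must be in [0, size)");

  const chunks: Chunk[] = [];
  const step = size - overlap;
  for (let start = 0; start < text.length; start += step) {
    chunks.push({ index: chunks.length, text: text.slice(start, start + size) });
    if (start + size >= text.length) break; // this window already reached the end
  }
  return chunks;
}
```

Each chunk would then be embedded and upserted into the calling app's collection. Doing the split inside the upsert endpoint is what lets every app benefit from a tuning change without changing its own code.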

Local embeddings

Embeddings are generated by a local Ollama model (nomic-embed-text, 768-dim), so document text is turned into vectors on the same host and never sent to a hosted embedding API.
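A hedged sketch of the embedding call, assuming Ollama's `/api/embed` endpoint on its default port (11434) and a response of the form `{ embeddings: number[][] }`. The `PostJson` helper type and the function names are hypothetical; the HTTP client is injected so the sketch does not commit to one.

```typescript
// Injected HTTP helper (hypothetical): posts JSON and resolves to the parsed body.
type PostJson = (url: string, body: unknown) => Promise<unknown>;

const EMBED_MODEL = "nomic-embed-text";
const EMBED_DIM = 768; // dimension stated in the text for this model

// Validate the response shape and return the first vector, refusing anything
// that would not match the collection's vector size.
function extractEmbedding(response: unknown): number[] {
  if (typeof response !== "object" || response === null) {
    throw new Error("embedding response is not an object");
  }
  const embeddings = (response as { embeddings?: unknown }).embeddings;
  if (!Array.isArray(embeddings) || !Array.isArray(embeddings[0])) {
    throw new Error("embedding response has no vectors");
  }
  const vector: unknown[] = embeddings[0];
  if (vector.length !== EMBED_DIM) {
    throw new Error(`expected ${EMBED_DIM} dimensions, got ${vector.length}`);
  }
  if (!vector.every((v) => typeof v === "number" && Number.isFinite(v))) {
    throw new Error("embedding contains non-numeric values");
  }
  return vector as number[];
}

// Embed one piece of text on the local host. The base URL is an assumption.
async function embed(
  text: string,
  post: PostJson,
  baseUrl = "http://localhost:11434",
): Promise<number[]> {
  const response = await post(`${baseUrl}/api/embed`, { model: EMBED_MODEL, input: text });
  return extractEmbedding(response);
}
```

Checking the dimension before storing a vector catches a silently swapped embedding model early, since a collection created for 768-dimensional vectors would reject other sizes anyway.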

Allowlist, rate limits, and shared-secret auth

Every caller is an allowlisted app authenticated by a shared secret, with per-app rate limits and fail-closed defaults, so a misbehaving or unknown caller is refused rather than served.
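The fail-closed check could be sketched as follows. It assumes one fleet-wide secret, a per-app request budget, and a fixed time window; the `Gatekeeper` class, its field names, and the window length are hypothetical. The comparison loop stands in for Node's `crypto.timingSafeEqual` to keep the sketch dependency-free.

```typescript
interface AppPolicy {
  maxPerWindow: number; // requests allowed per window for this app
}

interface Decision {
  allowed: boolean;
  reason: "ok" | "unknown_app" | "bad_secret" | "rate_limited";
}

// Compare two strings without returning early on the first mismatch.
// Out-of-range charCodeAt yields NaN, which the bitwise operators treat as 0.
function safeEqual(a: string, b: string): boolean {
  const len = Math.max(a.length, b.length);
  let diff = a.length ^ b.length;
  for (let i = 0; i < len; i++) diff |= a.charCodeAt(i) ^ b.charCodeAt(i);
  return diff === 0;
}

class Gatekeeper {
  private readonly sharedSecret: string;
  private readonly allowlist: Map<string, AppPolicy>;
  private readonly windowMs: number;
  private readonly windows = new Map<string, { start: number; count: number }>();

  constructor(sharedSecret: string, allowlist: Map<string, AppPolicy>, windowMs = 60000) {
    // Fail closed: refuse to start with an empty secret rather than accept everyone.
    if (sharedSecret.length === 0) throw new Error("shared secret must not be empty");
    this.sharedSecret = sharedSecret;
    this.allowlist = allowlist;
    this.windowMs = windowMs;
  }

  // Every path that is not explicitly allowed returns a refusal.
  check(app: string | undefined, secret: string | undefined, now: number = Date.now()): Decision {
    const policy = app === undefined ? undefined : this.allowlist.get(app);
    if (app === undefined || policy === undefined) return { allowed: false, reason: "unknown_app" };
    if (!secret || !safeEqual(secret, this.sharedSecret)) return { allowed: false, reason: "bad_secret" };

    let win = this.windows.get(app);
    if (win === undefined || now - win.start >= this.windowMs) {
      win = { start: now, count: 0 }; // start a fresh window
      this.windows.set(app, win);
    }
    if (win.count >= policy.maxPerWindow) return { allowed: false, reason: "rate_limited" };
    win.count += 1;
    return { allowed: true, reason: "ok" };
  }
}
```

In an Express gateway this check would typically run as middleware before any route, translating a refusal into a 401, 403, or 429 response.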

Caching and metrics

An LRU cache collapses repeated calls and a metrics endpoint exposes usage and latency, so the one shared dependency stays observable and cheap to run.
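A minimal LRU sketch built on the insertion order of `Map`, with hit and miss counters of the kind a metrics endpoint could report. The class name, the cache-key format, and the counters are assumptions, not the gateway's actual code.

```typescript
class LruCache<V> {
  private readonly entries = new Map<string, V>();
  private readonly capacity: number;
  hits = 0;
  misses = 0;

  constructor(capacity: number) {
    if (capacity < 1) throw new Error("capacity must be at least 1");
    this.capacity = capacity;
  }

  get(key: string): V | undefined {
    if (!this.entries.has(key)) {
      this.misses += 1;
      return undefined;
    }
    const value = this.entries.get(key) as V;
    // Re-insert so this key becomes the most recently used.
    this.entries.delete(key);
    this.entries.set(key, value);
    this.hits += 1;
    return value;
  }

  set(key: string, value: V): void {
    if (this.entries.has(key)) {
      this.entries.delete(key);
    } else if (this.entries.size >= this.capacity) {
      // Map iterates in insertion order, so the first key is the least recently used.
      const oldest = this.entries.keys().next().value;
      if (oldest !== undefined) this.entries.delete(oldest);
    }
    this.entries.set(key, value);
  }
}

// Hypothetical key: identical provider, model, and prompt map to the same entry.
function cacheKey(provider: string, model: string, prompt: string): string {
  return JSON.stringify([provider, model, prompt]);
}
```

Caching this way only makes sense for calls expected to be deterministic enough to reuse, which is a policy decision the sketch leaves to the caller.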

Hardened single-network Docker

Gateway and vector store run as containers on an isolated network, ports bound to loopback, with a persistent volume for the knowledge store so embeddings survive restarts.

Tech stack

Node.js · Express · Qdrant · Ollama · nomic-embed-text · Docker

What broke first

▸ One gateway beats per-app SDK code. The day a provider key or model id changes, it changes in one place, and every app keeps working without a redeploy, because none of them hold a provider client of their own.
▸ Retrieval quality comes down to chunking and reranking, not model choice. Wiring chunk-on-ingest and a rerank pass into the shared endpoints raised answer quality for every consumer at once, instead of each app reinventing it badly.
▸ A shared service is a shared blast radius. An allowlist, per-app rate limits, a shared secret, and fail-closed defaults stop being optional the moment more than one app depends on the same door, because the cost of getting it wrong is the whole fleet, not one app.
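The rerank pass mentioned in the lessons above can be sketched as a second scoring step over the vector-search hits. The source does not say which reranker the gateway uses, so the scorer is injected; `termOverlap` is a deliberately simple lexical stand-in, and all names here are hypothetical.

```typescript
interface Hit {
  id: string;
  text: string;
  vectorScore: number; // similarity reported by the vector search
}

// A scorer rates how well a passage answers a query; higher is better.
type Scorer = (query: string, passage: string) => number;

// Re-score the candidates, sort by the new score (vector score breaks ties),
// and keep the best few.
function rerank(query: string, hits: Hit[], score: Scorer, keep: number): Hit[] {
  return hits
    .map((hit) => ({ hit, score: score(query, hit.text) }))
    .sort((a, b) => b.score - a.score || b.hit.vectorScore - a.hit.vectorScore)
    .slice(0, keep)
    .map((entry) => entry.hit);
}

// Stand-in scorer: the fraction of distinct query terms found in the passage.
function termOverlap(query: string, passage: string): number {
  const terms = new Set(query.toLowerCase().split(/\W+/).filter(Boolean));
  if (terms.size === 0) return 0;
  const words = new Set(passage.toLowerCase().split(/\W+/).filter(Boolean));
  let found = 0;
  terms.forEach((term) => {
    if (words.has(term)) found += 1;
  });
  return found / terms.size;
}
```

A typical pattern is to over-fetch from the vector store, for example the top 20 hits, and let the rerank pass pick the handful that reach the model.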

Outcome

Every app in the fleet routes its model calls and its document search through one self-hosted service instead of carrying provider code and a vector store of its own. A provider or model change is a one-line edit, retrieval quality improves for every consumer at once, and with the switch set to local Ollama nothing, neither prompts nor documents, leaves the host. It is the shared spine the productised private-AI stack is built on.

Honest limits

This is internal infrastructure, not a product. It runs in a local lab as the shared core of a self-hosted fleet, bound to localhost behind a shared secret. The honest trade-off is centralisation: every app routes its model calls and retrieval through one service, so the bridge is both the convenient single control point and a single dependency, mitigated with caching and fail-closed defaults rather than removed.