Technical Guide

LLM Integration for Business Systems: A Practical Guide

Rogue AI · 10 min read

Every business wants AI capabilities. The pitch is always the same: connect our systems to an LLM, automate the boring work, unlock insights from our data. Most integration projects fail not because the models are bad, but because the integration is poorly architected. Wrong model for the use case. No caching. Synchronous calls on hot paths. No fallback when the API is down.

This guide covers practical patterns for connecting LLMs to real business systems — based on 7 live production applications running Ollama locally or the Claude API. No hype, just what works.

Choosing the Right Model for the Job

The most common mistake is defaulting to the biggest, most expensive model for every task. Most LLM integration tasks need a small, fast, cheap model — not a frontier model. Use frontier models for tasks that require them.

| Task type | Recommended model | Why |
| --- | --- | --- |
| Text classification | Local: llama3.2 3B | Fast, cheap, accurate enough for most categories |
| Summarization | Local: mistral 7B | Excellent at compression tasks, runs on a consumer GPU |
| Structured extraction | Claude Haiku / GPT-4o-mini | JSON output reliability requires strong instruction-following |
| Code generation | Claude Sonnet | Complex reasoning, tool use, large context window |
| Complex reasoning / analysis | Claude Opus | Reserve for highest-value tasks — 10x the cost of Sonnet |
| RAG answers with citations | Local: llama3.1 + Claude API fallback | Local for speed/privacy, API fallback for complex queries |
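The table above can be operationalized as a simple routing map. This is a minimal sketch: the task names and model tags are illustrative, not a fixed API.

```python
# Minimal task-to-model router based on the table above.
# Model tags and task names are illustrative choices, not a standard.
ROUTES = {
    "classify": {"backend": "ollama", "model": "llama3.2:3b"},
    "summarize": {"backend": "ollama", "model": "mistral:7b"},
    "extract": {"backend": "anthropic", "model": "claude-haiku"},
    "code": {"backend": "anthropic", "model": "claude-sonnet"},
}

def route(task: str) -> dict:
    """Return the backend/model for a task, defaulting to the cheapest local model."""
    return ROUTES.get(task, ROUTES["classify"])
```

Keeping the routing table in one place makes the "right model for the job" decision explicit and easy to change without touching call sites.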

Local Inference with Ollama

For most European business applications, local inference is the right default. Data stays on-premises, there are no per-token API costs, and response time is deterministic. Ollama makes this straightforward.

# Ollama exposes an OpenAI-compatible endpoint
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama",  # required by the client, but unused by Ollama
)

response = client.chat.completions.create(
    model="llama3.2",
    messages=[...],
)

Data sovereignty

No data leaves your infrastructure. Critical for GDPR compliance, legal documents, financial data, and healthcare records.

Predictable cost

Fixed compute cost, no per-token billing. 10,000 queries/day at Claude API rates runs €300-500/month; locally, the same volume is effectively free once the hardware is paid for.

No rate limits

API rate limits create unpredictable latency spikes. Local inference is bounded only by GPU throughput.

Customizable

Load fine-tuned models, swap models without code changes, run multiple models simultaneously for different tasks.

Production Integration Patterns

Pattern 1: Streaming Responses

Always stream LLM responses in user-facing interfaces. Time-to-first-token <500ms makes 4-second generations feel instant. Never wait for the full response before showing output.

// Next.js route handler: forward tokens to the client as they arrive
const stream = await client.chat.completions.create({
  model: "llama3.2",
  messages,
  stream: true,
});

return new Response(
  new ReadableStream({
    async start(controller) {
      for await (const chunk of stream) {
        controller.enqueue(chunk.choices[0]?.delta?.content ?? "");
      }
      controller.close();
    },
  }),
);

Pattern 2: Response Caching

Cache LLM responses for identical or semantically similar inputs. A support FAQ system where 80% of queries are variations of the same 10 questions should not call the LLM for each one.

Use Redis with a composite key of model + normalized prompt hash. For semantic similarity caching, embed the query and check for cached responses above a cosine similarity threshold (0.95 works well).
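Both layers of that cache can be sketched in a few lines. This is a minimal illustration: `cosine` and the in-memory `store` stand in for a real embedding model and Redis.

```python
import hashlib

def cache_key(model: str, prompt: str) -> str:
    """Exact-match key: model + hash of the whitespace/case-normalized prompt."""
    normalized = " ".join(prompt.lower().split())
    digest = hashlib.sha256(normalized.encode()).hexdigest()
    return f"llm:{model}:{digest}"

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(x * x for x in b) ** 0.5
    return dot / (na * nb)

def semantic_lookup(query_emb, store, threshold=0.95):
    """Reuse a cached response when a stored query embedding is close enough."""
    best = max(store, key=lambda item: cosine(query_emb, item["emb"]), default=None)
    if best and cosine(query_emb, best["emb"]) >= threshold:
        return best["response"]
    return None
```

Normalizing before hashing means trivial variations ("Hello  world" vs "hello world") hit the same cache entry; the semantic layer catches rephrasings the hash misses.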

Pattern 3: Structured Output

Never parse free-text LLM output. Use structured output (JSON mode or tool use) for any integration that downstream code depends on.

// Force JSON output — parse reliably
response_format: { type: "json_object" }
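In Python, the same pattern pairs JSON mode with explicit validation before anything downstream touches the result. A minimal sketch, assuming the Ollama client from earlier; the "invoice" schema and function names are illustrative.

```python
import json

REQUIRED_KEYS = {"vendor", "total", "currency"}

def parse_invoice(raw: str) -> dict:
    """Parse and validate the model's JSON output before downstream use."""
    data = json.loads(raw)
    missing = REQUIRED_KEYS - data.keys()
    if missing:
        raise ValueError(f"LLM output missing keys: {sorted(missing)}")
    return data

def extract_invoice(client, text: str) -> dict:
    # response_format forces valid JSON on OpenAI-compatible endpoints
    response = client.chat.completions.create(
        model="llama3.2",
        response_format={"type": "json_object"},
        messages=[
            {"role": "system",
             "content": "Return JSON with keys: vendor, total, currency."},
            {"role": "user", "content": text},
        ],
    )
    return parse_invoice(response.choices[0].message.content)
```

Failing loudly on missing keys is the point: a raised error is recoverable, silently propagating a half-parsed object is not.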

Pattern 4: Fallback Chain

Local model → API model → cached response → graceful error. Never let a single point of failure break your integration. When Ollama is down, fall back to the API. When the API is rate-limited, serve a cached response.

async function generateWithFallback(prompt) {
  try { return await ollamaGenerate(prompt); }  // local first
  catch { /* Ollama down — fall through to the API */ }
  try { return await claudeGenerate(prompt); }  // then the API
  catch { return getCachedResponse(prompt) ?? GRACEFUL_ERROR; }
}

What This Looks Like in Practice

The 7 production applications running LLM integration at Rogue AI use this exact architecture:

CompliBot: RAG + Ollama + pgvector — compliance Q&A with source citations
FwChange: Ollama for firewall rule conflict analysis and plain-English explanations
VidPipe: Whisper for transcription → Claude for content analysis and tagging
IntelBriefs: Perplexity API + Claude for structured deep research reports
NetMap: Claude for network topology analysis and risk narrative generation
Avatar Studio: LoRA fine-tuning + Wan 2.2 + Claude for prompt generation
AI Lab: Open WebUI + Ollama + LlamaFactory — local model training and inference workbench

Need LLM Integration in Your Systems?

Rogue AI integrates Ollama, Claude API, and custom models into real business systems — with streaming responses, caching, fallback chains, and production Docker deployment. Built to run, not to demo.

Get in touch or see all capabilities.

Rogue AI • Production Systems