LLM Integration for Business Systems: A Practical Guide
Every business wants AI capabilities. The pitch is always the same: connect our systems to an LLM, automate the boring work, unlock insights from our data. Most integration projects fail not because the models are bad, but because the integration is poorly architected. Wrong model for the use case. No caching. Synchronous calls on hot paths. No fallback when the API is down.
This guide covers practical patterns for connecting LLMs to real business systems — based on 7 live production applications running Ollama locally or the Claude API. No hype, just what works.
Choosing the Right Model for the Job
The most common mistake is defaulting to the biggest, most expensive model for every task. Most LLM integration tasks need a small, fast, cheap model — not a frontier model. Use frontier models for tasks that require them.
| Task type | Recommended model | Why |
|---|---|---|
| Text classification | Local: llama3.2 3B | Fast, cheap, accurate enough for most categories |
| Summarization | Local: mistral 7B | Excellent at compression tasks, runs on consumer GPU |
| Structured extraction | Claude Haiku / GPT-4o-mini | JSON output reliability requires instruction-following |
| Code generation | Claude Sonnet | Complex reasoning, tool use, large context window |
| Complex reasoning / analysis | Claude Opus | Reserve for highest-value tasks — 10x cost of Sonnet |
| RAG answers with citations | Local: llama3.1 + Claude API fallback | Local for speed/privacy, API fallback for complex queries |
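In code, the routing in the table above can be reduced to a small lookup. The task labels and model tags below are illustrative assumptions, not a fixed API — substitute whatever names your stack uses.

```python
# Map task types to models, mirroring the table above.
# Task labels and model tags here are illustrative.
MODEL_FOR_TASK = {
    "classification": "llama3.2:3b",   # local, fast, cheap
    "summarization": "mistral:7b",     # local, strong at compression
    "extraction": "claude-haiku",      # reliable structured output
    "code": "claude-sonnet",           # complex reasoning, tool use
    "analysis": "claude-opus",         # reserve for highest-value work
}

def pick_model(task_type: str) -> str:
    """Return the model for a task, defaulting to the cheapest local option."""
    return MODEL_FOR_TASK.get(task_type, "llama3.2:3b")
```

Defaulting to the smallest local model on unknown task types keeps cost failures cheap: a misrouted task degrades quality, not your API bill.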
Local Inference with Ollama
For most European business applications, local inference is the right default. Data stays on-premises, there are no per-token API costs, and response time is deterministic. Ollama makes this straightforward.
```python
from openai import OpenAI

# Ollama exposes an OpenAI-compatible endpoint on port 11434
client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama",  # required by the client, but ignored by Ollama
)

response = client.chat.completions.create(
    model="llama3.2",
    messages=[...],
)
```
Data sovereignty
No data leaves your infrastructure. Critical for GDPR compliance, legal documents, financial data, and healthcare records.
Predictable cost
Fixed compute cost, no per-token billing. 10,000 queries/day at Claude API rates = €300-500/month. Locally, the same is effectively free after hardware.
No rate limits
API rate limits create unpredictable latency spikes. Local inference is bounded only by GPU throughput.
Customizable
Load fine-tuned models, swap models without code changes, run multiple models simultaneously for different tasks.
Production Integration Patterns
Pattern 1: Streaming Responses
Always stream LLM responses in user-facing interfaces. A time-to-first-token under 500ms makes a 4-second generation feel responsive. Never wait for the full response before showing output.
```typescript
const stream = await client.chat.completions.create({
  model: "llama3.2",
  messages,
  stream: true,
});

for await (const chunk of stream) {
  controller.enqueue(chunk.choices[0]?.delta?.content ?? "");
}
```
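On the consumer side, the same pattern in Python is a loop that accumulates delta fragments while rendering each one immediately. This sketch assumes chunks arrive as parsed JSON dicts in the OpenAI streaming shape (`choices[0].delta.content`); adapt the accessors if your client returns typed objects.

```python
def collect_stream(chunks) -> str:
    """Accumulate streamed delta fragments into the full response text.

    Assumes OpenAI-style streaming chunks as parsed JSON dicts:
    text lives at chunk["choices"][0]["delta"]["content"] and is
    absent on role/stop chunks.
    """
    parts = []
    for chunk in chunks:
        choices = chunk.get("choices") or []
        delta = choices[0].get("delta", {}).get("content") if choices else None
        if delta:
            parts.append(delta)
            # In a real UI, render `delta` here as it arrives.
    return "".join(parts)
```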
Pattern 2: Response Caching
Cache LLM responses for identical or semantically similar inputs. A support FAQ system where 80% of queries are variations of the same 10 questions should not call the LLM for each one.
Use Redis with a composite key of model + normalized prompt hash. For semantic similarity caching, embed the query and check for cached responses above a cosine similarity threshold (0.95 works well).
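A minimal sketch of the exact-match variant, assuming a mapping-like store (swap in a Redis client with `get`/`setex` in production). The function names and the `llm:` key prefix are illustrative choices, not a standard.

```python
import hashlib

def cache_key(model: str, prompt: str) -> str:
    """Composite key: model + hash of the normalized prompt."""
    normalized = " ".join(prompt.lower().split())  # lowercase, collapse whitespace
    digest = hashlib.sha256(normalized.encode()).hexdigest()
    return f"llm:{model}:{digest}"

def cached_generate(model, prompt, generate, cache, ttl=3600):
    """Check the cache before calling the model.

    `cache` is any dict-like store here; with Redis, replace the
    assignment with cache.setex(key, ttl, answer) so entries expire.
    `generate` is your actual LLM call.
    """
    key = cache_key(model, prompt)
    if key in cache:
        return cache[key]
    answer = generate(prompt)
    cache[key] = answer
    return answer
```

Normalizing before hashing means trivial variations (case, extra whitespace) hit the same entry; semantic similarity caching goes further by embedding the query, as described above.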
Pattern 3: Structured Output
Never parse free-text LLM output. Use structured output (JSON mode or tool use) for any integration that downstream code depends on.
```typescript
response_format: { type: "json_object" }
```
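Even with JSON mode, validate before handing the result to downstream code. A minimal sketch, assuming a hypothetical two-field schema (`category`, `confidence`) — replace it with your own, or use a schema library like Pydantic:

```python
import json

REQUIRED_FIELDS = {"category", "confidence"}  # illustrative schema

def parse_structured(raw: str) -> dict:
    """Parse and validate JSON-mode output.

    Raises ValueError at the integration boundary rather than letting
    malformed model output propagate into business logic.
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"model returned invalid JSON: {exc}") from exc
    missing = REQUIRED_FIELDS - data.keys()
    if missing:
        raise ValueError(f"model output missing fields: {sorted(missing)}")
    return data
```

Failing loudly here is deliberate: a raised error can trigger a retry or fallback, while silently accepted garbage corrupts whatever consumes it.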
Pattern 4: Fallback Chain
Local model → API model → cached response → graceful error. Never let a single point of failure break your integration. When Ollama is down, fall back to the API. When the API is rate-limited, serve a cached response.
```typescript
async function generateWithFallback(prompt: string): Promise<string> {
  try { return await ollamaGenerate(prompt); }   // local first
  catch { return await claudeGenerate(prompt); } // API fallback
}
```
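The full four-tier chain described above can be sketched as follows. `ollama`, `claude`, and `cache_lookup` are placeholder callables wrapping your own clients — none of these names come from a real library.

```python
def generate_with_fallback(prompt, ollama, claude, cache_lookup):
    """Local model → API model → cached response → graceful error.

    `ollama` and `claude` are callables wrapping your actual clients;
    `cache_lookup` returns a cached answer or None.
    """
    for attempt in (ollama, claude):
        try:
            return attempt(prompt)
        except Exception:
            continue  # log here, then fall through to the next tier
    cached = cache_lookup(prompt)
    if cached is not None:
        return cached
    return "Sorry, the assistant is temporarily unavailable."  # graceful error
```

The ordering encodes the trade-off: cheapest and most private first, then capability, then staleness, and only then failure — which the user sees as a message, not a stack trace.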
What This Looks Like in Practice
The 7 production applications running LLM integration at Rogue AI use this exact architecture.
Need LLM Integration in Your Systems?
Rogue AI integrates Ollama, Claude API, and custom models into real business systems — with streaming responses, caching, fallback chains, and production Docker deployment. Built to run, not to demo.