Building LLM Features That Survive Production

A team we talked to spent six weeks fine-tuning a model to stop it giving wrong answers. The fine-tune barely moved the numbers, because the wrong answers were never a model problem. The system was retrieving the wrong documents, and no prompt or weight update can fix bad inputs. Two days of work on the retrieval step would have closed most of the gap. They reached for the heaviest tool in the box first, and it cost them a month and a half.
This is the most common expensive mistake in LLM feature development, and it comes from doing the stages out of order. The order that survives contact with real users is plain: integrate, then evaluate, then fine-tune. You only reach the third stage when the evidence from the second one says you have to. This piece is about the sequencing, not the mechanics. Each stage has its own deep-dive; the point here is knowing when and why each one earns its place.
Why teams reach for the wrong tool first
Fine-tuning feels like the serious answer. It is what a “real” ML team does. Prompting feels like cheating, retrieval feels like plumbing, and adjusting model weights feels like engineering. So that is where the attention goes: toward the lever that signals competence rather than the one that solves the problem.
The trouble is that fine-tuning is also the most expensive lever you have, the slowest to iterate, and the hardest to debug. A prompt change is a thirty-second loop. A retrieval change is an afternoon. A fine-tune is hours of GPU time per iteration, a dataset you have to build and curate, and a feedback cycle measured in days. When something goes wrong after a fine-tune, you cannot read the weights to see why. You are debugging a black box you just made more opaque. Reaching for it first means paying the highest iteration cost in the stack to fix a problem you have not yet localised.
Stage 1: Integrate — make it work before you make it good
The first stage is a working path from user input to model output, grounded in your data, that holds up under load. Three things carry most of the weight here, in roughly this order of leverage.
Prompt design is the cheapest, fastest lever you own. A clear instruction, a few well-chosen examples, and a defined output shape will get you further than most teams expect before they ever consider touching weights. Exhaust it first, because the iteration loop is measured in seconds.
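As a rough illustration of those three ingredients, here is a minimal sketch of a prompt assembled from an instruction, a few examples, and a stated output shape. The names `Example` and `build_prompt` and the Q/A layout are assumptions made for this sketch, not a prescribed format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Example:
    """One worked example shown to the model (hypothetical structure)."""
    question: str
    answer: str


def build_prompt(instruction: str, examples: list[Example],
                 output_shape: str, question: str) -> str:
    """Assemble a prompt from an instruction, examples, and an output shape."""
    parts = [instruction.strip(), ""]
    for ex in examples:
        parts.append(f"Q: {ex.question}")
        parts.append(f"A: {ex.answer}")
        parts.append("")
    # State the expected output shape explicitly instead of hoping for it.
    parts.append(f"Respond in this format: {output_shape}")
    parts.append("")
    parts.append(f"Q: {question}")
    parts.append("A:")
    return "\n".join(parts)
```

Because the prompt is plain data, each edit to the instruction or the examples is a change you can rerun in seconds.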
Retrieval is the default way to give a model your data. The model does not know your contracts, your product docs, or last quarter’s tickets, so you fetch the relevant pieces at query time and put them in the context window. This is where most “the model is wrong” complaints actually live, and getting it right is its own discipline. We go deep on that in the production RAG pipeline guide.
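The fetch-then-insert shape can be sketched in a few lines. This toy version scores documents by word overlap with the query, which is a deliberate stand-in for the embedding or keyword search a real pipeline would use. The function names and the prompt wording are assumptions for illustration.

```python
import re


def tokenize(text: str) -> set[str]:
    """Lowercase the text and split it into a set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query: str, documents: list[str], k: int = 2) -> list[str]:
    """Return up to k documents sharing the most words with the query.

    Word overlap is a toy scoring rule, not a production retriever.
    """
    q = tokenize(query)
    scored = [(len(q & tokenize(doc)), i, doc) for i, doc in enumerate(documents)]
    # Drop documents with no overlap; rank by overlap, then by original order.
    hits = sorted((s for s in scored if s[0] > 0), key=lambda s: (-s[0], s[1]))
    return [doc for _, _, doc in hits[:k]]


def build_context_prompt(query: str, documents: list[str], k: int = 2) -> str:
    """Fetch passages at query time and place them in the prompt."""
    passages = retrieve(query, documents, k)
    context = "\n".join(f"- {p}" for p in passages) or "(no relevant passages found)"
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )
```

When an answer is wrong, printing what `retrieve` returned is usually the first check: if the right passage is not in the context, no prompt wording will recover it.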
The plumbing is the unglamorous 80% of the real work: streaming so the interface feels alive, timeouts and retries so a slow upstream does not hang the request, fallbacks for when the model or the retriever is down, and a hard eye on token cost before it surprises you on the invoice. None of this shows up in a demo, and all of it shows up in production. The wiring patterns (request shape, error handling, where the model sits relative to your existing services) are covered in integrating LLMs into business systems.
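A minimal sketch of the timeout, retry, and fallback part, assuming the model client is a callable that accepts a timeout and raises `TimeoutError` or `ConnectionError` on failure. The `primary` and `fallback` callables are hypothetical stand-ins for whatever client you use, and real clients may raise their own exception types.

```python
import time
from typing import Callable


def call_with_fallback(
    primary: Callable[[str, float], str],
    fallback: Callable[[str], str],
    prompt: str,
    timeout_s: float = 10.0,
    max_attempts: int = 3,
    base_delay_s: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Try the primary model with retries, then fall back.

    `primary` is expected to enforce `timeout_s` itself, as HTTP
    clients usually do.
    """
    for attempt in range(max_attempts):
        try:
            return primary(prompt, timeout_s)
        except (TimeoutError, ConnectionError):
            # Exponential backoff between attempts: 0.5s, 1s, 2s, ...
            if attempt < max_attempts - 1:
                sleep(base_delay_s * 2 ** attempt)
    # Every attempt failed: degrade gracefully instead of hanging the request.
    return fallback(prompt)
```

The `sleep` parameter is injectable so the retry behaviour can be tested without waiting. The fallback might be a cached answer, a smaller model, or an honest "try again later" message.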
Most features should ship after this stage. A well-prompted model with good retrieval and solid plumbing is a real product. You have not tuned anything, and that is the point.
Stage 2: Evaluate — you cannot improve what you cannot measure
Before you change anything to make the system “better”, you need a fixed point that tells you whether a change actually helped. That fixed point is an eval set: a collection of representative inputs with known-good expected outputs that you run on every change. Without it, “better” is a vibe. Someone watched it answer two questions well and declared progress.
Build the eval set before you tune anything, because it is the only thing that lets you compare two versions honestly. With it, a prompt tweak, a retrieval change, and a fine-tune are all measurable against the same bar. Without it, you are flying blind and calling it intuition. The how — what to put in the set, how to score open-ended outputs, how to keep it honest as the product grows — is its own topic, covered in the LLM evaluation guide. For sequencing, the rule is simple: no eval set, no fine-tune.
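In its smallest form, an eval set is a list of inputs with expected outputs plus a function that scores any version of the system against it. The sketch below assumes exact-match scoring after light normalisation, which only suits short, closed answers; open-ended outputs need a different scorer. `EvalCase` and `run_eval` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class EvalCase:
    """One representative input with its known-good output."""
    prompt: str
    expected: str


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial differences do not fail a case."""
    return " ".join(text.lower().split())


def run_eval(system: Callable[[str], str], cases: list[EvalCase]) -> float:
    """Return the fraction of cases the system answers correctly."""
    if not cases:
        raise ValueError("eval set is empty")
    passed = sum(
        normalize(system(case.prompt)) == normalize(case.expected)
        for case in cases
    )
    return passed / len(cases)
```

Here `system` is whatever you are comparing: a prompt variant, a retriever change, or a fine-tuned model. Each one is wrapped as a function from input to output, so all of them are scored against the same bar.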
Stage 3: Fine-tune, and only now
Fine-tuning earns its place when the evidence from Stage 2 shows that prompting and retrieval have hit a real ceiling, not a suspected one. Three signals genuinely justify a LoRA fine-tune:
- A capability the model cannot reach through prompting, no matter how you phrase it: a domain reasoning pattern or a task structure the base model just does not have.
- A style or format you need every single time (a house tone, a strict output schema, a domain dialect) that examples in the prompt only get right most of the time. Baking it into the weights makes it the default rather than a request.
- Latency or cost from giant prompts. When you are stuffing thousands of tokens of instructions and examples into every call, fine-tuning that behaviour in lets you send a far shorter prompt and pay for it once in training rather than on every request.
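For the third signal, a back-of-the-envelope break-even helps turn "giant prompts are expensive" into a number. This sketch assumes a flat per-token input price and a one-off training cost. The function name is hypothetical and any figures you plug in are your own, not a provider's rates.

```python
def break_even_requests(
    tokens_saved_per_request: int,
    price_per_1k_input_tokens: float,
    training_cost: float,
) -> float:
    """Number of requests after which a shorter prompt repays the training cost.

    Simplifying assumption: the fine-tuned model is billed at the same
    per-token price as the base model, which may not hold for your provider.
    """
    saving_per_request = tokens_saved_per_request / 1000 * price_per_1k_input_tokens
    if saving_per_request <= 0:
        raise ValueError("a fine-tune that saves no tokens never breaks even")
    return training_cost / saving_per_request
```

If the break-even count is far above your expected traffic, the cost argument for fine-tuning does not hold, and trimming the prompt by hand is the cheaper move.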
The one distinction that matters
Fine-tuning teaches behaviour; retrieval supplies facts. Do not fine-tune to inject knowledge. A model that “learned” your documents in training will confidently invent the parts it half remembers, and you cannot update it without retraining. Facts belong in the context window, fetched fresh. Weights are for how the model acts, not what it knows.
When you do reach this stage, LoRA is the practical default. It adapts a small set of parameters instead of the whole model, which keeps the iteration cost and the hardware demand sane. The mechanics, the dataset work, and the gotchas are in the LoRA fine-tuning guide.
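To see why the parameter count stays small: LoRA leaves a weight matrix of shape d_out × d_in frozen and trains two low-rank factors of shapes d_out × r and r × d_in. The arithmetic below compares the two counts for a single matrix. The layer size and rank in the usage note are illustrative assumptions, not figures from any particular model.

```python
def full_update_params(d_out: int, d_in: int) -> int:
    """Trainable parameters when updating the whole weight matrix."""
    return d_out * d_in


def lora_params(d_out: int, d_in: int, rank: int) -> int:
    """Trainable parameters for the two low-rank factors.

    B has shape (d_out, rank) and A has shape (rank, d_in).
    """
    return rank * (d_out + d_in)


def trainable_fraction(d_out: int, d_in: int, rank: int) -> float:
    """Share of the matrix's parameters that LoRA actually trains."""
    return lora_params(d_out, d_in, rank) / full_update_params(d_out, d_in)
```

For a hypothetical 4096 × 4096 projection at rank 8, that is 65,536 trainable parameters against 16,777,216, or about 0.4% of the matrix.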
The loop, not the line
Integrate → evaluate → fine-tune reads like a line, but you run it as a loop. Evaluation is not a one-time gate at the end; it gates every change. A prompt edit, a new retriever, a reranker: each one goes through the same eval set, and you keep the version that moved the bar.
You will cycle between integrate and evaluate many times before fine-tuning ever comes up. And the loop does not close at launch. Production failures — the real questions users ask that your system gets wrong — feed straight back into the eval set, which makes the next round of changes measurable against the failures that actually happened rather than the ones you imagined. The eval set is a living artefact, not a checkbox. Most features never reach Stage 3 at all, and that is a healthy outcome, not a gap.
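The two halves of the loop, gating a change and feeding failures back, can be sketched as follows. Both function names and the dictionary layout of an eval case are assumptions for illustration. The scores are whatever your eval run produces for the current and candidate versions.

```python
def gate(baseline_score: float, candidate_score: float, min_gain: float = 0.0) -> bool:
    """Keep a change only if it beats the current version on the eval set."""
    return candidate_score > baseline_score + min_gain


def add_production_failure(
    eval_set: list[dict[str, str]], prompt: str, expected: str
) -> list[dict[str, str]]:
    """Return a new eval set that includes a failure seen in production.

    Prompts already in the set are not added twice.
    """
    if any(case["prompt"] == prompt for case in eval_set):
        return eval_set
    return eval_set + [{"prompt": prompt, "expected": expected}]
```

A non-zero `min_gain` is one way to avoid keeping changes whose improvement is within the noise of the eval run. What threshold is reasonable depends on the size of your set.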
Where this goes wrong
Fine-tuning to fix hallucination
Hallucination is almost always a retrieval problem: the model had the wrong context or no context. Teaching the weights to memorise harder makes it worse, not better. Fix the inputs first.
No eval set, so “better” is a vibe
Without a measured baseline, every change is a guess and every claim of improvement is unfalsifiable. You will chase regressions you cannot see and ship changes that quietly make things worse.
Skipping integration hardening
The demo works because it ran once, on a fast day, with no load. Production times out, the upstream hiccups, the token bill spikes, and there is no fallback. The plumbing was the work all along.
Each of these is a stage done out of order, and together they are most of what we see in the AI projects that fail before production. The fix is not a smarter model. It is doing the cheap stages properly before reaching for the expensive one.
Closing
The order is the whole discipline. Integrate until you have a real working path, evaluate so you know what “better” means, and treat fine-tuning as the thing you do last, rarely, and only with evidence. Most of the value lives in the first two stages, and most of the wasted months live in skipping them.
Related reading: for the wiring and the data plumbing, start with integrating LLMs into business systems and the production RAG pipeline guide. When you are ready to measure, the LLM evaluation guide is the fixed point everything else is judged against.
Quick Reference
The three stages, and when each one earns its place
| Stage | What it does | Iteration cost | Reach for it when |
|---|---|---|---|
| Integrate | Prompt + retrieval + plumbing | Seconds to an afternoon | Always first — most features ship here |
| Evaluate | An eval set that gates every change | Build once, run forever | Before you tune anything |
| Fine-tune (LoRA) | Bake behaviour into the weights | Hours of GPU per loop | Only when evidence shows a real ceiling |
Frequently Asked Questions
Should I fine-tune or use RAG?
Use retrieval (RAG) to give a model facts, and fine-tuning to change how it behaves. Most teams reach for fine-tuning when the real fix was retrieval. If the model is wrong about your data, that is almost always a retrieval problem, not a weights problem.
When is fine-tuning actually worth it?
Only after prompting and retrieval have hit a measured ceiling. Three signals justify it: a capability prompting cannot reach, a style or format you need every single time, or latency and cost from giant prompts you want to bake in. If you cannot point to one of those with evidence, do not fine-tune.
Does fine-tuning fix hallucination?
No. Hallucination is almost always a retrieval problem — the model had the wrong context or none. Teaching the weights to memorise harder makes it worse. Fix the inputs first.
Why do most LLM features fail before production?
They do the stages out of order: fine-tuning before measuring, no eval set so “better” is a guess, or skipping the integration hardening so the demo works but production times out. The fix is not a smarter model; it is doing the cheap stages properly first.