Technical Guide

LLM Evaluation: How to Test AI Systems Before Production

VarnaAI Founder · 13 min read

Every AI project has a moment where someone says "it works" — usually after watching the system handle three or four inputs in a meeting. That is a demo, not a measurement. The gap between a system that works in the demo and a system you can put in front of real users is almost entirely an evaluation gap: you cannot improve, ship, or defend what you have not measured. This guide covers how to build evaluation into an AI system from the start — what to measure, how to score it without burning a budget, and why offline numbers alone will still let a broken system through.

Why "It Works" Is Not a Measurement

A demo is a sample of size three, hand-picked, run once. "It works" under those conditions tells you the system is not completely broken. It tells you nothing else. The questions that actually decide whether you have a product are different ones: does it work on inputs you did not choose? Does it still work after you changed the prompt? Is it better or worse than it was last week? Is it good enough to ship — and how would you know?

None of those have answers without a repeatable measurement. The teams that ship reliable AI systems are not smarter than the ones that do not. They decided what "good" means before they started building, wrote it down as a test, and ran that test on every change. Everything else in this guide is mechanics. That decision is the actual discipline, and it is the most common thing missing from the AI projects that fail before production.

What You Are Actually Evaluating

"Evaluation" gets used as if it means one thing. It means three, and conflating them is how teams end up with a high score and an unhappy user base at the same time.

Task quality — does the output do the job

Correctness, completeness, format adherence, faithfulness to the source material. This is what most people picture when they hear "evaluation," and it is only one third of the picture.

Regressions — did a change break something else

Fixing one prompt almost always degrades a different case. Without a way to measure the whole surface, you trade visible bugs for invisible ones and call it progress.

Operational cost — latency and tokens per request

A system that is two percent more accurate but three times slower and four times more expensive may be a worse system. Quality numbers without cost numbers are half a decision.

You need all three. Most teams measure the first, eyeball the second, and discover the third from the invoice.
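One way to keep all three in view is to record cost next to quality for every case, so a comparison between two runs is never quality-only. A minimal sketch in Python; the field names and the summary statistics are illustrative, not a prescribed schema:

```python
from dataclasses import dataclass

@dataclass
class EvalResult:
    """One eval case's outcome, with quality and cost recorded together."""
    case_id: str
    passed: bool            # task quality: did the output meet the expected properties?
    latency_s: float        # operational cost: wall-clock time for the request
    prompt_tokens: int
    completion_tokens: int

def summarise(results: list[EvalResult]) -> dict:
    """Aggregate one run so it can be compared against the previous run."""
    n = len(results)
    return {
        "pass_rate": sum(r.passed for r in results) / n,
        "p50_latency_s": sorted(r.latency_s for r in results)[n // 2],
        "avg_tokens": sum(r.prompt_tokens + r.completion_tokens for r in results) / n,
    }
```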

The Eval Set Is the Asset

The most valuable artifact in an AI project is not the prompt and not the model. It is the labelled set of inputs with known-good outputs. The prompt and the model are things you will change repeatedly; the eval set is the fixed point that tells you whether the change helped. Build it before you build the system.

Where the examples come from

Real inputs beat synthetic ones. Pull from actual user requests, support tickets, the documents the system will really see. Synthetic examples are fine for filling a coverage gap — a rare case you do not have real data for yet — but an eval set that is entirely synthetic only measures how well your model agrees with the model that generated the data.

Size: smaller than you think, but deliberate

You do not need ten thousand cases. You need fifty to two hundred that are deliberately chosen: the common case, the boring case, the known hard cases, the edge cases that have burned you before, and a handful of adversarial inputs. A curated eighty-case set that covers your real failure modes beats a random two-thousand-case dump every time — and you will actually look at the results.

Give every case a reason to exist

Each case needs three things: the input, the expected output or the expected properties of the output, and a one-line note on why it is in the set — "regression for the bug from March," "tests refusal of out-of-scope requests," "the common case." That note is what stops the eval set rotting into a pile of examples nobody trusts and nobody dares delete.
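Concretely, a case can be one small record. A sketch of what that might look like, with illustrative field names rather than a required schema:

```python
# One eval case: the input, the expected properties, and the reason it exists.
case = {
    "id": "refund-out-of-scope-01",
    "input": "Can you refund my gym membership?",       # pulled from a real request
    "expected": {
        "must_contain": ["out of scope"],                # properties, not exact wording
        "must_not_contain": ["refund issued"],
        "no_pii": True,
    },
    "why": "tests refusal of out-of-scope requests",     # the one-line note
}
```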

A Scoring Ladder: Cheapest Method That Still Discriminates

Do not reach for the expensive scoring method first. Climb the ladder, and stop at the lowest rung that still tells the difference between a good output and a bad one.

1. Deterministic checks

Did it return valid JSON? Are the required fields present? Is it under the length limit? Did it cite a source ID that actually exists? Free, instant, and they catch most catastrophic failures. Start here, always.
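A sketch of this rung in Python. The required fields ("answer", "source_id"), the length limit, and the JSON output format are assumptions about a hypothetical system, not a standard:

```python
import json

def deterministic_checks(raw_output: str, valid_source_ids: set[str],
                         max_chars: int = 2000) -> list[str]:
    """Return a list of failure reasons; an empty list means every check passed."""
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError:
        return ["output is not valid JSON"]
    failures = []
    for field in ("answer", "source_id"):
        if field not in data:
            failures.append(f"missing required field: {field}")
    if len(raw_output) > max_chars:
        failures.append("output exceeds length limit")
    if "source_id" in data and data["source_id"] not in valid_source_ids:
        failures.append(f"cited source does not exist: {data['source_id']}")
    return failures
```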

2. Property assertions

Not exact-match — exact-match breaks on every valid rephrasing — but specific, checkable claims: the answer contains the order number, the summary mentions the deadline, no PII appears in the output. Cheap, robust, and they encode what actually matters instead of the exact words.
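The same idea as code, checking the properties from the illustrative case format above. The PII screen shown is deliberately crude (email addresses only); a real one would cover more:

```python
import re

def property_assertions(output: str, expected: dict) -> list[str]:
    """Check the properties that matter without requiring exact wording."""
    failures = []
    for phrase in expected.get("must_contain", []):
        if phrase.lower() not in output.lower():
            failures.append(f"expected phrase missing: {phrase!r}")
    for phrase in expected.get("must_not_contain", []):
        if phrase.lower() in output.lower():
            failures.append(f"forbidden phrase present: {phrase!r}")
    # Crude PII screen: flag anything that looks like an email address.
    if expected.get("no_pii") and re.search(r"\b[\w.+-]+@[\w-]+\.\w{2,}\b", output):
        failures.append("output appears to contain an email address")
    return failures
```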

3. LLM-as-judge

A separate model scores the output against a rubric. Necessary for anything genuinely subjective — tone, helpfulness, reasoning quality, whether an answer is well-structured. Powerful, and easy to misuse — the next section is entirely about not fooling yourself with it.

4. Human review

The ground truth, and the most expensive rung. Reserve it for calibrating the cheaper methods and for the cases where the cheaper methods disagree. You cannot human-review every run; you can human-review enough to trust the automation that does.

Rule of thumb: a deterministic check that catches a real bug is worth more than an LLM judge that has an opinion about it.

LLM-as-Judge Without Fooling Yourself

LLM-as-judge is the workhorse of modern evaluation and the easiest place to quietly lie to yourself. The failure modes are well known, and on a real project you will hit all of them.

Position and verbosity bias

Judge models favour the option shown first and the answer that is longer, independent of quality. If you compare two outputs, randomise the order and control for length — otherwise you are measuring formatting and calling it quality.
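The order half of that is cheap to fix: randomise which output the judge sees first and record the swap so the verdict maps back to the right output. A sketch, with a hypothetical prompt format:

```python
import random

def pairwise_judge_prompt(output_a: str, output_b: str) -> tuple[str, bool]:
    """Build a pairwise comparison prompt with the two outputs in random order.

    Returns the prompt and whether the pair was swapped, so a verdict of "A" or "B"
    can be mapped back to the original outputs afterwards.
    """
    swapped = random.random() < 0.5
    first, second = (output_b, output_a) if swapped else (output_a, output_b)
    prompt = (
        "Which answer better resolves the user's question without inventing facts?\n\n"
        f"Answer A:\n{first}\n\n"
        f"Answer B:\n{second}\n\n"
        "Reply with exactly 'A' or 'B'."
    )
    return prompt, swapped
```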

Self-preference

A model judging outputs tends to rate its own family's style more highly. If the same model generates the output and grades it, your scores are optimistic. Use a different model — ideally a different provider — as the judge.

Vague rubrics produce vague scores

"Rate the helpfulness from one to ten" gives you noise dressed up as a number. "Does the answer resolve the user's question without inventing facts? Yes or no, then one line of reasoning" gives you a signal. Use binary or low-cardinality scales with explicit, written criteria.

No human calibration

An LLM judge you have never checked against human labels is an unvalidated instrument. Label thirty to fifty cases by hand, measure how often the judge agrees, and re-check that agreement every time you change the judge model or the rubric.
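The calibration itself is a few lines: grade the same hand-labelled cases with the judge and track how often the two agree. A sketch:

```python
def judge_agreement(human_labels: dict[str, bool], judge_labels: dict[str, bool]) -> float:
    """Fraction of hand-labelled cases where the LLM judge matches the human label."""
    shared = human_labels.keys() & judge_labels.keys()
    if not shared:
        raise ValueError("no overlapping case ids to compare")
    return sum(human_labels[cid] == judge_labels[cid] for cid in shared) / len(shared)

# Re-run this whenever the judge model or the rubric changes: a drop in agreement
# means the instrument moved, not the system under test.
```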

LLM-as-judge is a measurement tool, not an oracle. Treat it like one: calibrate it, version it, and be suspicious of a score that jumped the same week you changed the judge.

Regression Testing: Where Evals Earn Their Keep

The first eval run tells you where you stand. Every run after that is the actual return on the work. AI systems change constantly — prompts get tuned, models get swapped, a dependency updates, a provider silently adjusts the model behind a stable API name. Each of those can improve your target case and quietly break five others.

Run the full eval set on every meaningful change: prompt edits, model version bumps, retrieval changes, a rebuilt RAG index. Then gate it. A change that drops the score below an agreed threshold does not merge. Wiring the eval suite into CI as a merge gate is the single highest-leverage item on this list, and it is the one most teams skip — because it feels like overhead until the first time it catches a regression you were about to ship.
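The gate can be a short script whose exit code fails the CI job when the pass rate drops below the agreed bar. A sketch; run_eval_set stands in for whatever harness you use, and summarise is the aggregate from the earlier sketch:

```python
import sys

PASS_RATE_THRESHOLD = 0.90   # the agreed bar, decided by the team, not by this script

def main() -> int:
    results = run_eval_set("eval_cases.jsonl")   # placeholder: runs every case, returns EvalResults
    summary = summarise(results)
    print(f"pass rate: {summary['pass_rate']:.2%}")
    for r in results:
        if not r.passed:
            print(f"FAIL {r.case_id}")
    # A non-zero exit code fails the CI job, which blocks the merge.
    return 0 if summary["pass_rate"] >= PASS_RATE_THRESHOLD else 1

if __name__ == "__main__":
    sys.exit(main())
```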

Two cases deserve specific attention. When you swap models — even "upgrading" to a newer one — the eval set is how you find out what you traded away; newer is not uniformly better for your specific task. And watch for provider drift: a hosted model behind a stable API name is not a stable model. Periodic re-runs against a fixed eval set catch that change before your users do. The same discipline applies whether you are evaluating a prompt, a retrieval pipeline, or a fine-tuned model.

Offline Evals Are Not Enough: Measuring in Production

A perfect score on your eval set means the system handles the inputs you thought of. Production is the inputs you did not. Offline evaluation is necessary but not sufficient — you also need to measure the system where it actually runs.

Sample and trace real traffic

Log the input, the output, the retrieved context, and the token counts for a sample of live requests. You cannot fix what you cannot see, and the distribution of real inputs always drifts away from your eval set over time.
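A sketch of sampled tracing. A local JSONL file stands in for whatever logging or observability backend you actually use, and sampled records may need PII scrubbing before they are kept:

```python
import json
import random
import time
import uuid

TRACE_SAMPLE_RATE = 0.05  # trace roughly one in twenty live requests

def maybe_trace(user_input: str, retrieved_context: list[str], output: str,
                prompt_tokens: int, completion_tokens: int) -> None:
    """Record a sampled trace of one live request."""
    if random.random() > TRACE_SAMPLE_RATE:
        return
    record = {
        "trace_id": str(uuid.uuid4()),
        "ts": time.time(),
        "input": user_input,
        "retrieved_context": retrieved_context,
        "output": output,
        "prompt_tokens": prompt_tokens,
        "completion_tokens": completion_tokens,
    }
    with open("traces.jsonl", "a") as f:
        f.write(json.dumps(record) + "\n")
```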

Watch the cheap online signals

Implicit feedback — did the user retry, rephrase, abandon the session, copy the answer, or escalate to a human — is noisier than a label but it is free and it is real. A sudden spike in retries is a regression alarm you did not have to build.

Feed production failures back into the eval set

Every real failure that was not already in your eval set is a gap in the set. The loop is simple: production surfaces a failure, it becomes a labelled case, the regression suite catches it forever. That loop is what makes a system get better instead of just different.

Where to Start

You do not need an eval platform, a vendor, or a framework to begin. The first version is deliberately unsophisticated:

One. Fifty real inputs with known-good outputs, in a file, each with a note on why it is there.

Two. A script that runs them through the system and applies deterministic checks plus a handful of property assertions (a minimal sketch follows after this list).

Three. One number at the end, and the list of which cases failed.
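A sketch of that first script, assuming a JSONL file of cases in the illustrative format from earlier. call_system and KNOWN_SOURCE_IDS are placeholders for your own entry point and source registry; deterministic_checks and property_assertions are the sketches from the scoring ladder above:

```python
import json

def run_case(case: dict) -> list[str]:
    output = call_system(case["input"])    # placeholder: your system's entry point
    failures = deterministic_checks(output, valid_source_ids=KNOWN_SOURCE_IDS)
    failures += property_assertions(output, case["expected"])
    return failures

def main() -> None:
    with open("eval_cases.jsonl") as f:
        cases = [json.loads(line) for line in f if line.strip()]
    failed = {}
    for case in cases:
        failures = run_case(case)
        if failures:
            failed[case["id"]] = failures
    print(f"{len(cases) - len(failed)}/{len(cases)} cases passed")   # the one number
    for case_id, reasons in failed.items():
        print(f"FAIL {case_id}: {'; '.join(reasons)}")               # the list of failed cases

if __name__ == "__main__":
    main()
```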

Run it before and after your next change. The first time it catches a regression you would have shipped, it has paid for itself. Add LLM-as-judge when deterministic checks genuinely cannot capture what you care about — not before. Add production tracing once the offline set is stable. The teams with reliable AI systems did not start with sophisticated evaluation; they started with a file of examples and the discipline to run it on every change.

Related reading: why most AI projects fail before production, LLM integration for business systems, and the practical LoRA fine-tuning guide.
