Technical Guide

LoRA Fine-Tuning: A Practical Guide to Training Custom Models

R, VarnaAI Founder
12 min read

Most LoRA tutorials show you how to run a training script. That's the easy part. The hard parts are upstream and downstream: knowing whether LoRA is the right tool for your problem at all, building a dataset that actually teaches the model what you want, and verifying afterwards that you taught it the pattern rather than memorised the noise. This guide covers the full loop, from problem framing to deployment, based on production LoRA training across both language and image models.

What LoRA Actually Does

Low-Rank Adaptation freezes the base model's weights and trains a small pair of low-rank matrices that get added to specific layers (usually the attention projections, sometimes the MLPs too). Instead of updating the full d × d weight matrix, you train two much smaller matrices of shape d × r and r × d, where r is typically 8 to 64. Their product, scaled by a factor alpha/r, is added to the frozen weights at inference time.

The practical consequences: training memory drops by 10x or more, the adapter file is megabytes instead of gigabytes, and you can hot-swap adapters on the same base model. The cost: LoRA can shift behaviour but can't reliably teach the model facts it didn't already know. That distinction is the single most important thing to internalise before you start.
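The update is small enough to sketch in a few lines of numpy. Shapes and values here are illustrative (libraries like PEFT apply this per-layer), but the parameter arithmetic is the real point:

```python
import numpy as np

d, r = 4096, 16                   # hidden size and LoRA rank (illustrative)
W = np.zeros((d, d))              # frozen base weight (stand-in values)
A = np.random.randn(r, d) * 0.01  # trainable down-projection, r x d
B = np.zeros((d, r))              # trainable up-projection, d x r (zero-init,
                                  # so the adapter starts as a no-op)
alpha = 32                        # scaling hyperparameter

# Effective weight at inference: frozen base plus scaled low-rank update.
W_eff = W + (alpha / r) * (B @ A)

full_params = d * d
lora_params = A.size + B.size
print(f"trainable fraction: {lora_params / full_params:.4f}")  # 0.0078
```

At r=16 on a 4096-wide layer you train under 1% of the layer's parameters, which is where the memory and file-size savings come from.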

LoRA vs RAG: Pick the Right Tool

Half the LoRA projects I've seen should have been RAG projects. The decision is straightforward once you frame it correctly:

Use RAG when the answer lives in documents

Customer support knowledge bases, product docs, contract clauses, internal wikis. The model needs to look things up, not change how it reasons. Adding a new product line shouldn't require retraining anything.

Use LoRA when you need to change behaviour or style

Always responding in a specific format, adopting a domain tone, following an unusual instruction pattern, or producing a specific visual style for image models. The model already knows the material; you're teaching it how to present it.

Use both when both apply

A LoRA-fine-tuned model that responds in your house style, with RAG providing the factual grounding. The LoRA shapes the voice; RAG keeps it accurate. This is the most common production pattern.

Where LoRA fails: teaching new facts

If you fine-tune a model on a thousand examples of "Q: When did we launch product X? A: March 2025", you'll get correct answers on those exact phrasings and unreliable answers on every other phrasing of the same question. Use RAG.

Dataset Preparation: Where Most Projects Fail

Dataset quality dominates everything else. A 500-example clean dataset will outperform a 10,000-example noisy one. The patterns that actually matter:

Format consistency

Pick one prompt template and use it for every example. If half your examples have a system message and half don't, the model learns that the system message is optional context rather than a controlling signal. For chat models, follow the base model's expected conversation format exactly — wrong special tokens silently degrade results.
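Consistency is easy to check mechanically before training. A minimal sketch, assuming examples are stored as chat-style message lists (the field names are the common convention, not a required schema):

```python
def check_format_consistency(examples):
    """Flag examples whose message structure differs from the first one.

    Each example is a list of {"role": ..., "content": ...} dicts.
    Returns the indices of inconsistent examples.
    """
    if not examples:
        return []
    reference = [m["role"] for m in examples[0]]
    return [i for i, ex in enumerate(examples)
            if [m["role"] for m in ex] != reference]

dataset = [
    [{"role": "system", "content": "You are a claims assistant."},
     {"role": "user", "content": "Summarise this claim."},
     {"role": "assistant", "content": "..."}],
    [{"role": "user", "content": "Summarise this claim."},  # missing system msg
     {"role": "assistant", "content": "..."}],
]
print(check_format_consistency(dataset))  # [1]
```

Run a check like this on the full dataset and fix every flagged example before you spend GPU hours on it.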

Diversity of input, consistency of output style

You want the model to recognise the pattern across many phrasings. Generate or collect inputs from real users, paraphrase aggressively, and include edge cases. Outputs should be consistent in structure and voice — that's the behaviour you're actually training.

Negative and refusal examples

If the model needs to decline certain requests or stay in scope, include explicit examples of those refusals. Without them, fine-tuning on positive examples alone makes the model more eager and less cautious — a common regression.

Size targets by task

Style or format adaptation: 300–1,000 high-quality examples is usually enough. Diminishing returns after ~2,000.

Domain language adaptation: 5,000–20,000 examples covering the target vocabulary in real-usage context.

Image LoRA (FLUX, SDXL): 15–60 high-quality, well-captioned images for a subject or style. More isn't better past that point — overfitting starts dominating.

Hyperparameters That Actually Matter

Most LoRA hyperparameters can be left at defaults. A handful are load-bearing:

Rank (r) and alpha

Rank controls capacity. Start at r=16 for most LLM adaptation. Go to 32 or 64 for harder behavioural shifts. Higher ranks aren't free — they overfit small datasets faster. Set alpha = 2 × r as a default; adjust scaling after the first run if outputs feel too weak or too strong.

Learning rate

1e-4 to 3e-4 for most LLM LoRA training. Image LoRA on FLUX or SDXL runs lower, around 5e-5 to 1e-4. Higher rates train faster and overfit faster. If your loss curve is jagged, drop the LR by 2x.

Target modules

For LLMs, applying LoRA to all linear layers (attention + MLP) outperforms attention-only for most behavioural tasks, at modest cost. The PEFT library default of attention-only is conservative — override it for serious work.

Epochs

2–4 epochs for most adaptation tasks. Watch the eval loss curve. If training loss keeps falling while eval loss rises, you're memorising — stop earlier. Image LoRA training typically needs more steps but the same overfitting watch applies.
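Pulled together, a starting configuration with the PEFT library might look like the following. The module names assume a Llama-style architecture; check your base model's layer names before copying them:

```python
from peft import LoraConfig

lora_config = LoraConfig(
    r=16,                    # rank: start here, raise to 32-64 for harder shifts
    lora_alpha=32,           # alpha = 2 x r as a default scaling
    lora_dropout=0.05,
    target_modules=[         # all linear layers, not attention-only
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    task_type="CAUSAL_LM",
)
# Pair with a learning rate around 2e-4 and 2-4 epochs in your trainer,
# and watch the eval loss curve for the divergence described above.
```

Treat this as a first run to iterate from, not a recipe: the rank, alpha, and module list are the knobs this section argues are load-bearing.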

Evaluation: Knowing It Actually Worked

Loss curves tell you the model fit your training data. They don't tell you the model is better at the task. Real evaluation has three layers:

Held-out test set

Split 10–20% of your dataset out before training. Same prompt format, same domain, examples the model has never seen. Score it with whatever metric matches the task — exact match for structured outputs, BLEU/ROUGE for generation, human rating for anything subjective.
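A minimal split-and-score sketch for the structured-output case (exact match; swap the metric function for BLEU/ROUGE or human rating as the task demands):

```python
import random

def split_dataset(examples, test_fraction=0.15, seed=42):
    """Shuffle and carve off a held-out test set before training."""
    rng = random.Random(seed)
    shuffled = examples[:]
    rng.shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]  # (train, test)

def exact_match(predictions, references):
    """Fraction of predictions that match the reference exactly."""
    hits = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return hits / len(references)

train, test = split_dataset([{"q": f"q{i}", "a": f"a{i}"} for i in range(100)])
print(len(train), len(test))                          # 85 15
print(exact_match(["yes", "no"], ["yes", "maybe"]))   # 0.5
```

The fixed seed matters: the same split must be reproducible across runs, or your eval numbers aren't comparable between adapter versions.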

Out-of-distribution probes

Write 30–50 prompts that test the capability you wanted to teach, in phrasings that don't appear in your dataset. This is where memorisation gets exposed. A model that scores well on the held-out set but fails OOD probes hasn't learned the pattern.

Capability regression checks

Run a small fixed set of general benchmarks before and after training — basic reasoning, instruction following, refusal of unsafe prompts. Aggressive LoRA training degrades general capabilities even when the target task improves. You want to catch that trade-off explicitly.
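The comparison itself can be a few lines. A sketch with illustrative benchmark names and scores; the tolerance is a judgment call you should set per project:

```python
def regression_report(before, after, tolerance=0.02):
    """Compare benchmark scores before and after fine-tuning.

    Flags any benchmark whose score dropped by more than `tolerance`.
    Scores are fractions in [0, 1]; benchmark names are illustrative.
    """
    return {name: round(after[name] - before[name], 3)
            for name in before
            if before[name] - after.get(name, 0.0) > tolerance}

before = {"reasoning": 0.71, "instruction_following": 0.88, "safe_refusals": 0.95}
after  = {"reasoning": 0.70, "instruction_following": 0.81, "safe_refusals": 0.94}
print(regression_report(before, after))  # {'instruction_following': -0.07}
```

A non-empty report doesn't automatically kill the run, but it forces the trade-off into the open where someone has to sign off on it.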

Deployment Patterns

How you serve a LoRA depends on whether you need one adapter or many:

Merged weights for single-tenant serving

If one team uses one adapter, merge the LoRA into the base model and serve the merged weights. No runtime overhead, no adapter-loading complexity. The model file is large but you ship it once.

Hot-swappable adapters for multi-tenant

vLLM, TGI, and Ollama all support runtime LoRA loading. One base model on the GPU, multiple adapters loaded per request. This is how you serve per-customer or per-use-case fine-tunes from a single GPU pool.

GGUF + Ollama for edge or air-gapped

Merge the LoRA, convert to GGUF, ship as a single Ollama model. Works offline, runs on consumer GPUs, and integrates with the rest of a self-hosted stack with no extra infrastructure.

The Common Pitfalls

Catastrophic forgetting on instruction following

Pair a high learning rate with too many epochs on a narrow dataset and the model forgets how to follow general instructions. Mitigation: include a small percentage of generic instruction-following examples in the training mix.
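The mixing step is a one-liner worth getting right: the fraction should be a share of the final mix, not of the task set. A sketch with illustrative data:

```python
import random

def mix_datasets(task_examples, generic_examples, generic_fraction=0.1, seed=0):
    """Blend generic instruction-following examples into a narrow task dataset.

    generic_fraction is the share of the *final* mix drawn from the generic
    pool; 5-15% is a reasonable starting range.
    """
    rng = random.Random(seed)
    n_generic = int(len(task_examples) * generic_fraction / (1 - generic_fraction))
    mixed = task_examples + rng.sample(generic_examples,
                                       min(n_generic, len(generic_examples)))
    rng.shuffle(mixed)
    return mixed

mixed = mix_datasets([f"task{i}" for i in range(900)],
                     [f"generic{i}" for i in range(500)])
print(len(mixed))  # 1000: 900 task examples plus 100 generic ones
```

Shuffling matters as much as the ratio: a block of generic examples at the end of an epoch behaves differently from the same examples interleaved throughout.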

Training on the wrong base model

Fine-tuning a base model when you needed an instruct-tuned model (or vice versa) is a multi-day mistake. Confirm the model lineage before training, especially when chaining LoRAs.

Dataset contamination from synthetic data loops

Generating training data with a larger model and fine-tuning a smaller one is a valid technique, but if the generator and the evaluator are the same model family, your eval scores are optimistic. Use independent evaluation models.

Skipping the version-control discipline

Treat every LoRA training run as a reproducible artifact: pinned base model checksum, dataset hash, hyperparameter config in version control, training logs archived. Without that, you can't answer "what changed between v3 and v4" three months later.
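A manifest like this costs a few lines with the standard library. The file name and config keys are illustrative, and the base model string is a placeholder you'd pin to an exact revision:

```python
import hashlib
import json

def run_manifest(config, dataset_path):
    """Build a reproducibility record for one training run.

    Hashes the dataset file and embeds the full hyperparameter config, so
    "what changed between v3 and v4" is answerable from two manifests.
    """
    h = hashlib.sha256()
    with open(dataset_path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):  # 1 MiB chunks
            h.update(chunk)
    return {"dataset_sha256": h.hexdigest(), "config": config}

# Illustrative dataset file and config.
with open("train.jsonl", "w") as f:
    f.write('{"prompt": "p", "completion": "c"}\n')

config = {"base_model": "meta-llama/Llama-3.1-8B-Instruct",  # pin revision too
          "r": 16, "lora_alpha": 32, "learning_rate": 2e-4, "epochs": 3}
manifest = run_manifest(config, "train.jsonl")
print(json.dumps(manifest, indent=2))
```

Commit the manifest alongside the adapter weights; diffing two manifests is the three-months-later answer the paragraph above asks for.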

Where to Start

For LLM LoRA, the PEFT library on top of Hugging Face Transformers is the path of least resistance. Axolotl wraps it with sensible config-driven defaults if you'd rather not write training code from scratch. For image LoRA, kohya-ss tooling around FLUX and SDXL is the production standard.

Run your first training job small: a few hundred examples, low rank, 2 epochs. Look at the outputs. Iterate on the dataset before you iterate on hyperparameters — that's where the real wins live.

Related reading: self-hosted AI vs cloud APIs, LLM integration for business systems, and why most AI projects fail before production.

Rogue AI • Production Systems