LoRA Fine-Tuning: A Practical Guide to Training Custom Models
Most LoRA tutorials show you how to run a training script. That's the easy part. The hard parts are upstream and downstream: knowing whether LoRA is the right tool for your problem at all, building a dataset that actually teaches the model what you want, and verifying afterwards that the model learned the pattern rather than memorising the noise. This guide covers the full loop, from problem framing to deployment, based on production LoRA training across both language and image models.
What LoRA Actually Does
Low-Rank Adaptation freezes the base model's weights and trains a small pair of low-rank matrices that get added to specific layers (usually attention projections, sometimes MLPs too). Instead of updating a full d × d weight matrix, you train two smaller matrices of shape d × r and r × d, where r is typically 8 to 64. Their product, scaled by a factor alpha / r, is added to the frozen weights at inference time.
The practical consequences: training memory drops by 10x or more, the adapter file is megabytes instead of gigabytes, and you can hot-swap adapters on the same base model. The cost: LoRA can shift behaviour but can't reliably teach the model facts it didn't already know. That distinction is the single most important thing to internalise before you start.
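The parameter arithmetic behind those savings is easy to verify. A minimal NumPy sketch (dimensions chosen for illustration, not taken from any particular model):

```python
import numpy as np

d, r = 4096, 16          # hidden size of a typical 7B-class model; LoRA rank

rng = np.random.default_rng(0)
W = rng.standard_normal((d, d))          # frozen base weight (never updated)
A = rng.standard_normal((r, d)) * 0.01   # trainable, shape r x d
B = np.zeros((d, r))                     # trainable, shape d x r; zero init
                                         # makes the adapter start as a no-op
alpha = 32                               # scaling hyperparameter, often 2 * r

# Effective weight at inference: frozen W plus the scaled low-rank product
W_eff = W + (alpha / r) * (B @ A)

full_params = d * d                      # what full fine-tuning would train
lora_params = d * r + r * d              # what LoRA trains
print(f"trainable: {lora_params:,} vs {full_params:,} "
      f"({100 * lora_params / full_params:.2f}%)")
```

At rank 16 on a 4096-wide layer, that is under 1% of the parameters of the full matrix, which is where the memory and file-size savings come from.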
LoRA vs RAG: Pick the Right Tool
Half the LoRA projects I've seen should have been RAG projects. The decision is straightforward once you frame it correctly:
Use RAG when the answer lives in documents
Customer support knowledge bases, product docs, contract clauses, internal wikis. The model needs to look things up, not change how it reasons. Adding a new product line shouldn't require retraining anything.
Use LoRA when you need to change behaviour or style
Always responding in a specific format, adopting a domain tone, following an unusual instruction pattern, or producing a specific visual style for image models. The model already knows the material; you're teaching it how to present it.
Use both when both apply
A LoRA-fine-tuned model that responds in your house style, with RAG providing the factual grounding. The LoRA shapes the voice; RAG keeps it accurate. This is the most common production pattern.
Where LoRA fails: teaching new facts
If you fine-tune a model on a thousand examples of "Q: When did we launch product X? A: March 2025", you'll get correct answers on those exact phrasings and unreliable answers on every other phrasing of the same question. Use RAG.
Dataset Preparation: Where Most Projects Fail
Dataset quality dominates everything else. A 500-example clean dataset will outperform a 10,000-example noisy one. The patterns that actually matter:
Format consistency
Pick one prompt template and use it for every example. If half your examples have a system message and half don't, the model learns that the system message is optional context rather than a controlling signal. For chat models, follow the base model's expected conversation format exactly — wrong special tokens silently degrade results.
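A consistency check like this is cheap to run before training. A sketch, assuming an OpenAI-style "messages"/"role" schema (adapt the field names to whatever your dataset actually uses):

```python
def check_format_consistency(dataset):
    """Flag examples whose role sequence deviates from the first example.
    Catches the half-with-system-message, half-without failure mode."""
    problems = []
    reference_roles = [m["role"] for m in dataset[0]["messages"]]
    for i, example in enumerate(dataset):
        roles = [m["role"] for m in example["messages"]]
        if roles != reference_roles:
            problems.append((i, f"role sequence {roles} != {reference_roles}"))
    return problems

data = [
    {"messages": [{"role": "system", "content": "You are a support bot."},
                  {"role": "user", "content": "Where is my order?"},
                  {"role": "assistant", "content": "Let me check that."}]},
    {"messages": [{"role": "user", "content": "Refund please."},  # no system msg
                  {"role": "assistant", "content": "Of course."}]},
]
print(check_format_consistency(data))  # flags example 1
```

For chat models, run the tokenizer's own chat template over a few examples and eyeball the rendered special tokens as well; role-sequence checks won't catch a wrong template.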
Diversity of input, consistency of output style
You want the model to recognise the pattern across many phrasings. Generate or collect inputs from real users, paraphrase aggressively, and include edge cases. Outputs should be consistent in structure and voice — that's the behaviour you're actually training.
Negative and refusal examples
If the model needs to decline certain requests or stay in scope, include explicit examples of those refusals. Without them, fine-tuning on positive examples alone makes the model more eager and less cautious — a common regression.
Size targets by task
Style or format adaptation: 300–1,000 high-quality examples is usually enough. Diminishing returns after ~2,000.
Domain language adaptation: 5,000–20,000 examples covering the target vocabulary in real-usage context.
Image LoRA (FLUX, SDXL): 15–60 high-quality, well-captioned images for a subject or style. More isn't better past that point — overfitting starts dominating.
Hyperparameters That Actually Matter
Most LoRA hyperparameters can be left at defaults. A handful are load-bearing:
Rank (r) and alpha
Rank controls capacity. Start at r=16 for most LLM adaptation. Go to 32 or 64 for harder behavioural shifts. Higher ranks aren't free — they overfit small datasets faster. Set alpha = 2 × r as a default; adjust scaling after the first run if outputs feel too weak or too strong.
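With the PEFT library, both knobs live on LoraConfig. A starting-point config along the lines above (module coverage and task type assume a causal LM; names vary by architecture):

```python
from peft import LoraConfig

config = LoraConfig(
    r=16,                         # rank: adapter capacity
    lora_alpha=32,                # alpha = 2 * r as a starting default
    lora_dropout=0.05,
    target_modules="all-linear",  # attention + MLP, not the conservative default
    task_type="CAUSAL_LM",
)
```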
Learning rate
1e-4 to 3e-4 for most LLM LoRA training. Image LoRA on FLUX or SDXL runs lower, around 5e-5 to 1e-4. Higher rates train faster and overfit faster. If your loss curve is jagged, drop the LR by 2x.
Target modules
For LLMs, applying LoRA to all linear layers (attention + MLP) outperforms attention-only for most behavioural tasks, at modest cost. The PEFT library default of attention-only is conservative — override it for serious work.
Epochs
2–4 epochs for most adaptation tasks. Watch the eval loss curve. If training loss keeps falling while eval loss rises, you're memorising — stop earlier. Image LoRA training typically needs more steps but the same overfitting watch applies.
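The stop-when-eval-rises rule can be mechanised from the per-epoch eval losses. A minimal sketch (the loss values are made up for illustration):

```python
def best_stopping_point(eval_losses, patience=1):
    """Return the epoch index of the lowest eval loss, scanning until eval
    loss has risen for more than `patience` consecutive epochs."""
    best_idx, best_loss, worse_streak = 0, eval_losses[0], 0
    for i, loss in enumerate(eval_losses[1:], start=1):
        if loss < best_loss:
            best_idx, best_loss, worse_streak = i, loss, 0
        else:
            worse_streak += 1
            if worse_streak > patience:
                break
    return best_idx

# Training loss keeps falling; eval loss turns around after epoch 3 --
# the classic memorisation signature.
train_losses = [2.1, 1.4, 0.9, 0.6, 0.4, 0.3]
eval_losses = [2.0, 1.5, 1.2, 1.1, 1.3, 1.6]
print(best_stopping_point(eval_losses))  # -> 3
```

In practice you'd save a checkpoint every epoch and keep the one this picks, rather than retraining.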
Evaluation: Knowing It Actually Worked
Loss curves tell you the model fit your training data. They don't tell you the model is better at the task. Real evaluation has three layers:
Held-out test set
Split 10–20% of your dataset out before training. Same prompt format, same domain, examples the model has never seen. Score it with whatever metric matches the task — exact match for structured outputs, BLEU/ROUGE for generation, human rating for anything subjective.
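The split should be deterministic, so that re-running the pipeline never leaks test examples into training. One way is to hash each example rather than shuffle (the "prompt" field name is illustrative):

```python
import hashlib

def split_dataset(examples, key=lambda ex: ex["prompt"], test_fraction=0.15):
    """Deterministic train/test split: hash each example's prompt so the
    same example always lands in the same bucket across runs."""
    train, test = [], []
    for ex in examples:
        digest = hashlib.sha256(key(ex).encode("utf-8")).digest()
        bucket = digest[0] / 255.0          # pseudo-uniform in [0, 1]
        (test if bucket < test_fraction else train).append(ex)
    return train, test

examples = [{"prompt": f"question {i}", "completion": f"answer {i}"}
            for i in range(1000)]
train, test = split_dataset(examples)
print(len(train), len(test))  # roughly an 85/15 split
```

If near-duplicate prompts exist, hash a normalised form so duplicates land in the same bucket instead of straddling the split.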
Out-of-distribution probes
Write 30–50 prompts that test the capability you wanted to teach, in phrasings that don't appear in your dataset. This is where memorisation gets exposed. A model that scores well on the held-out set but fails OOD probes hasn't learned the pattern.
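A probe harness doesn't need to be elaborate. A sketch for a format-teaching LoRA, where the check is a pattern match on the output (the generate function here is a stub standing in for your real model call):

```python
import re

def run_probes(generate, probes, output_pattern):
    """Run each OOD probe through the model and check the output against
    the format the LoRA was supposed to teach."""
    failures = []
    for prompt in probes:
        output = generate(prompt)
        if not re.search(output_pattern, output):
            failures.append((prompt, output))
    return len(probes) - len(failures), failures

probes = [
    "Summarise this ticket as JSON: printer offline again.",
    "Give me the JSON summary for: login loop on mobile.",
]
# Stub that always emits the trained format -- replace with a real model call.
fake_generate = lambda p: '{"category": "hardware", "urgency": "low"}'
passed, failures = run_probes(fake_generate, probes, r'^\{.*\}$')
print(f"{passed}/{len(probes)} probes passed")
```

For subjective tasks, swap the regex check for a rubric scored by a human or an independent judge model.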
Capability regression checks
Run a small fixed set of general benchmarks before and after training — basic reasoning, instruction following, refusal of unsafe prompts. Aggressive LoRA training degrades general capabilities even when the target task improves. You want to catch that trade-off explicitly.
Deployment Patterns
How you serve a LoRA depends on whether you need one adapter or many:
Merged weights for single-tenant serving
If one team uses one adapter, merge the LoRA into the base model and serve the merged weights. No runtime overhead, no adapter-loading complexity. The model file is large but you ship it once.
Hot-swappable adapters for multi-tenant
vLLM, TGI, and Ollama all support runtime LoRA loading. One base model on the GPU, multiple adapters loaded per request. This is how you serve per-customer or per-use-case fine-tunes from a single GPU pool.
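With vLLM this is a pair of flags at launch; the adapter is then selected per request by name. A sketch (model, adapter names, and paths are illustrative; check your vLLM version's docs for exact flags):

```shell
# One base model on the GPU, two adapters registered at startup
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --enable-lora \
  --lora-modules support=/adapters/support billing=/adapters/billing

# Each request picks an adapter via the OpenAI-compatible "model" field
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "support", "prompt": "Summarise this ticket:", "max_tokens": 64}'
```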
GGUF + Ollama for edge or air-gapped
Merge the LoRA, convert to GGUF, ship as a single Ollama model. Works offline, runs on consumer GPUs, and integrates with the rest of a self-hosted stack with no extra infrastructure.
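The pipeline is merge, convert, quantise, package. A sketch using llama.cpp's conversion tooling (script and binary names follow its current layout; verify against your checkout, and the filenames are illustrative):

```shell
# Convert the merged Hugging Face model to GGUF, then quantise
python convert_hf_to_gguf.py ./merged-model --outfile model-f16.gguf
./llama-quantize model-f16.gguf model-q4_k_m.gguf Q4_K_M

# Minimal Ollama Modelfile -- a single line:
#   FROM ./model-q4_k_m.gguf
ollama create mymodel -f Modelfile
ollama run mymodel
```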
The Common Pitfalls
Catastrophic forgetting on instruction following
Combine a high learning rate with too many epochs on a narrow dataset and the model forgets how to follow general instructions. Mitigation: include a small percentage of generic instruction-following examples in the training mix.
Training on the wrong base model
Fine-tuning a base model when you needed an instruct-tuned model (or vice versa) is a multi-day mistake. Confirm the model lineage before training, especially when chaining LoRAs.
Dataset contamination from synthetic data loops
Generating training data with a larger model and fine-tuning a smaller one is a valid technique, but if the generator and the evaluator are the same model family, your eval scores are optimistic. Use independent evaluation models.
Skipping the version-control discipline
Treat every LoRA training run as a reproducible artifact: pinned base model checksum, dataset hash, hyperparameter config in version control, training logs archived. Without that, you can't answer "what changed between v3 and v4" three months later.
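A run manifest can be a few lines of stdlib Python. A sketch (the field names are illustrative; record whatever identifies your runs):

```python
import hashlib
import json
import os
import tempfile

def run_manifest(config, dataset_path):
    """Fingerprint a training run: base model, dataset hash, and
    hyperparameters, plus a short run_id derived from all three."""
    with open(dataset_path, "rb") as f:
        dataset_hash = hashlib.sha256(f.read()).hexdigest()
    manifest = {
        "base_model": config["base_model"],
        "dataset_sha256": dataset_hash,
        "hyperparameters": config["hyperparameters"],
    }
    # Hash the manifest itself so two runs compare with a single string
    blob = json.dumps(manifest, sort_keys=True).encode("utf-8")
    manifest["run_id"] = hashlib.sha256(blob).hexdigest()[:12]
    return manifest

with tempfile.NamedTemporaryFile(delete=False, suffix=".jsonl") as f:
    f.write(b'{"prompt": "hi", "completion": "hello"}\n')
    path = f.name
cfg = {"base_model": "meta-llama/Llama-3.1-8B-Instruct",
       "hyperparameters": {"r": 16, "lora_alpha": 32, "lr": 2e-4, "epochs": 3}}
m = run_manifest(cfg, path)
print(m["run_id"])
os.remove(path)
```

Store the manifest next to the adapter weights and the training logs; "what changed between v3 and v4" becomes a diff of two small JSON files.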
Where to Start
For LLM LoRA, the PEFT library on top of Hugging Face Transformers is the path of least resistance. Axolotl wraps it with sensible config-driven defaults if you'd rather not write training code from scratch. For image LoRA, kohya-ss tooling around FLUX and SDXL is the production standard.
Run your first training job small: a few hundred examples, low rank, 2 epochs. Look at the outputs. Iterate on the dataset before you iterate on hyperparameters — that's where the real wins live.
Related reading: self-hosted AI vs cloud APIs, LLM integration for business systems, and why most AI projects fail before production.