Model Training

LLM Fine-Tuning Pipeline

End-to-end custom model training, delivered as Docker

Built by Rogue AI · Engineered end-to-end · Production since 2026

First clean docker-compose run end-to-end: late January 2026. Fifty-plus commits across training, conversion, and benchmark layers since. Most recent iteration: April 2026.

The problem

Off-the-shelf LLMs don't know your domain. Cloud fine-tuning APIs are expensive and slow, and they require sending proprietary training data to a third party. Most open-source fine-tuning recipes are notebook demos that fall over in production.

What I built

A containerized fine-tuning pipeline that runs on a single commodity GPU. It ingests JSONL training data, runs QLoRA training against a configurable base model, merges the adapter weights into the base, exports to GGUF for Ollama, and runs a benchmark harness, all from a single docker compose up.
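A minimal sketch of what the ingestion step might look like: validate each JSONL row, drop near-verbatim duplicates, and hold out an eval split. The field names in REQUIRED_FIELDS, the normalization rule, and the function names are assumptions for illustration; the write-up does not specify the schema the pipeline enforces.

```python
import hashlib
import json
import random

# Hypothetical schema: the actual required fields are not stated in the write-up.
REQUIRED_FIELDS = ("instruction", "output")


def load_rows(lines):
    """Parse JSONL lines, keeping only objects whose required fields are non-empty strings."""
    rows = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            row = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip malformed lines instead of aborting the whole run
        if not isinstance(row, dict):
            continue
        if all(isinstance(row.get(f), str) and row[f].strip() for f in REQUIRED_FIELDS):
            rows.append(row)
    return rows


def deduplicate(rows):
    """Drop rows that are identical after lowercasing and collapsing whitespace."""
    seen, unique = set(), []
    for row in rows:
        key_text = "\x1f".join(" ".join(row[f].lower().split()) for f in REQUIRED_FIELDS)
        digest = hashlib.sha256(key_text.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(row)
    return unique


def split_train_eval(rows, eval_fraction=0.1, seed=0):
    """Shuffle with a fixed seed and hold out a fraction of rows for evaluation."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_eval = max(1, int(len(shuffled) * eval_fraction)) if shuffled else 0
    return shuffled[n_eval:], shuffled[:n_eval]
```

Typical use would be to open the JSONL file, pass the file object to load_rows, then chain deduplicate and split_train_eval. Hashing a normalized key only catches exact and whitespace/case duplicates; fuzzy near-duplicates would need a different technique.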

Architecture

  • Dataset loader: validates JSONL schema, deduplicates, splits train/eval

  • QLoRA trainer: PEFT + bitsandbytes 4-bit quantization, configurable rank/alpha/target-modules

  • Checkpoint merger: merges adapter into base weights, saves HF-format model

  • GGUF exporter: llama.cpp conversion with configurable quantization (Q4_K_M / Q5_K_M / Q8_0)

  • Ollama registrar: generates Modelfile, pushes to local Ollama instance

  • Benchmark harness: perplexity + task-specific evals against held-out test set
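For reference, the perplexity reported by the benchmark stage is the exponential of the mean negative log-likelihood per token. The sketch below assumes the per-token natural-log probabilities have already been extracted from the model for each held-out example; how the harness obtains them is not described in the write-up, and the function names are hypothetical.

```python
import math


def perplexity(token_logprobs):
    """Perplexity = exp(-(mean natural-log probability per token))."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(mean_nll)


def corpus_perplexity(per_example_logprobs):
    """Token-weighted perplexity over a test set.

    Flattening before averaging weights every token equally, so long
    examples count for more than short ones.
    """
    flat = [lp for example in per_example_logprobs for lp in example]
    return perplexity(flat)
```

Averaging per-example perplexities instead would give a different number, so it is worth fixing one convention before comparing quantization levels or checkpoints.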

Tech stack

Python · LlamaFactory · llama.cpp · Ollama · MLflow · RunPod · Docker

What broke first

  • Dataset quality dwarfs everything. Spent two weekends tuning rank/alpha before admitting the JSONL was the problem — 1,200 cleaned rows beat 8,000 dirty ones.

  • Q4_K_M is a trap for technical text. The model started hallucinating CLI flags. Default quantization is now Q5_K_M — slower inference, less lying.

  • Rank > 64 on a 13B base OOMs a 24 GB card no matter how clever your bitsandbytes config is. Page-table thrashing masquerades as 'just slow' until it isn't.
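To put the rank point in numbers: each adapted linear layer of shape (d_in, d_out) gains rank × (d_in + d_out) trainable weights, so adapter size grows linearly with rank. The sketch below assumes Llama-2-13B-style dimensions (hidden size 5120, MLP size 13824, 40 layers) and adapters on every attention and MLP projection; the write-up names neither the base model nor the target modules, so treat both as assumptions. It counts adapter parameters only and says nothing about activations or the quantized base weights, which also compete for the same 24 GB.

```python
def lora_params(rank, shapes, n_layers):
    """Trainable LoRA weights: rank * (d_in + d_out) per adapted linear layer."""
    per_layer = sum(rank * (d_in + d_out) for d_in, d_out in shapes)
    return per_layer * n_layers


# Assumed Llama-2-13B-style dimensions; not stated in the write-up.
HIDDEN, INTERMEDIATE, LAYERS = 5120, 13824, 40

# Assumed target modules: all attention and MLP projections in each block.
ALL_LINEAR = (
    [(HIDDEN, HIDDEN)] * 4            # q, k, v, o projections
    + [(HIDDEN, INTERMEDIATE)] * 2    # gate and up projections
    + [(INTERMEDIATE, HIDDEN)]        # down projection
)

for rank in (16, 64, 128):
    n = lora_params(rank, ALL_LINEAR, LAYERS)
    print(f"rank={rank:>3}: {n / 1e6:.0f}M trainable adapter parameters")
```

Gradients and optimizer state scale with the same count, so doubling the rank roughly doubles the adapter-side memory on top of whatever the base model and activations already use.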

Outcome

Trained custom models on domain-specific corpora without sending data to third-party APIs. Inference served locally via Ollama on the same host. Replaces recurring fine-tuning spend with a one-time training run.

Honest limits

There is no eval harness for tool use or function calling yet, only perplexity and held-out task accuracy. The Modelfile generator hardcodes the system prompt; multi-tenant use would need that templated. RunPod is still cheaper for experiment runs, so the pipeline can target either RunPod or the local GPU, but production uses the local path.
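One possible shape for a templated Modelfile generator, as a sketch rather than the pipeline's actual code: the system prompt and sampling parameters become arguments instead of constants. The function name, the GGUF path, and the example prompt are hypothetical. FROM, PARAMETER, and SYSTEM are standard Ollama Modelfile instructions.

```python
def render_modelfile(gguf_path, system_prompt, parameters=None):
    """Render an Ollama Modelfile with a caller-supplied system prompt."""
    if '"""' in system_prompt:
        # A triple quote inside the prompt would end the SYSTEM block early.
        raise ValueError("system prompt must not contain triple quotes")
    lines = [f"FROM {gguf_path}"]
    for name, value in (parameters or {}).items():
        lines.append(f"PARAMETER {name} {value}")
    lines.append(f'SYSTEM """{system_prompt}"""')
    return "\n".join(lines) + "\n"


# Hypothetical per-tenant usage: one Modelfile per prompt, same weights.
modelfile_text = render_modelfile(
    "./model.gguf",
    "You answer questions about the Acme billing system.",
    {"temperature": 0.2},
)
```

Writing modelfile_text to disk and registering it under a tenant-specific model name would let several prompts share one exported GGUF file.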
