LLM Fine-Tuning Pipeline

End-to-end custom model training, delivered as Docker

Built by Rogue AI · Engineered end-to-end · Production since 2026

First clean docker-compose run end-to-end: late January 2026. Fifty-plus commits across training, conversion, and benchmark layers since. Most recent iteration: April 2026.

The problem

Off-the-shelf LLMs don't know your domain. Cloud fine-tuning APIs are expensive, slow, and require handing proprietary training data to a third party. Most open-source fine-tuning recipes are notebook demos that fall over in production.

What I built

A containerized fine-tuning pipeline that runs on a single commodity GPU. Ingests JSONL training data, runs QLoRA training with a configurable base model, merges adapter weights, exports to GGUF for Ollama, and runs a benchmark harness, all from one docker compose up.

Architecture

Dataset loader

Validates JSONL schema, deduplicates, splits train/eval
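
The three loader steps can be sketched in plain Python. This is a minimal illustration, not the pipeline's actual code: the `instruction`/`output` field names, the skip-on-error policy, and the 10% eval fraction are all assumptions.

```python
import hashlib
import json
import random

# Hypothetical schema: every row needs these fields as non-empty strings.
REQUIRED_FIELDS = ("instruction", "output")


def load_rows(lines):
    """Parse JSONL lines, keeping only rows that satisfy the assumed schema."""
    rows = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            row = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip malformed lines instead of aborting the run
        if isinstance(row, dict) and all(
            isinstance(row.get(f), str) and row[f].strip() for f in REQUIRED_FIELDS
        ):
            rows.append(row)
    return rows


def dedupe(rows):
    """Drop duplicates, comparing whitespace-normalized, lowercased field text."""
    seen, unique = set(), []
    for row in rows:
        key_text = "\x1f".join(
            " ".join(row[f].split()).lower() for f in REQUIRED_FIELDS
        )
        digest = hashlib.sha256(key_text.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(row)
    return unique


def split(rows, eval_fraction=0.1, seed=0):
    """Shuffle deterministically, then hold out a fraction for evaluation."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_eval = max(1, int(len(shuffled) * eval_fraction)) if shuffled else 0
    return shuffled[n_eval:], shuffled[:n_eval]
```

A fixed seed keeps the train/eval split reproducible across runs, which matters when comparing benchmark numbers between training configurations.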

QLoRA trainer

PEFT + bitsandbytes 4-bit quantization, configurable rank/alpha/target-modules
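
To make the rank/alpha/target-modules knobs concrete, the sketch below counts the adapter parameters they imply. It relies on the standard LoRA facts that each targeted layer of shape (d_in, d_out) adds rank × (d_in + d_out) parameters and that the update is scaled by alpha / rank. The `LoraSettings` class and the 7B layer shapes are illustrative assumptions, not the pipeline's config format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LoraSettings:
    """Hypothetical container for the three configurable knobs."""
    rank: int
    alpha: int
    target_modules: tuple  # e.g. ("q_proj", "v_proj")

    @property
    def scaling(self):
        # LoRA scales the low-rank update by alpha / rank.
        return self.alpha / self.rank


def trainable_params(settings, module_shapes, num_layers):
    """Each targeted (d_in, d_out) layer adds rank * (d_in + d_out) parameters."""
    per_layer = sum(
        settings.rank * (d_in + d_out)
        for name, (d_in, d_out) in module_shapes.items()
        if name in settings.target_modules
    )
    return per_layer * num_layers


# Assumed shapes for a Llama-style 7B model: hidden size 4096, 32 layers.
SHAPES_7B = {
    "q_proj": (4096, 4096),
    "k_proj": (4096, 4096),
    "v_proj": (4096, 4096),
    "o_proj": (4096, 4096),
}

cfg = LoraSettings(rank=16, alpha=32, target_modules=("q_proj", "v_proj"))
print(cfg.scaling, trainable_params(cfg, SHAPES_7B, num_layers=32))
# prints: 2.0 8388608
```

Adapter size grows linearly with rank and with the number of targeted modules, so both drive memory use on top of the 4-bit base weights.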

Checkpoint merger

Merges adapter into base weights, saves HF-format model
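
Mathematically, merging folds the low-rank update into each base weight matrix: W' = W + (alpha / rank) · B·A. The toy sketch below shows that arithmetic on nested lists. In PEFT the equivalent operation is `merge_and_unload()` on a loaded adapter model; whether the pipeline calls it directly is not stated here.

```python
def matmul(a, b):
    """Multiply two matrices given as nested lists."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def merge_lora(weight, lora_b, lora_a, alpha, rank):
    """Return W + (alpha / rank) * (B @ A) without mutating the inputs."""
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]


# Toy example: 2x2 base weight with a rank-1 adapter.
merged = merge_lora(
    weight=[[1, 0], [0, 1]],
    lora_b=[[1], [2]],   # shape (d_out, rank)
    lora_a=[[3, 4]],     # shape (rank, d_in)
    alpha=2,
    rank=1,
)
print(merged)  # [[7.0, 8.0], [12.0, 17.0]]
```

After the merge the adapter is no longer needed at inference time, which is what allows the result to be saved as an ordinary HF-format model.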

GGUF exporter

llama.cpp conversion with configurable quantization (Q4_K_M / Q5_K_M / Q8_0)
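
The export step typically runs two llama.cpp tools in sequence: `convert_hf_to_gguf.py` to produce an f16 GGUF, then `llama-quantize` to reduce it to the chosen type. The sketch below only builds the two command lines. File names, directory layout, and `llama-quantize` being on `PATH` are assumptions.

```python
from pathlib import Path

# Quantization types named in this pipeline's exporter.
SUPPORTED_QUANTS = ("Q4_K_M", "Q5_K_M", "Q8_0")


def gguf_commands(model_dir, out_dir, quant="Q5_K_M",
                  llama_cpp_dir="llama.cpp", quantize_bin="llama-quantize"):
    """Build (but do not run) the convert and quantize commands."""
    if quant not in SUPPORTED_QUANTS:
        raise ValueError(f"unsupported quantization: {quant}")
    out = Path(out_dir)
    f16_path = out / "model-f16.gguf"          # hypothetical file name
    quant_path = out / f"model-{quant}.gguf"   # hypothetical file name
    convert = [
        "python", str(Path(llama_cpp_dir) / "convert_hf_to_gguf.py"),
        str(model_dir),
        "--outfile", str(f16_path),
        "--outtype", "f16",
    ]
    quantize = [quantize_bin, str(f16_path), str(quant_path), quant]
    return [convert, quantize]


for cmd in gguf_commands("merged-model", "export"):
    print(" ".join(cmd))
```

Each list can be passed to `subprocess.run(cmd, check=True)`; keeping the f16 intermediate makes it cheap to re-quantize to another type without repeating the conversion.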

Ollama registrar

Generates Modelfile, pushes to local Ollama instance
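
A Modelfile is a short text file built from `FROM`, `PARAMETER`, and `SYSTEM` instructions, and `ollama create` registers it with the local instance. The sketch below renders one from arguments. The function names, default temperature, and example prompt are illustrative, and the system prompt is passed in as a parameter here purely for demonstration.

```python
def render_modelfile(gguf_path, system_prompt, temperature=0.2):
    """Render a minimal Ollama Modelfile as a string."""
    if '"""' in system_prompt:
        # Triple quotes delimit the SYSTEM block, so they cannot appear inside it.
        raise ValueError("system prompt must not contain triple quotes")
    return "\n".join([
        f"FROM {gguf_path}",
        f"PARAMETER temperature {temperature}",
        f'SYSTEM """{system_prompt}"""',
        "",
    ])


def create_command(model_name, modelfile_path="Modelfile"):
    """Command that registers the model with a local Ollama instance."""
    return ["ollama", "create", model_name, "-f", modelfile_path]


text = render_modelfile(
    "./model-Q5_K_M.gguf",                      # hypothetical path
    "You answer questions about our internal CLI.",
)
print(text)
print(" ".join(create_command("domain-model")))
```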

Benchmark harness

Perplexity + task-specific evals against held-out test set
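
Perplexity is the exponential of the mean negative log-likelihood per token, so it can be computed from the per-token log-probabilities a model assigns to the held-out text. The sketch below pairs it with a simple exact-match accuracy; the function names and the exact-match criterion are assumptions, since the task-specific evals are not described in detail.

```python
import math


def perplexity(token_logprobs):
    """exp of the mean negative natural-log probability per token."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def exact_match_accuracy(predictions, references):
    """Fraction of predictions equal to their reference after trimming whitespace."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        raise ValueError("need at least one example")
    hits = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return hits / len(references)


# A model that gives every token probability 0.5 has perplexity 2.
print(perplexity([math.log(0.5)] * 4))
print(exact_match_accuracy(["--force ", "--dry-run"], ["--force", "--verbose"]))
```

Lower perplexity means the model finds the held-out text less surprising, but it says nothing about factual correctness, which is why a task-level metric sits alongside it.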

Tech stack

Python · LlamaFactory · llama.cpp · Ollama · MLflow · RunPod · Docker

What broke first

▸ Dataset quality dwarfs everything. Spent two weekends tuning rank/alpha before admitting the JSONL was the problem: 1,200 cleaned rows beat 8,000 dirty ones.
▸ Q4_K_M is a trap for technical text. The model started hallucinating CLI flags. Default quantization is now Q5_K_M: slower inference, less lying.
▸ Rank > 64 on a 13B base OOMs a 24 GB card no matter how clever your bitsandbytes config is. Page-table thrashing masquerades as 'just slow' until it isn't.

Outcome

Trained custom models on domain-specific corpora without sending data to third-party APIs. Inference served locally via Ollama on the same host. Replaces recurring fine-tuning spend with a one-time training run.

Honest limits

No eval harness for tool-use or function-calling yet; the benchmarks cover only perplexity and held-out task accuracy. The Modelfile generator hardcodes the system prompt; multi-tenant use would need that templated. RunPod is still cheaper for experiment runs, so the pipeline runs on both, but production uses the local path.