Also known as: fine-tune, task-specific training, domain adaptation
TL;DR
Fine-tuning is the process of further training a pre-trained model on task-specific or domain-specific data. It's how a generalist becomes a specialist.
Fine-tuning is what turns a general-purpose pre-trained model into a specialist. Pre-training teaches a model to understand language at scale (next-token prediction over trillions of tokens); fine-tuning teaches it to do your specific task with the broad knowledge it already has.
Almost every specialized model in production retrieval is fine-tuned: zerank-2 is a fine-tuned Qwen3-4B; zembed-1 is a bi-encoder fine-tuned via distillation from zerank-2; custom rerankers for legal/medical/financial domains are zerank-2 further fine-tuned on domain data.
The core recipe
Start with a pre-trained base model with strong general language ability (Qwen, Llama, Mistral).
Collect or generate task-specific training data. For retrieval, this is typically (query, document, relevance) triples, often produced by converting zELO-style pairwise LLM judgments into pointwise targets.
Continue training the model on the new data with a smaller learning rate than pre-training. The small learning rate keeps the weights close to the base model, so most of its general knowledge is retained while the task-specific behavior is learned on top.
Validate on held-out task-specific evals; iterate.
That’s the whole shape. Most of the difficulty is in step 2 (data quality) and the choice of step 3 hyperparameters.
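As an illustration of the data step, here is a minimal sketch of turning pairwise judgments into pointwise targets. It fits a Bradley-Terry model by gradient ascent and rescales the scores to [0, 1]. This is an assumed stand-in for the general idea, not the actual zELO pipeline; the function names, hyperparameters, and example judgments are all hypothetical.

```python
import math

def bradley_terry_scores(pairs, n_docs, lr=0.1, epochs=500, l2=0.01):
    """Fit one latent relevance score per document from pairwise wins.

    pairs: list of (winner_index, loser_index) judgments for one query.
    Model: P(i beats j) = sigmoid(score_i - score_j).
    A small L2 penalty keeps scores finite when a document never loses.
    """
    scores = [0.0] * n_docs
    for _ in range(epochs):
        # Gradient of the regularized log-likelihood for each score.
        grads = [-l2 * s for s in scores]
        for winner, loser in pairs:
            p_win = 1.0 / (1.0 + math.exp(scores[loser] - scores[winner]))
            grads[winner] += 1.0 - p_win
            grads[loser] -= 1.0 - p_win
        scores = [s + lr * g for s, g in zip(scores, grads)]
    return scores

def to_pointwise_targets(scores):
    """Min-max rescale scores to [0, 1] so they can serve as regression targets."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.5] * len(scores)
    return [(s - lo) / (hi - lo) for s in scores]

# Hypothetical judgments for one query over three candidate documents:
# document 0 beats 1 twice and 2 once; document 1 beats 2 once.
judgments = [(0, 1), (0, 2), (1, 2), (0, 1)]
targets = to_pointwise_targets(bradley_terry_scores(judgments, n_docs=3))
```

Each (query, document) pair then gets its entry in `targets` as the label a pointwise reranker is trained to predict. Min-max scaling per query is one simple choice among several; a real pipeline might calibrate scores across queries instead.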
Variants you’ll encounter
Full fine-tuning — every parameter is trained. Most expressive, most expensive, biggest risk of catastrophic forgetting (the model loses general capability while gaining the specific one).
LoRA / PEFT — only train tiny adapter matrices, freeze the base model. ~1% of the parameters trained, comparable results, easy to swap adapters in and out for different tasks. The default for most production fine-tunes today.
Instruction tuning — fine-tune on (instruction, response) pairs to teach a base model to follow instructions. The standard step that turns “GPT-base” into “GPT-instruct”.
RLHF / DPO — alignment-focused fine-tuning on pairwise preferences. RLHF uses reinforcement learning with a learned reward model; DPO is a more recent method that achieves similar results with simpler training. Both are about making the model produce preferred outputs, not just plausible ones.
Three differences between LoRA and full fine-tuning matter operationally.
First, swappability: LoRA adapters are typically tens to a few hundred MB on a multi-billion-parameter base, so you can keep dozens of fine-tunes (one per customer, one per task) and swap them in and out at inference time without reloading the base weights. Full fine-tunes can’t share infrastructure that way, because each one is a complete copy of the model.
Second, catastrophic forgetting: full fine-tuning updates every weight, which can erode general capability the base learned during pre-training. LoRA confines task-specific changes to small adapter matrices while the base weights stay frozen, so generalist behavior is largely retained and can always be recovered by removing the adapter.
Third, training cost: LoRA needs far less GPU memory because gradients and optimizer state are stored only for the adapter parameters. The frozen base weights still have to fit in memory, though: a 70B model is about 140 GB in 16-bit precision, so fine-tuning it on a single 80 GB H100 also requires quantizing the base to 4 bits (QLoRA-style). Full fine-tuning of the same model with Adam needs on the order of a terabyte for weights, gradients, and optimizer state, which means one or more multi-GPU nodes.
The result is comparable quality on narrow tasks at a fraction of the engineering and infrastructure cost, which is why almost every production fine-tune in 2025 is LoRA-style.
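The training-cost difference can be made concrete with a back-of-envelope estimate. The sketch below assumes mixed-precision training with Adam, where each trainable parameter costs roughly 16 bytes (16-bit weight, 16-bit gradient, 32-bit master copy, and two 32-bit Adam moments). It ignores activations, quantization overhead, and framework buffers, so the figures are lower bounds. The function names and the 1% adapter fraction are illustrative assumptions.

```python
# Bytes per trainable parameter under mixed-precision Adam (assumed setup):
# 2 (16-bit weight) + 2 (16-bit gradient) + 4 (fp32 master) + 8 (Adam m and v).
BYTES_PER_TRAINABLE_PARAM = 2 + 2 + 4 + 8

def full_finetune_gb(n_params):
    """Weights, gradients, and optimizer state when every parameter is trained."""
    return n_params * BYTES_PER_TRAINABLE_PARAM / 1e9

def lora_gb(n_params, adapter_fraction=0.01, base_bytes_per_param=2.0):
    """Frozen base weights plus full training state for the adapters only."""
    base = n_params * base_bytes_per_param
    adapters = n_params * adapter_fraction * BYTES_PER_TRAINABLE_PARAM
    return (base + adapters) / 1e9

n = 70e9  # a 70B-parameter model
full = full_finetune_gb(n)                             # about 1120 GB
lora_16bit = lora_gb(n)                                # about 151 GB
lora_4bit = lora_gb(n, base_bytes_per_param=0.5)       # about 46 GB
```

Under these assumptions, only the 4-bit variant fits within a single 80 GB accelerator before activations are counted, while full fine-tuning exceeds the combined memory of an eight-GPU node with 80 GB per GPU.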
When fine-tuning is the right answer
You’re doing the same narrow task at high volume (retrieval, reranking, classification, query rewriting).
The task has stable structure — input and output formats don’t drift.
Per-call cost matters; you’d otherwise be making expensive LLM calls for every instance.
You have domain knowledge a generalist model doesn’t (legal terminology, internal product taxonomy, your customers’ query patterns).
When prompting is enough
Low volume, exploratory, or rapidly evolving tasks.
Tasks that benefit from the LLM’s broad reasoning, not narrow specialization.
Where you can’t justify the data-collection and training cost.
More of production AI lives in the first bucket than people typically realize, which is why the long-term shape of inference looks more like a constellation of fine-tuned specialists than one giant LLM doing everything.
Pre-training gives you general intelligence; fine-tuning gives you a useful tool. The production-AI thesis is that “useful tool” usually means a small specialist, and small specialists are almost always built by fine-tuning a frontier base.
Fine-tuned specialist models in production
zerank-2: Qwen3-4B fine-tuned via zELO on pairwise relevance judgments — a reranker
zembed-1: bi-encoder fine-tuned by distilling from zerank-2 — an embedding model
Domain-specific rerankers: zerank-2 further fine-tuned on legal, medical, or financial corpora
Faithfulness checkers: encoder fine-tuned on (claim, context, label) triples
Query rewriters: small decoder LLMs fine-tuned on (raw query, expanded query) pairs
Tool-selection models: classifier-style fine-tunes that pick which tool to invoke from a catalog of dozens
Go further
When does fine-tuning beat prompting?
When the task is high-volume, narrow, and stable — every reranker call, every query rewrite, every classification. Prompting is right for one-off and exploratory uses; fine-tuning is right for the production hot path where per-call cost and quality stability dominate.
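The volume argument can be framed as a break-even calculation: fine-tuning pays off once the per-call saving has covered the one-time cost of data collection and training. The sketch below is a hedged illustration; the function name is hypothetical and every number in the example is a placeholder, not a measured price.

```python
import math

def break_even_calls(fixed_cost, llm_cost_per_call, specialist_cost_per_call):
    """Number of calls after which a fine-tuned specialist is cheaper overall.

    Returns None when the specialist is not cheaper per call, in which case
    the one-time cost is never recovered.
    """
    saving_per_call = llm_cost_per_call - specialist_cost_per_call
    if saving_per_call <= 0:
        return None
    return math.ceil(fixed_cost / saving_per_call)

# Placeholder inputs: 5,000 units of one-time cost, 0.01 per prompted LLM call,
# 0.0005 per call to the fine-tuned specialist.
calls_needed = break_even_calls(5000.0, 0.01, 0.0005)
```

With these placeholder values the break-even point is a little over half a million calls, which a production hot path can reach quickly and an exploratory workflow may never reach.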
What's the difference between fine-tuning, instruction tuning, RLHF, and DPO?
Fine-tuning is the umbrella term — any further training on a pre-trained model. Instruction tuning is fine-tuning on (instruction, response) pairs to make a base model follow directions. RLHF (reinforcement learning from human feedback) and DPO (direct preference optimization) further refine the model on pairwise preference data — they're the alignment step on top of instruction tuning.
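For reference, the DPO objective in its commonly cited form shows why the training is simpler than RLHF: it is a single classification-style loss over preference pairs, with no separately trained reward model. The symbols are assumptions introduced here, not taken from the text above: $\pi_\theta$ is the model being trained, $\pi_{\mathrm{ref}}$ is a frozen reference model (usually the instruction-tuned starting point), $y_w$ and $y_l$ are the preferred and rejected responses to prompt $x$, $\sigma$ is the logistic function, and $\beta$ controls how far the model may drift from the reference.

```latex
\mathcal{L}_{\mathrm{DPO}}(\theta)
  = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}
    \left[
      \log \sigma\!\left(
        \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
        - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
      \right)
    \right]
```

Minimizing this loss raises the likelihood of the preferred response relative to the rejected one, measured against the reference model.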
Do I need full fine-tuning?
Almost never. LoRA (Low-Rank Adaptation) and other parameter-efficient fine-tuning (PEFT) methods train small adapter matrices on top of frozen base weights, typically around 1% of the parameters, with results that are comparable to full fine-tuning on most narrow tasks. The base model stays general; the LoRA adapter encodes your specialization.
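To make the adapter idea concrete, here is a minimal NumPy sketch of a single LoRA-adapted linear layer. It covers only the forward pass and parameter count, not training, and is not the API of any particular library; the class name, rank, and scaling values are illustrative assumptions.

```python
import numpy as np

class LoRALinear:
    """Frozen weight W plus a trainable low-rank update scaled by alpha / rank."""

    def __init__(self, weight, rank=8, alpha=16.0, seed=0):
        rng = np.random.default_rng(seed)
        d_out, d_in = weight.shape
        self.weight = weight                               # frozen base weight
        self.A = rng.normal(0.0, 0.01, size=(rank, d_in))  # trainable
        self.B = np.zeros((d_out, rank))                   # trainable, zero-initialized
        self.scale = alpha / rank

    def forward(self, x):
        # x has shape (batch, d_in). Because B starts at zero, the adapter
        # contributes nothing until training updates it.
        base_out = x @ self.weight.T
        adapter_out = (x @ self.A.T) @ self.B.T
        return base_out + self.scale * adapter_out

    def trainable_fraction(self):
        adapter_params = self.A.size + self.B.size
        return adapter_params / (adapter_params + self.weight.size)

rng = np.random.default_rng(1)
base_weight = rng.normal(size=(1024, 1024))
layer = LoRALinear(base_weight, rank=8)
x = rng.normal(size=(4, 1024))
out = layer.forward(x)
```

For this 1024 by 1024 layer at rank 8, the adapter holds 16,384 parameters against about a million frozen ones, roughly 1.5% of the total. Swapping specializations means swapping `A` and `B` while `weight` stays loaded.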