Fine-Tuning

Also known as: fine-tune, task-specific training, domain adaptation

TL;DR

Fine-tuning is the process of further training a pre-trained model on task-specific or domain-specific data. It's how a generalist becomes a specialist.

Fine-tuning is what turns a general-purpose pre-trained model into a specialist. Pre-training teaches a model to understand language at scale (next-token prediction over trillions of tokens); fine-tuning teaches it to do your specific task with the broad knowledge it already has.

[Figure: Fine-tuning with a task head atop a frozen base. "A generalist becomes a specialist." Architecture: input x, an embedding layer, and six transformer blocks (all frozen), followed by a task head where the gradient flows. Parameters: 7B base, 70M head (99% frozen, 1% trained). Inset plot: loss vs. fine-tune training steps.]

Almost every specialized model in production retrieval is fine-tuned: zerank-2 is a fine-tuned Qwen3-4B; zembed-1 is fine-tuned via distillation from zerank-2; custom rerankers for legal, medical, and financial domains are zerank-2 further fine-tuned on domain data.

The core recipe

  1. Start with a pre-trained base model with strong general language ability (Qwen, Llama, Mistral).
  2. Collect or generate task-specific training data. For retrieval, this is typically (query, document, relevance) triples, often produced from zELO-style pairwise LLM judgments that are converted into pointwise targets.
  3. Continue training the model on the new data with a smaller learning rate than pre-training. The base model’s general knowledge is preserved; the task-specific behavior is learned on top.
  4. Validate on held-out task-specific evals; iterate.

That’s the whole shape. Most of the difficulty is in step 2 (data quality) and the choice of step 3 hyperparameters.
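To make the "pairwise judgments converted into pointwise targets" part of step 2 concrete, here is a minimal sketch. It assumes a plain Bradley-Terry model fitted by gradient ascent, which is one common way to turn pairwise wins into per-document scores; it is not a description of any particular production pipeline. The function names, document IDs, and judgments are all hypothetical.

```python
import math
from collections import defaultdict


def fit_pairwise_scores(judgments, epochs=200, lr=0.1):
    """Fit one scalar score per document from (winner, loser) pairs.

    Gradient ascent on the Bradley-Terry log-likelihood, where
    P(a beats b) = sigmoid(score[a] - score[b]).
    """
    scores = defaultdict(float)
    for _ in range(epochs):
        for winner, loser in judgments:
            p_win = 1.0 / (1.0 + math.exp(scores[loser] - scores[winner]))
            step = lr * (1.0 - p_win)  # larger step when the result was surprising
            scores[winner] += step
            scores[loser] -= step
    return dict(scores)


def to_pointwise_targets(scores):
    """Min-max rescale scores to [0, 1] so they can serve as regression targets."""
    lo, hi = min(scores.values()), max(scores.values())
    span = (hi - lo) or 1.0
    return {doc: (s - lo) / span for doc, s in scores.items()}


# Hypothetical judgments for a single query: (preferred_doc, other_doc).
judgments = [
    ("doc_a", "doc_b"), ("doc_a", "doc_b"), ("doc_b", "doc_a"),
    ("doc_a", "doc_c"),
    ("doc_b", "doc_c"), ("doc_b", "doc_c"), ("doc_c", "doc_b"),
]
targets = to_pointwise_targets(fit_pairwise_scores(judgments))

# (query, document, relevance) triples, ready to be used as training rows.
triples = [("example query", doc, round(t, 3)) for doc, t in sorted(targets.items())]
```

In this toy data every document wins and loses at least once, so the fitted scores stay finite. Real judgment sets usually need regularization or a fixed anchor for documents that never lose.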

Variants you’ll encounter

[Figure: LoRA (low-rank adapters). "Same update, two orders of magnitude fewer trainable weights." Full fine-tune: the whole d × d matrix W is trainable (d = 4096, about 16.8M parameters). LoRA adapter: W stays frozen and untouched; only A (d × r) and B (r × d) are trained, with r = 8 (about 65K parameters), giving W' = W + A·B. The update is applied at the same layer, but the optimiser only touches A and B. Trainable parameters per matrix: 16,777,216 vs. 65,536, about 256× smaller.]

  • Full fine-tuning — every parameter is trained. Most expressive, most expensive, biggest risk of catastrophic forgetting (the model loses general capability while gaining the specific one).
  • LoRA / PEFT — only train tiny adapter matrices, freeze the base model. ~1% of the parameters trained, comparable results, easy to swap adapters in and out for different tasks. The default for most production fine-tunes today.
  • Instruction tuning — fine-tune on (instruction, response) pairs to teach a base model to follow instructions. The standard step that turns “GPT-base” into “GPT-instruct”.
  • RLHF / DPO — alignment-focused fine-tuning on pairwise preferences. RLHF uses reinforcement learning with a learned reward model; DPO is a more recent method that achieves similar results with simpler training. Both are about making the model produce preferred outputs, not just plausible ones.
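The parameter savings behind the LoRA bullet are plain arithmetic. The sketch below reproduces the per-matrix numbers from the figure (d = 4096, r = 8); the function name is hypothetical, and it counts a single weight matrix, not a whole model.

```python
def lora_param_counts(d_out, d_in, rank):
    """Trainable parameters for one weight matrix: full fine-tune vs. LoRA."""
    full = d_out * d_in                   # every entry of W is trainable
    adapter = d_out * rank + rank * d_in  # A is d_out x r, B is r x d_in
    return full, adapter


full, adapter = lora_param_counts(d_out=4096, d_in=4096, rank=8)
ratio = full // adapter
print(full, adapter, ratio)  # 16777216 65536 256
```

The model-wide share of trainable parameters depends on which matrices receive adapters and on the rank chosen, so the "~1%" figure is best read as an order of magnitude.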

Three differences matter operationally.

First, swappability: LoRA adapters are typically a few hundred MB on a multi-billion-parameter base, so you can keep dozens of fine-tunes (one per customer, one per task) and swap them in and out at inference time without reloading the base weights. Full fine-tunes can’t share infrastructure that way, because each one is a complete copy of the model.

Second, catastrophic forgetting: full fine-tuning updates every weight, which can erode general capability the base learned during pre-training. LoRA confines task-specific changes to small adapter matrices while the base weights stay frozen, so generalist behavior largely survives, and removing the adapter restores the base model exactly.

Third, training cost: LoRA needs far less GPU memory because gradients and optimizer state are kept only for the adapters. The frozen base weights still have to fit in memory, so fine-tuning a 70B model on a single H100 in practice means pairing LoRA with a quantized base (QLoRA-style). Full fine-tuning of the same model needs at least a whole node.

The result is comparable quality on narrow tasks at a fraction of the engineering and infra cost, which is why almost every production fine-tune in 2025 is LoRA-style.
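The swappability point can be illustrated with a toy layer that holds one frozen base matrix and several named adapters. This is a minimal pure-Python sketch under stated assumptions, not how any serving stack implements it: the class `LoRALinear`, its methods, and the "legal" and "medical" adapters are all hypothetical, and the matrices are tiny so the arithmetic can be followed by hand.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


class LoRALinear:
    """A frozen base weight plus named low-rank adapters (hypothetical helper)."""

    def __init__(self, base_weight):
        self.base_weight = base_weight  # shared across tasks, never modified
        self.adapters = {}              # name -> (A, B)
        self.active = None              # None means "plain base model"

    def add_adapter(self, name, a, b):
        # Convention: A is d x r, B is r x d, and the update is W + A·B.
        self.adapters[name] = (a, b)

    def set_active(self, name):
        if name is not None and name not in self.adapters:
            raise KeyError(f"unknown adapter: {name}")
        self.active = name

    def forward(self, x):
        y = matvec(self.base_weight, x)
        if self.active is None:
            return y
        a, b = self.adapters[self.active]
        # (W + A·B) x is computed as W x + A (B x), so A·B is never materialised.
        delta = matvec(a, matvec(b, x))
        return [yi + di for yi, di in zip(y, delta)]


layer = LoRALinear([[1.0, 0.0], [0.0, 1.0]])
layer.add_adapter("legal", a=[[1.0], [0.0]], b=[[0.0, 2.0]])
layer.add_adapter("medical", a=[[0.0], [1.0]], b=[[1.0, 0.0]])

x = [1.0, 1.0]
base_out = layer.forward(x)      # [1.0, 1.0]
layer.set_active("legal")
legal_out = layer.forward(x)     # [3.0, 1.0]
layer.set_active("medical")
medical_out = layer.forward(x)   # [1.0, 2.0]
```

Switching tasks only changes which small (A, B) pair is read. The base matrix is never copied or rewritten, which is also why deactivating the adapter returns exactly the base behavior.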

When fine-tuning is the right answer

  • You’re doing the same narrow task at high volume (retrieval, reranking, classification, query rewriting).
  • The task has stable structure — input and output formats don’t drift.
  • Per-call cost matters; you’d otherwise be making expensive calls for every instance.
  • You have domain knowledge a generalist model doesn’t (legal terminology, internal product taxonomy, your customers’ query patterns).

When prompting is enough

  • Low volume, exploratory, or rapidly evolving tasks.
  • Tasks that benefit from the LLM’s broad reasoning, not narrow specialization.
  • Where you can’t justify the data-collection and training cost.

More of production AI lives in the first bucket than people typically realize, which is why the long-term shape of inference looks more like a constellation of fine-tuned specialists than one giant LLM doing everything.
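One way to frame the choice between the two buckets is a break-even estimate: fine-tuning carries a fixed cost (data collection plus training) and a lower per-call cost afterwards. The sketch below is only that arithmetic. The function name is hypothetical, and the dollar figures are placeholders chosen for illustration, not measurements.

```python
import math


def break_even_calls(fixed_cost, prompt_cost_per_call, specialist_cost_per_call):
    """Number of calls after which the fine-tuned specialist is cheaper overall.

    Returns None when the specialist is not cheaper per call, in which case
    the fixed cost is never paid back on cost grounds alone.
    """
    saving = prompt_cost_per_call - specialist_cost_per_call
    if saving <= 0:
        return None
    return math.ceil(fixed_cost / saving)


# Placeholder figures, chosen only to make the arithmetic easy to follow.
calls = break_even_calls(
    fixed_cost=5000.0,               # data collection + training
    prompt_cost_per_call=0.01,       # prompting a large general model
    specialist_cost_per_call=0.0005, # serving a small fine-tuned model
)
```

Cost is only one axis. Quality stability and latency can justify a specialist before the break-even volume is reached, and a rapidly changing task can make the fixed cost recur.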

Pretraining gives you general intelligence; fine-tuning gives you a useful tool. The production-AI thesis is that “useful tool” usually means a small specialist — and small specialists are almost always built by fine-tuning a frontier base.

Fine-tuned specialist models in production

  • zerank-2: Qwen3-4B fine-tuned via zELO on pairwise relevance judgments (a reranker)
  • zembed-1: bi-encoder fine-tuned by distilling from zerank-2 (an embedding model)
  • Domain-specific rerankers: zerank-2 further fine-tuned on legal, medical, or financial corpora
  • Faithfulness checkers: encoder fine-tuned on (claim, context, label) triples
  • Query rewriters: small decoder LLMs fine-tuned on (raw query, expanded query) pairs
  • Tool-selection models: classifier-style fine-tunes that pick which tool to invoke from a catalog of dozens

Go further

When does fine-tuning beat prompting?

When the task is high-volume, narrow, and stable — every reranker call, every query rewrite, every classification. Prompting is right for one-off and exploratory uses; fine-tuning is right for the production hot path where per-call cost and quality stability dominate.

What's the difference between fine-tuning, instruction tuning, RLHF, and DPO?

Fine-tuning is the umbrella term — any further training on a pre-trained model. Instruction tuning is fine-tuning on (instruction, response) pairs to make a base model follow directions. RLHF (reinforcement learning from human feedback) and DPO (direct preference optimization) further refine the model on pairwise preference data — they're the alignment step on top of instruction tuning.
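To show what "training on pairwise preference data" means for DPO specifically, here is a minimal sketch of the standard DPO loss for a single (chosen, rejected) pair. It assumes the sequence log-probabilities under the policy being trained and under a frozen reference model have already been computed; the function and argument names are illustrative, and beta = 0.1 is just a commonly used default.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one preference pair.

    loss = -log sigmoid(beta * ((pi_w - ref_w) - (pi_l - ref_l)))
    where each term is a sequence log-probability (w = chosen, l = rejected).
    """
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    z = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(z)) == softplus(-z), written in a numerically stable form.
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))


# Policy identical to the reference: no preference learned yet, loss = log 2.
neutral = dpo_loss(-5.0, -5.0, -5.0, -5.0)

# Policy has shifted probability toward the chosen response: loss drops.
improved = dpo_loss(-4.0, -6.0, -5.0, -5.0)
```

Unlike RLHF, there is no separate reward model or reinforcement-learning loop here: the preference signal enters directly through this loss, which is the "simpler training" the comparison refers to.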

Do I need to fine-tune the whole model?

Almost never. LoRA (Low-Rank Adaptation) and other parameter-efficient fine-tuning (PEFT) methods train tiny adapter matrices on top of frozen base weights: typically about 1% of the parameters or fewer, with results that match or come close to full fine-tuning on most narrow tasks. The base model stays general; the LoRA adapter encodes your specific specialization.
