LoRA and Parameter-Efficient Fine-Tuning (PEFT)

Also known as: LoRA, low-rank adaptation, PEFT, parameter-efficient fine-tuning

TL;DR

LoRA injects tiny low-rank adapter matrices into a frozen base model and trains only those — typically ~1% of the parameters. Results match or beat full fine-tuning on most narrow tasks at a fraction of the memory and storage cost.

LoRA — Low-Rank Adaptation (Hu et al., 2021) — is the dominant parameter-efficient method. The trick is simple, the savings are large (~1% trainable params, ~5× less memory), and the quality matches full fine-tuning on most narrow tasks.

[Figure: LoRA (low-rank adaptation): train two thin matrices instead of one fat one. The frozen weight W (d × d, ≈ 16.7M params) is summed with the trained product of B (d × r) and A (r × d), ≈ 65K params. Trainable parameters, full fine-tune vs LoRA: 16,777,216 vs 65,536 (256× fewer) for d = 4096, r = 8, since 2 · d · r = 65,536. Forward pass: y = W·x + (B · A)·x, with r ≪ d.]

The trick

Full fine-tuning learns a weight delta ΔW for every weight matrix W in the model. For a d × d matrix that’s d² parameters per layer (≈16.7M at d = 4096); for a 7B model, billions in total.

LoRA replaces ΔW with a low-rank factorization:

ΔW = B·A, with B ∈ ℝ^(d×r), A ∈ ℝ^(r×d), and r ≪ d

With d = 4096 and r = 8, ΔW has 16.7M parameters but B·A has only 65K, a 256× reduction. The base W is frozen; only A and B are trained.
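
The parameter arithmetic can be checked in a few lines. This is a minimal sketch that assumes a square 4096 × 4096 weight matrix and rank 8 (the values shown in the figure); the helper names are illustrative, not taken from any library.

```python
def full_delta_params(d_out: int, d_in: int) -> int:
    """Parameters in a dense weight delta of shape (d_out, d_in)."""
    return d_out * d_in


def lora_params(d_out: int, d_in: int, r: int) -> int:
    """Parameters in B (d_out x r) plus A (r x d_in)."""
    return d_out * r + r * d_in


d, r = 4096, 8  # assumed layer width and adapter rank
full = full_delta_params(d, d)
lora = lora_params(d, d, r)

print(f"full delta:  {full:,}")        # 16,777,216
print(f"LoRA (r={r}): {lora:,}")       # 65,536
print(f"reduction:   {full // lora}x") # 256x
```

Because the adapter size grows linearly in r, raising the rank to 64 would still leave the adapter 32× smaller than the dense delta for this matrix.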

At inference, you compute y = W·x + (B·A)·x. That is the output of a layer whose weight is W + ΔW, at the cost of a tiny extra matrix multiply.
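
The forward pass y = W·x + (B·A)·x can be sketched without any dependencies. The example below uses plain Python lists and toy sizes (d = 6, r = 2), and fills A and B with random values as a stand-in for trained adapters; all names are illustrative. It also checks a commonly used option: folding B·A into W ahead of time gives the same output, which removes the extra multiply when adapters do not need to be swapped.

```python
import random


def matvec(M, x):
    """Multiply matrix M (a list of rows) by vector x."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]


def matmul(P, Q):
    """Multiply matrix P (n x k) by matrix Q (k x m)."""
    cols = list(zip(*Q))
    return [[sum(p * q for p, q in zip(row, col)) for col in cols] for row in P]


def lora_forward(W, A, B, x):
    """y = W·x + B·(A·x): frozen path plus low-rank adapter path."""
    base = matvec(W, x)
    low_rank = matvec(B, matvec(A, x))  # project down to r dims, then back up to d
    return [b + lr for b, lr in zip(base, low_rank)]


def merge(W, A, B):
    """Fold the adapter into the base weight: W' = W + B·A."""
    BA = matmul(B, A)
    return [[w + dw for w, dw in zip(w_row, dw_row)] for w_row, dw_row in zip(W, BA)]


random.seed(0)
d, r = 6, 2  # toy sizes; real layers use d in the thousands
W = [[random.gauss(0, 1) for _ in range(d)] for _ in range(d)]  # frozen, d x d
A = [[random.gauss(0, 1) for _ in range(d)] for _ in range(r)]  # trained, r x d
B = [[random.gauss(0, 1) for _ in range(r)] for _ in range(d)]  # trained, d x r
x = [random.gauss(0, 1) for _ in range(d)]

y_adapter = lora_forward(W, A, B, x)
y_merged = matvec(merge(W, A, B), x)
max_diff = max(abs(a - b) for a, b in zip(y_adapter, y_merged))
print(max_diff)  # floating-point rounding only
```

Computing B·(A·x) rather than (B·A)·x keeps the adapter path at two thin matrix-vector products instead of materializing a d × d matrix.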

The original paper attached LoRA only to the attention query and value projections (W_q and W_v). That’s a fine default for narrow tasks, but later work (especially QLoRA) found that applying LoRA to all linear layers (the attention projections plus the FFN up/down projections) outperforms attention-only on harder tasks. The cost is small (LoRA params are already tiny) and the quality lift is consistent. The common modern choice is to target every linear layer, which peft exposes as target_modules="all-linear"; axolotl offers an equivalent setting. Avoid attaching LoRA to layer-norms or biases: they’re not where the task-specific structure lives, and you’ll just add noise.
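
The two targeting choices might look like this in peft. This is a configuration sketch, not a recommended recipe: it assumes a recent peft release that supports the "all-linear" shorthand, the module names "q_proj" and "v_proj" are an assumption that fits Llama-style checkpoints, and the r and lora_alpha values are placeholders.

```python
from peft import LoraConfig

# Attention-only adapters, in the spirit of the original paper.
# "q_proj" / "v_proj" are assumed module names (Llama-style models);
# other architectures name these layers differently.
attention_only = LoraConfig(
    r=8,
    lora_alpha=16,  # illustrative scaling value, not from the text
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)

# Adapters on every linear layer, using peft's "all-linear" shorthand.
all_linear = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules="all-linear",
    task_type="CAUSAL_LM",
)
```

Either config would then be passed to peft's get_peft_model together with a loaded base model.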

Why it works

The intrinsic dimensionality of task-specific updates is low. Aghajanyan et al. (2020) showed that fine-tuning a transformer on a downstream task only needs to move weights along a subspace of dimension ~10²-10³, even though the full parameter space is much larger. LoRA’s rank-r constraint matches this shape directly. You’re not throwing away expressivity; you’re matching the geometry of what fine-tuning actually does.

Practical wins

  • Memory. Optimizer state and gradients are needed only for the LoRA params, not the base. Full fine-tuning of a 7B model needs ~80GB; LoRA needs ~16GB.
  • Storage. A LoRA adapter is ~10-50MB. You can ship 100 task-specific adapters in the storage cost of one full fine-tune.
  • Hot-swappable. Load the base once, swap LoRA adapters per request. Multi-tenant serving with per-customer adapters becomes feasible.
  • Catastrophic forgetting mitigation. Because the base weights are frozen, the model stays close to its pre-trained knowledge. Helpful when fine-tuning data is narrow.
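
The memory figures in the list above can be roughly reproduced with a back-of-envelope estimate. The byte layout here is an assumption, not something the text specifies: bf16 weights and gradients (2 bytes each) plus two fp32 Adam moments (8 bytes) per trainable parameter, with activations and framework overhead ignored. The adapter size uses the "~1% of the parameters" figure.

```python
GB = 1e9  # decimal gigabytes


def full_finetune_gb(n_params: float) -> float:
    """bf16 weights (2 B) + bf16 grads (2 B) + fp32 Adam moments (8 B) per parameter."""
    return n_params * (2 + 2 + 8) / GB


def lora_finetune_gb(n_base: float, n_adapter: float) -> float:
    """Frozen bf16 base (2 B per param) plus full training state for the adapter only."""
    return (n_base * 2 + n_adapter * (2 + 2 + 8)) / GB


base = 7e9
adapter = 0.01 * base  # assumed: adapters are ~1% of the base parameter count

print(f"full fine-tune: ~{full_finetune_gb(base):.0f} GB")           # ~84 GB
print(f"LoRA:           ~{lora_finetune_gb(base, adapter):.0f} GB")  # ~15 GB
```

Under these assumptions the saving comes almost entirely from dropping gradients and optimizer state for the frozen base; the base weights themselves still have to fit in memory.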

Where LoRA ships in production

  • Per-customer fine-tunes on a shared base — multi-tenant serving with adapter hot-swap
  • Reranker / embedder specialization from a frontier base model
  • Domain adapters (legal, medical, code) on the same general LLM
  • QLoRA on a single 48GB GPU for 70B-parameter base models
  • Style / persona adapters that ship as 20MB artifacts instead of full checkpoints
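
Adapter hot-swap for multi-tenant serving can be illustrated with a toy, dependency-free sketch. The AdapterServer class and tenant names below are hypothetical, and a real serving stack would batch requests and keep the matrices on the GPU; the point is only that the base is loaded once while each request picks its own (A, B) pair.

```python
from typing import Dict, List, Optional, Tuple

Matrix = List[List[float]]
Adapter = Tuple[Matrix, Matrix]  # (A: r x d, B: d x r)


def matvec(M: Matrix, x: List[float]) -> List[float]:
    """Multiply matrix M (a list of rows) by vector x."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]


class AdapterServer:
    """Holds one frozen base matrix and any number of named adapters."""

    def __init__(self, W: Matrix) -> None:
        self.W = W  # loaded once, shared by every tenant
        self.adapters: Dict[str, Adapter] = {}

    def register(self, name: str, A: Matrix, B: Matrix) -> None:
        self.adapters[name] = (A, B)

    def forward(self, x: List[float], adapter: Optional[str] = None) -> List[float]:
        y = matvec(self.W, x)
        if adapter is None:
            return y  # plain base model
        A, B = self.adapters[adapter]
        delta = matvec(B, matvec(A, x))  # per-request low-rank correction
        return [yi + di for yi, di in zip(y, delta)]


W = [[1.0, 0.0], [0.0, 1.0]]  # identity base, d = 2
server = AdapterServer(W)
server.register("tenant_a", A=[[1.0, 0.0]], B=[[1.0], [0.0]])  # r = 1
server.register("tenant_b", A=[[0.0, 1.0]], B=[[0.0], [2.0]])

x = [3.0, 4.0]
print(server.forward(x))              # [3.0, 4.0]
print(server.forward(x, "tenant_a"))  # [6.0, 4.0]
print(server.forward(x, "tenant_b"))  # [3.0, 12.0]
```

Switching tenants only changes which small matrices are read, so the cost of adding a customer is the adapter's storage rather than another copy of the base.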

QLoRA: the cost-collapse variant

QLoRA (Dettmers et al., 2023) quantizes the frozen base model to 4-bit NF4 precision and trains LoRA adapters in 16-bit on top. Memory drops another 2-3×; quality is nearly indistinguishable from full-precision LoRA. A 70B-parameter model fine-tunes on a single A100/H100. This is what made open-weight fine-tuning a hobbyist activity in 2023 and a default production technique in 2024+.
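
With Hugging Face transformers, the 4-bit NF4 base described above is usually requested through a quantization config. The fragment below is a sketch that assumes the transformers and bitsandbytes packages are installed; the choice of bfloat16 as the compute dtype is an assumption, and the model id in the comment is a placeholder.

```python
import torch
from transformers import BitsAndBytesConfig

# Store the frozen base in 4-bit NF4; matrix multiplies run in bf16.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,  # assumed 16-bit compute dtype
)

# Hypothetical usage (placeholder model id, not from the text):
# model = AutoModelForCausalLM.from_pretrained(
#     "<base-model-id>", quantization_config=bnb_config
# )
```

LoRA adapters are then attached on top of the quantized model in the usual way, and only the adapters receive gradients.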

Where LoRA underperforms

  • Tasks requiring substantial new knowledge the base doesn’t have. The low-rank constraint can’t add capacity, only steer existing capacity.
  • Continuous pre-training scenarios. If the goal is to absorb a large new domain, you need more capacity than rank-8 adapters provide. Full fine-tuning or higher-rank LoRA (r=64+) is appropriate.

For most production retrieval fine-tunes (rerankers, embedders, query rewriters, classifiers), LoRA at a low rank matches full fine-tuning. There’s rarely a reason to do otherwise.

Go further

Why does training 1% of weights match full fine-tuning?

The intrinsic dimensionality of fine-tuning updates is low. Aghajanyan et al. (2020) showed task-specific updates lie on a subspace much smaller than the full parameter space. LoRA exploits this directly: instead of learning a full delta ΔW ∈ ℝ^(d×d), learn ΔW = B·A with B ∈ ℝ^(d×r), A ∈ ℝ^(r×d), r ≪ d. The constraint matches the natural shape of fine-tuning updates.

What's QLoRA and why does everyone use it?

QLoRA quantizes the frozen base model to 4-bit (NF4) and trains LoRA adapters on top in 16-bit. A 70B model fine-tunes on a single 48GB GPU instead of an 8-GPU node. Quality loss vs full-precision LoRA is small; cost difference is huge. The default for indie/small-team fine-tuning.
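
The single-GPU claim follows from the weight footprint alone. This is a rough estimate that counts only the base weights; it ignores quantization constants, the adapters, optimizer state and activations, all of which add overhead on top.

```python
def weight_gb(n_params: float, bits_per_param: float) -> float:
    """Memory for the weights alone, in decimal GB."""
    return n_params * bits_per_param / 8 / 1e9


n = 70e9  # 70B-parameter base model
print(f"16-bit base: {weight_gb(n, 16):.0f} GB")  # 140 GB
print(f" 4-bit base: {weight_gb(n, 4):.0f} GB")   # 35 GB
```

At 16-bit the weights alone exceed any single 48GB or 80GB card, while the 4-bit base leaves headroom on a 48GB GPU for adapters and activations.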

How do PEFT methods other than LoRA compare?

Prefix tuning (learnable token prefix) and prompt tuning (learnable continuous prompts) work for some tasks but underperform LoRA on harder ones. Adapter layers (Houlsby) are LoRA's predecessor and add inference latency. IA³ multiplies activations by learned scalars — extremely cheap but limited capacity. LoRA hit the sweet spot and won.
