Overfitting

Also known as: over-fitting, memorization

TL;DR

Overfitting is when a model memorizes its training set instead of learning generalizable patterns. The classical sign: perfect on data it has seen, poor on data it hasn’t. The classical fix: limit capacity, regularize, or stop training before it memorizes. It’s the central concern of textbook statistical learning — and the place where modern deep learning has most clearly broken with the textbook.

[Figure: Overfitting, three regimes and loss curves. "Fit the signal. Don’t fit the noise." Panels: underfit (high bias), just right (captures trend), overfit (memorises noise). Loss vs. training time: train and validation curves, with the early-stop point marked ahead of the region labelled "memorising noise".]

The classical picture

In the standard statistical-learning view, every model has some capacity to fit functions. If capacity is too low, the model can’t represent the true relationship — that’s underfitting, high bias. If capacity is too high, the model fits the noise in the training data along with the signal — that’s overfitting, high variance. The textbook plot is U-shaped: validation error decreases with capacity, hits a minimum, then climbs again as capacity exceeds what the data justifies.
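
The capacity trade-off described above can be reproduced on a toy problem. The following is a minimal sketch, not a benchmark: it assumes a synthetic target (a noisy sine wave) and uses polynomial degree as a stand-in for capacity. The function name `capacity_sweep` and all of its parameters are hypothetical choices made for illustration.

```python
import numpy as np

def capacity_sweep(degrees=(1, 3, 15), n_train=20, n_val=200, noise=0.3, seed=0):
    """Fit polynomials of increasing degree to noisy samples of sin(2*pi*x).

    Returns {degree: (train_mse, val_mse)}. Polynomial degree plays the
    role of model capacity (an assumption of this toy setup).
    """
    rng = np.random.default_rng(seed)
    x_train = rng.uniform(0.0, 1.0, n_train)
    x_val = rng.uniform(0.0, 1.0, n_val)

    def true_fn(x):
        return np.sin(2 * np.pi * x)

    # Same signal, independent noise draws for the two splits.
    y_train = true_fn(x_train) + rng.normal(0.0, noise, n_train)
    y_val = true_fn(x_val) + rng.normal(0.0, noise, n_val)

    results = {}
    for d in degrees:
        # Polynomial features [1, x, ..., x^d]; lstsq returns the least-squares fit.
        a_train = np.vander(x_train, d + 1, increasing=True)
        a_val = np.vander(x_val, d + 1, increasing=True)
        coef, *_ = np.linalg.lstsq(a_train, y_train, rcond=None)
        train_mse = float(np.mean((a_train @ coef - y_train) ** 2))
        val_mse = float(np.mean((a_val @ coef - y_val) ** 2))
        results[d] = (train_mse, val_mse)
    return results

for degree, (train_mse, val_mse) in capacity_sweep().items():
    print(f"degree={degree:2d}  train_mse={train_mse:.4f}  val_mse={val_mse:.4f}")
```

Training error can only fall as the degree grows, because each lower-degree model is a special case of the higher-degree one. What to watch is validation error: a straight line (degree 1) cannot represent the sine wave at all, while a very high degree has enough freedom to chase the noise in the 20 training points.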

The classical diagnostic is the gap between training and validation loss. Training loss going down while validation loss goes up means the model is fitting noise. Common interventions:

Classical overfitting fixes
  • Reduce model capacity (fewer parameters)
  • Add more training data
  • Apply L2 regularization (weight decay)
  • Use dropout to discourage co-adaptation between units
  • Early-stop training when validation loss bottoms out
  • Cross-validate to get a less noisy estimate of generalization
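
As one concrete instance of the fixes above, here is a minimal sketch of L2 regularization for linear regression, using the closed-form ridge solution. The function name `ridge_fit` and the synthetic data are assumptions made for illustration; in a neural network the same penalty is usually applied through the optimizer's weight-decay setting instead.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form L2-regularized least squares.

    Minimizes ||X w - y||^2 + lam * ||w||^2, which gives
    w = (X^T X + lam * I)^{-1} X^T y for lam > 0.
    """
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)

# Toy data (hypothetical): 30 samples, 20 features, only 3 of which matter.
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 20))
w_true = np.zeros(20)
w_true[:3] = [2.0, -1.0, 0.5]
y = X @ w_true + rng.normal(0.0, 0.5, size=30)

for lam in (1e-3, 1.0, 10.0):
    w = ridge_fit(X, y, lam)
    train_mse = float(np.mean((X @ w - y) ** 2))
    print(f"lam={lam:<6}  ||w||={np.linalg.norm(w):.3f}  train_mse={train_mse:.4f}")
```

A larger `lam` shrinks the weight norm and raises training error. That is the intended trade: the model gives up some fit on the training set in exchange for a less noise-sensitive solution. The right value is normally chosen on validation data, not on the training set.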

This view dominated machine learning for thirty years. It still applies cleanly to most classical models — linear regression, gradient-boosted trees, small neural nets. The whole is built on it.

The modern complication

Then came GPT-3, with 175 billion parameters trained on roughly 300 billion tokens. The textbook says a model with that many parameters relative to its data should overfit catastrophically. It doesn’t: it generalizes better than smaller models do. The same pattern shows up across vision, language, and tabular benchmarks, where massively overparameterized models, trained to interpolate the training data (zero training loss), often beat smaller “right-sized” models on held-out data. The classical U-shaped bias-variance curve does not describe this regime.

What’s actually going on

The modern consensus is that overparameterized neural networks have an implicit bias — the optimizer (typically stochastic gradient descent) doesn’t pick an arbitrary zero-training-loss solution. It picks one with low norm, low complexity by some other measure, or one that interpolates smoothly between training points. This implicit regularization is doing work that the textbook view didn’t account for.
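
The simplest setting where this implicit bias can be seen exactly is linear least squares with more parameters than data points. The sketch below is a linear analogue only, and it uses full-batch gradient descent rather than SGD. It relies on a standard result: gradient descent started from zero on an underdetermined least-squares problem converges to the minimum-norm interpolating solution. The function name `gd_from_zero` and the problem sizes are hypothetical.

```python
import numpy as np

def gd_from_zero(X, y, steps=5000):
    """Full-batch gradient descent on 0.5 * ||X w - y||^2, starting at w = 0.

    The step size 1 / sigma_max(X)^2 is a safe choice for this quadratic loss.
    """
    w = np.zeros(X.shape[1])
    lr = 1.0 / np.linalg.norm(X, 2) ** 2  # spectral norm squared
    for _ in range(steps):
        w -= lr * (X.T @ (X @ w - y))
    return w

# Overparameterized toy problem (hypothetical sizes): 10 equations, 50 unknowns,
# so infinitely many weight vectors reach zero training loss.
rng = np.random.default_rng(0)
X = rng.normal(size=(10, 50))
y = rng.normal(size=10)

w_gd = gd_from_zero(X, y)
w_min_norm = np.linalg.pinv(X) @ y  # minimum-norm interpolant, for comparison

# Build a different zero-loss solution by adding a null-space direction of X.
v = rng.normal(size=50)
v -= np.linalg.pinv(X) @ (X @ v)  # remove the row-space part, so X @ v is ~0
w_other = w_gd + v

print("train residual (GD):     ", np.linalg.norm(X @ w_gd - y))
print("distance to min-norm:    ", np.linalg.norm(w_gd - w_min_norm))
print("norms (GD vs alternative):", np.linalg.norm(w_gd), np.linalg.norm(w_other))
```

Both `w_gd` and `w_other` fit the training data, but gradient descent lands on the smaller-norm one without any explicit penalty. Whether and how this carries over to deep networks trained with SGD is the part that remains an active research question.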

Empirically, three modern phenomena reshape the overfitting picture:

  • Double descent — test error is non-monotonic in model size; bigger models can be better.
  • Grokking — generalization can emerge long after training-set memorization, suggesting two distinct learning phases.
  • Scaling laws — over many orders of magnitude, bigger models trained on more data keep improving with no overfitting plateau in sight.

None of this means overfitting has gone away. It still happens, and it bites in exactly the cases the textbook predicts. Small fine-tuning runs on small datasets overfit fast; a 7B model fine-tuned on 1,000 examples for too many epochs will memorize and lose general capability. Specialized rerankers and embedding models trained on small labeled datasets need careful regularization. What changed is that pretraining, the regime of huge-model-on-huge-data, doesn’t fit the textbook picture. Fine-tuning, low-resource training, and small models all still do.

The rule of thumb: if the data is plentiful relative to capacity, overfitting is mostly a non-issue and you should grow the model. If the data is scarce relative to capacity, classical overfitting is alive and well, and the classical fixes listed earlier are the tools to reach for.

Overfitting is fitting noise instead of signal. The textbook story still applies to small models on small data; modern deep networks live in a different regime the textbook doesn’t predict.

Go further

How do you actually detect overfitting?

The clean signal is divergence between training loss and held-out validation loss. Training loss goes down; validation loss bottoms out and starts climbing. The gap between them is the conventional measure. If the gap widens monotonically over training, you're overfitting on at least some axis.
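
The check described above can be sketched as a small helper that reads two loss curves recorded at the same evaluation steps. The function name `overfitting_report`, the `patience` parameter, and the returned keys are all hypothetical; training frameworks usually ship their own early-stopping callbacks with different interfaces.

```python
def overfitting_report(train_loss, val_loss, patience=3):
    """Summarize the train/validation gap from two equally long loss curves.

    Flags overfitting when validation loss has not improved for at least
    `patience` evaluations and the gap has grown since the best step.
    """
    if len(train_loss) != len(val_loss) or len(val_loss) == 0:
        raise ValueError("loss curves must be non-empty and the same length")

    # First index at which validation loss reaches its minimum.
    best_step = min(range(len(val_loss)), key=val_loss.__getitem__)
    gaps = [v - t for t, v in zip(train_loss, val_loss)]
    evals_since_best = len(val_loss) - 1 - best_step

    return {
        "best_step": best_step,  # candidate early-stopping point
        "gap_at_best": gaps[best_step],
        "final_gap": gaps[-1],
        "overfitting": evals_since_best >= patience and gaps[-1] > gaps[best_step],
    }

# Made-up curves: training loss keeps falling, validation loss turns upward.
train = [1.0, 0.8, 0.6, 0.4, 0.3, 0.2, 0.1]
val = [1.1, 0.9, 0.8, 0.85, 0.9, 1.0, 1.1]
print(overfitting_report(train, val))
```

The `patience` threshold exists because validation curves are noisy. A single uptick is not evidence of memorization, while several consecutive evaluations without improvement usually is.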

Why don't huge LLMs overfit the way the textbook predicts?

Empirically, models with billions of parameters trained on internet-scale data fit their training data very closely yet still generalize. This contradicts the classical bias-variance picture, and explaining it is one of the central open problems in deep-learning theory. Double descent and grokking are pieces of the puzzle.

Is data leakage the same as overfitting?

No, but leakage can hide overfitting. Data leakage means your validation set isn't actually held out: duplicates, near-duplicates, or downstream signals have leaked into training. The model looks like it generalizes (low validation loss) when it is really memorizing. In practice, a surprising fraction of 'low overfitting' results turn out to be leakage.
