RLHF (Reinforcement Learning from Human Feedback)

Also known as: reinforcement learning from human feedback, PPO alignment

TL;DR

RLHF is the classical alignment recipe: train a reward model from human pairwise preferences, then fine-tune the language model with PPO to maximize that reward.

RLHF — Reinforcement Learning from Human Feedback — is the alignment technique that turned base LLMs into the helpful, harmless, honest chat assistants people actually use. Introduced in production form by InstructGPT (Ouyang et al., 2022), RLHF is what separates “GPT-3” from “ChatGPT”.

The three-stage recipe

Supervised fine-tuning (SFT). Start with a pre-trained base model. Fine-tune on (prompt, demonstration) pairs to teach it to follow instructions. Output: a reference policy $\pi_{\text{ref}}$.
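Reward model training. Sample multiple completions per prompt from the SFT model. Show humans pairs and ask which is better. Train a reward model $r_\phi$ to predict the pairwise preference, using the same Bradley-Terry form as Elo:

$$P(y_w \succ y_l \mid x) = \sigma\big(r_\phi(x, y_w) - r_\phi(x, y_l)\big)$$

Here $y_w$ and $y_l$ are the preferred and rejected completions for prompt $x$, and $\sigma$ is the logistic function.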

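PPO fine-tuning. Use Proximal Policy Optimization to update the policy $\pi_\theta$ to maximize expected reward, penalized by KL divergence from the reference policy $\pi_{\text{ref}}$ produced by SFT:

$$\max_{\pi_\theta} \; \mathbb{E}_{x \sim \mathcal{D}}\Big[\mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}\big[r_\phi(x, y)\big] - \beta \, \mathrm{KL}\big(\pi_\theta(\cdot \mid x) \,\|\, \pi_{\text{ref}}(\cdot \mid x)\big)\Big]$$

where $r_\phi$ is the trained reward model and $\beta > 0$ sets the strength of the penalty.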

The KL term keeps the policy from drifting too far and exploiting reward-model artifacts.
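The reward-model step of the recipe can be sketched in a few lines. This is a minimal illustration, not a production implementation: it assumes scalar rewards have already been computed for a preferred and a rejected completion, and the function names are hypothetical. A real reward model would average this loss over batches of model outputs.

```python
import math


def preference_probability(reward_chosen: float, reward_rejected: float) -> float:
    """Bradley-Terry probability that the chosen completion beats the rejected one."""
    margin = reward_chosen - reward_rejected
    # Numerically stable logistic function of the reward margin.
    if margin >= 0:
        return 1.0 / (1.0 + math.exp(-margin))
    exp_margin = math.exp(margin)
    return exp_margin / (1.0 + exp_margin)


def pairwise_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Negative log-likelihood of the human preference under Bradley-Terry."""
    margin = reward_chosen - reward_rejected
    # log(1 + exp(-margin)), written so large |margin| does not overflow.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
```

The loss shrinks as the reward model scores the preferred completion further above the rejected one, and it depends only on the difference between the two rewards, not on their absolute values.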

What RLHF actually does

RLHF is not “teaching the model what’s true” — it’s teaching the model what humans prefer to read. That distinction matters: the model can learn to be more confident, more verbose, more agreeable in ways that look aligned but aren’t. Reward hacking (the model finding inputs the reward model rates highly that humans actually wouldn’t) is a recurring failure mode.

Despite that, RLHF works remarkably well in practice. The InstructGPT result, in which labelers preferred outputs from a 1.3B-parameter RLHF’d model over those from the 175B-parameter GPT-3, kicked off the modern alignment era.

The lesson RLHF taught the field: pairwise preferences are the right supervision shape for alignment. Modern reranker training inherits exactly this insight — pairwise LLM judgments at scale, no policy gradient required.

Why production has moved on

Three problems pushed the field toward DPO:

Instability. PPO is notoriously hard to tune. Reward, KL coefficient, learning rate, batch size all interact non-trivially.
Compute. RLHF requires the policy, reference, reward, and value models in memory simultaneously. Memory cost is roughly 4× SFT.
Reward model brittleness. A separately-trained reward model has its own generalization failures, and the policy will exploit them.

DPO eliminates the reward model and the RL loop entirely, achieving comparable alignment with simpler, more stable training. By 2025, most new alignment work uses DPO or its descendants (IPO, KTO, ORPO).
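To make the contrast with the PPO loop concrete, here is a minimal sketch of the per-pair DPO loss from Rafailov et al. (2023). It assumes the summed log-probabilities of each completion under the policy and the frozen reference are already available; the function and argument names are hypothetical, and the default `beta=0.1` is an illustrative choice rather than a recommendation.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log of the logistic function."""
    return -(max(-z, 0.0) + math.log1p(math.exp(-abs(z))))


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # assumed value; plays the role of the KL strength
) -> float:
    """DPO loss for one (chosen, rejected) pair of completions."""
    # How much more likely each completion is under the policy than the reference.
    chosen_log_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_log_ratio = policy_rejected_logp - ref_rejected_logp
    # Same Bradley-Terry shape as a reward model, with the log-ratios as implicit rewards.
    return -log_sigmoid(beta * (chosen_log_ratio - rejected_log_ratio))
```

No reward model or sampling step appears: the loss is an ordinary supervised objective over preference pairs, which is where the simpler training comes from.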

The post-RLHF lineage

DPO (Rafailov et al., 2023) — closed-form solution; no RM, no PPO. The new default.
IPO (Azar et al., 2023) — fixes DPO’s overfitting on deterministic preferences.
KTO (Ethayarajh et al., 2024) — works with binary “good or bad” labels, not paired comparisons.
ORPO (Hong et al., 2024) — folds preference and SFT into a single training stage.
RLAIF (Bai et al., 2022) — “feedback” comes from an LLM, not humans. Powers Constitutional AI and zELO-style reranker training.

You could train the reward model and the policy jointly: the policy outputs token logits, so slap a scalar head on top and share the trunk. People have tried. It almost never works in practice. Reasons:

Different optima. Predicting next tokens (language modeling) and predicting human preferences (reward) require different feature mixes; sharing weights makes both tasks worse.
Reward hacking gets worse. The policy sees “the gradient that would improve my reward” and moves directly toward exploiting RM quirks. Decoupling RM and policy creates an adversarial separation that bounds this.
Calibration drift. A reward head trained alongside a moving policy is hitting a moving target — it never converges to a stable scoring function.

So in practice the RM is initialized from the SFT model (so it has the same world model) but trained as a separate fixed model. The policy then optimizes against this frozen RM during PPO. The over-optimization risk is real but bounded by the KL penalty.

PPO (Proximal Policy Optimization, Schulman et al., 2017) is a clipped-objective refinement of vanilla policy gradient. The core problem with policy gradient: a single gradient step can move the policy too far, into a region where the value estimates are stale and the next gradient is nonsense. PPO solves this by clipping the ratio between new and old policy probabilities: if a token’s probability ratio moves outside $[1 - \epsilon, 1 + \epsilon]$ in the direction the advantage favors, its contribution to the objective is clipped and stops pushing the policy further.

The mechanic is conservative; the effect is stability. PPO trades sample efficiency (slower convergence) for the property that “one update doesn’t blow up the policy.” For LLM RLHF where each generation is expensive and the policy is enormous, that tradeoff is a clear win — and is why RLHF papers from 2022-2024 essentially universally use PPO over alternatives like A2C or vanilla REINFORCE.
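The clipping rule can be written out for a single token as a short sketch. It follows the clipped surrogate objective from Schulman et al. (2017); the function name is hypothetical, and `epsilon=0.2` is an assumed default used only for illustration. A real implementation would average this over every token in a batch and negate it to get a loss.

```python
import math


def clipped_surrogate(
    logp_new: float,
    logp_old: float,
    advantage: float,
    epsilon: float = 0.2,  # assumed clipping range
) -> float:
    """PPO clipped surrogate objective for one token (to be maximized)."""
    # Probability ratio between the updated policy and the policy that sampled the token.
    ratio = math.exp(logp_new - logp_old)
    clipped_ratio = min(max(ratio, 1.0 - epsilon), 1.0 + epsilon)
    # Taking the minimum makes the objective pessimistic: the policy gets no extra
    # credit for moving the ratio past the clip range in the favorable direction.
    return min(ratio * advantage, clipped_ratio * advantage)
```

With a positive advantage, raising the ratio beyond `1 + epsilon` no longer increases the objective, so the gradient for that token vanishes. Moves in the unfavorable direction are left unclipped, so the policy is still penalized for them in full.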

Why it still matters

The pairwise-preference supervision shape RLHF normalized is now the substrate for almost every preference-based system: DPO and its variants, LLM-as-judge, calibrated reward models, and pairwise reranker training via zELO. The PPO loop went away; the data shape stayed.

Go further

Why is RLHF being replaced by DPO?

RLHF has three moving parts (reward model, value model, policy) and is famously unstable — small hyperparameter changes blow up training. DPO derives the same optimum directly from preference data without a reward model, with simpler training and comparable results. Most 2025+ alignment work skips RLHF for DPO or its variants.

Where does the human feedback actually enter?

Human annotators see (prompt, response_A, response_B) pairs and pick which response is better. Those pairwise preferences train the reward model; the LLM never sees humans directly. The pairwise shape is the same low-noise trick that powers zELO.
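One way to picture the data that annotators produce is as a small record per comparison. The sketch below is an assumption about the shape, not a description of any specific annotation tool: the `PreferencePair` class, its field names, and the `"a"`/`"b"` label encoding are all hypothetical.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class PreferencePair:
    """One annotator judgment over two responses to the same prompt."""
    prompt: str
    response_a: str
    response_b: str
    preferred: str  # "a" or "b", as picked by the annotator


def to_chosen_rejected(pair: PreferencePair) -> Tuple[str, str]:
    """Return (chosen, rejected), the order reward-model training consumes."""
    if pair.preferred == "a":
        return pair.response_a, pair.response_b
    if pair.preferred == "b":
        return pair.response_b, pair.response_a
    raise ValueError(f"preferred must be 'a' or 'b', got {pair.preferred!r}")
```

The reward model is trained only on these (chosen, rejected) pairs, which is why the policy itself never sees a human label.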

What does the KL penalty do?

Without a constraint, PPO will drift the policy arbitrarily far from the SFT model it started from, exploiting reward-model quirks rather than producing genuinely better responses. The KL penalty keeps the policy close to the reference (instruction-tuned) model, which limits how far the reward signal can be over-optimized.
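A common way to apply the penalty is to subtract it from the reward-model score before the PPO update. The sketch below is a simplified, sequence-level version under stated assumptions: it uses the sampled completion's own token log-probabilities as a single-sample estimate of the KL term, the function name is hypothetical, and `beta=0.1` is an illustrative value.

```python
from typing import Sequence


def kl_penalized_reward(
    reward: float,
    policy_logprobs: Sequence[float],
    reference_logprobs: Sequence[float],
    beta: float = 0.1,  # assumed penalty strength
) -> float:
    """Reward-model score minus a KL penalty estimated from one sampled completion."""
    if len(policy_logprobs) != len(reference_logprobs):
        raise ValueError("log-probability sequences must cover the same tokens")
    # Sum of per-token log-ratios: a single-sample estimate of
    # KL(policy || reference) for this completion.
    log_ratio = sum(p - q for p, q in zip(policy_logprobs, reference_logprobs))
    return reward - beta * log_ratio
```

When the policy assigns its own sample much more probability than the reference does, the log-ratio is large and the effective reward drops, so a high reward-model score alone is not enough to justify drifting away from the reference.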
