RLHF and DPO: Aligning AI to Human Preferences

Both techniques address the same core problem: after pre-training on raw text, a language model needs to be steered toward responses that are helpful, safe, and aligned with what humans actually want. They differ in how they turn human preference data into a training signal.

RLHF — Reinforcement Learning from Human Feedback

The idea: Train a separate model to predict what humans prefer, then use that model as a reward signal to fine-tune the LLM via RL.

The pipeline:

Supervised Fine-Tuning (SFT): Start with the base LLM and fine-tune it on a curated set of high-quality prompt-response pairs to get a reasonable baseline.
Reward Model Training: Human annotators are shown pairs of model responses and asked which one is better. These preferences train a separate “reward model” (RM) that learns to score any response.
RL Optimization: The LLM is then optimized using PPO (Proximal Policy Optimization) — an RL algorithm — to generate responses that maximize the reward model’s score, while a KL-divergence penalty keeps it from drifting too far from the SFT baseline.
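
Steps 2 and 3 above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not any lab's actual implementation: it assumes the reward model is fit with the pairwise Bradley-Terry (logistic) loss described for InstructGPT-style pipelines, and that the KL penalty is applied as a log-ratio subtracted from the reward model's score. The function names and the coefficient beta are hypothetical.

```python
import math


def pairwise_rm_loss(score_chosen: float, score_rejected: float) -> float:
    """Bradley-Terry loss for one preference pair: -log sigmoid(r_chosen - r_rejected)."""
    margin = score_chosen - score_rejected
    # softplus(-margin) equals -log(sigmoid(margin)); this form avoids overflow
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


def kl_penalized_reward(rm_score: float, logp_policy: float, logp_sft: float,
                        beta: float = 0.1) -> float:
    """Reward seen by the RL step: RM score minus beta * log(pi / pi_SFT).

    logp_policy and logp_sft are the summed token log-probabilities of the
    sampled response under the current policy and the frozen SFT model.
    """
    return rm_score - beta * (logp_policy - logp_sft)
```

With equal scores the pairwise loss is log 2 (about 0.693), and it shrinks as the chosen response outscores the rejected one. In the second function, a response that the policy finds much more likely than the SFT model does has its reward reduced, which is what discourages drift.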

Strengths:

Proven at scale (used by InstructGPT, early ChatGPT, Gemini)
Can capture nuanced human preferences

Weaknesses:

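Complex, brittle pipeline: three training stages, and the PPO stage typically keeps four models in memory (policy, frozen reference, reward model, value model)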
PPO is notoriously unstable and compute-intensive
Reward hacking: the LLM can learn to “game” the reward model without actually being better
Requires significant infrastructure and careful tuning

DPO — Direct Preference Optimization

The idea: Skip the separately trained reward model. Mathematically reformulate the RLHF objective so the LLM itself acts as an implicit reward model, optimized directly from preference data.

The insight: The optimal policy under the RLHF objective has a closed-form relationship to the reward function. DPO (Rafailov et al., 2023) exploits this to derive a loss function that works directly on preference pairs (chosen vs. rejected responses), without ever training a separate RM or running RL.
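
For reference, the closed-form relationship can be written out as follows. The notation follows Rafailov et al. (2023): pi_theta is the policy being trained, pi_ref the SFT reference, r the reward, beta the KL coefficient, Z(x) a normalizing partition function, sigma the logistic function, and (y_w, y_l) the chosen and rejected responses.

```latex
% KL-regularized RLHF objective
\max_{\pi}\; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi(\cdot \mid x)}\big[ r(x, y) \big]
  - \beta\, \mathbb{D}_{\mathrm{KL}}\big[ \pi(\cdot \mid x) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid x) \big]

% Its optimal policy has a closed form
\pi^{*}(y \mid x) = \frac{1}{Z(x)}\, \pi_{\mathrm{ref}}(y \mid x)\,
  \exp\!\Big( \tfrac{1}{\beta}\, r(x, y) \Big)

% Rearranged: the reward expressed through the policy
r(x, y) = \beta \log \frac{\pi^{*}(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)} + \beta \log Z(x)

% Substituted into the Bradley-Terry preference model, Z(x) cancels:
\mathcal{L}_{\mathrm{DPO}}(\theta) =
  -\,\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\Big[ \log \sigma\Big(
    \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
    - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
  \Big) \Big]
```

Because the partition function depends only on the prompt, it drops out of the difference between the two responses, which is why no explicit reward model or sampling step is needed.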

The pipeline:

SFT baseline (same as RLHF)
Preference data — the same human-labeled pairs (chosen/rejected), but fed directly into a classification-style loss on the LLM

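The loss function intuitively: Increase the likelihood of the chosen response relative to the rejected one, with both measured as log-ratios against the reference model (SFT baseline). Pairs that the model's implicit reward currently ranks in the wrong order contribute the largest gradient.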
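
The loss for a single preference pair can be sketched as below. This is a minimal, framework-free illustration that assumes the sequence log-probabilities have already been computed; the function name, argument names, and the default beta are hypothetical, and a real implementation would operate on batched tensors.

```python
import math


def dpo_loss(logp_chosen: float, logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """DPO loss for one (chosen, rejected) pair.

    Each argument is the summed token log-probability of a full response,
    under the policy being trained (logp_*) or the frozen reference (ref_logp_*).
    """
    # Implicit rewards: beta * log(pi_theta / pi_ref) for each response
    reward_chosen = beta * (logp_chosen - ref_logp_chosen)
    reward_rejected = beta * (logp_rejected - ref_logp_rejected)
    margin = reward_chosen - reward_rejected
    # -log(sigmoid(margin)), computed stably as softplus(-margin)
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
```

When the policy and the reference assign identical log-probabilities, the margin is zero and the loss is log 2. Raising the chosen response's log-probability relative to the reference, or lowering the rejected one's, pushes the loss below that value.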

Strengths:

Much simpler: one preference-training stage after SFT, using a supervised-style loss with no sampling loop
More stable training, lower compute cost
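No separate RM to game, although the model can still overfit the preference data or pick up spurious cues such as response length
Reported to match or approach PPO-based RLHF on several benchmarks, especially at smaller model scales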

Weaknesses:

Requires high-quality offline preference data (no online exploration)
Can underperform RLHF on very complex tasks where exploration matters
Sensitive to the quality and distribution of the preference dataset

Side-by-Side Comparison

Aspect	RLHF	DPO
Reward model	Explicit, separately trained	Implicit, baked into LLM loss
Training algorithm	PPO (RL)	Supervised cross-entropy-style loss
Complexity	High (multiple models, RL loop)	Low (1 trained model plus a frozen reference, 1 loss)
Stability	More fragile	More stable
Compute cost	High	Lower
Online exploration	Yes	No (offline only)
Used by	InstructGPT, early ChatGPT	Llama 3, Mistral, many open models

Where Things Stand

DPO has become the dominant approach in open-source alignment (Llama 3, Mistral, Phi, etc.) because of its simplicity. However, frontier labs (OpenAI, Google DeepMind) are generally understood to still rely on variants of RLHF, often combining offline DPO-style methods with online RL to get the simplicity of the former and the exploration of the latter. Hybrid approaches, such as RLHF initialized from a DPO-trained policy or online DPO (applying the DPO loss to freshly sampled responses), are active research areas.