PPO — Proximal Policy Optimization
PPO is a reinforcement learning algorithm from OpenAI (Schulman et al., 2017) that became the default workhorse for RLHF — it’s what trained InstructGPT and the original ChatGPT.
Core Idea
Policy gradient methods are unstable because a single large update can collapse the policy. PPO fixes this by staying close to the previous policy on each update — the “proximal” part. It does this with a clipped surrogate objective:
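$$L^{\text{CLIP}}(\theta) = \mathbb{E}\left[\min\left(r(\theta)\,A,\ \operatorname{clip}\left(r(\theta),\,1-\epsilon,\,1+\epsilon\right)A\right)\right]$$

Where:
- r(θ) = probability ratio between the new and old policy
- A = advantage (how much better an action was than the baseline)
- ε = clip range, typically 0.1–0.2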
If the new policy pushes an action's probability ratio outside [1 − ε, 1 + ε] in the direction the advantage favors, the objective goes flat and that sample contributes no gradient. This prevents destructive updates while still allowing improvement.
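A minimal sketch of the clipped objective in plain Python may make the clipping behavior concrete. The function name `clipped_surrogate` and its list-based inputs are illustrative assumptions; a real trainer would compute the same quantity on batched tensors with an autodiff framework.

```python
import math

def clipped_surrogate(logp_new, logp_old, advantages, eps=0.2):
    """Mean clipped surrogate objective over sampled actions (to be maximized).

    logp_new / logp_old: log-probabilities of the sampled actions under the
    new and old policy. advantages: advantage estimates for those actions.
    """
    total = 0.0
    for lp_new, lp_old, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(lp_new - lp_old)                 # r(theta)
        clipped = max(1.0 - eps, min(ratio, 1.0 + eps))   # clip(r, 1-eps, 1+eps)
        # Taking the min makes the objective a pessimistic bound: moving the
        # ratio past the clip range in the favorable direction earns nothing.
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)
```

Note the asymmetry: with a positive advantage, a ratio above 1 + ε is capped, but a ratio below 1 − ε is not, so the policy is still penalized for moving away from a good action.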
In RLHF Specifically
A typical PPO-based RLHF setup keeps four models loaded simultaneously:
- Policy — the LLM being trained
- Reference — frozen copy of the policy for the KL penalty (keeps the model from drifting too far from its SFT origin)
- Reward model — scores completions
- Value model (critic) — estimates expected return for advantage calculation
That fourth model is exactly what GRPO eliminates by using group-relative baselines instead.
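To show how the four models interact on a single completion, here is a rough sketch of one common arrangement: the reference model's log-probabilities give a per-token KL penalty, the reward model's score is added at the final token, and the critic's value estimates feed generalized advantage estimation (GAE). The function names, the default coefficients, and the choice of GAE are assumptions for illustration; implementations differ in the details.

```python
def shaped_rewards(logp_policy, logp_ref, rm_score, kl_coef=0.1):
    """Per-token rewards: KL penalty on every token, reward-model score on the last.

    logp_policy / logp_ref: log-probs of the generated tokens under the policy
    and the frozen reference model. rm_score: scalar from the reward model.
    """
    rewards = [-kl_coef * (lp - lr) for lp, lr in zip(logp_policy, logp_ref)]
    rewards[-1] += rm_score  # the reward model only scores the finished completion
    return rewards

def gae_advantages(rewards, values, gamma=1.0, lam=0.95):
    """Generalized advantage estimation using the critic's per-token values."""
    advantages = [0.0] * len(rewards)
    next_value = 0.0  # nothing follows the final token
    running = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * next_value - values[t]  # TD error
        running = delta + gamma * lam * running
        advantages[t] = running
        next_value = values[t]
    return advantages
```

The critic appears only in `gae_advantages`, which is why replacing its baseline with something else removes the need for the fourth model.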
Why It Dominated
- Simpler than TRPO (its predecessor, which used a hard KL constraint via constrained optimization).
- More stable than vanilla policy gradient.
- Works well across a huge range of tasks — robotics, games, and LLM fine-tuning all use the same algorithm with minimal changes.
Limitations
- Memory-heavy: four models in GPU memory at once.
- Critic is hard to train with sparse or delayed rewards, which are common in RLHF, where the reward only arrives at the end of generation.
- Hyperparameter-sensitive: KL coefficient, clip range, value loss weighting all need tuning.
These limitations motivated alternatives like DPO (no RL at all, direct preference optimization on pairs) and GRPO (drops the critic).
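As a rough illustration of how GRPO gets by without a critic, the sketch below scores several completions sampled for the same prompt and normalizes each reward against the group's mean and standard deviation. The function name, the use of the population standard deviation, and the small `eps` stabilizer are assumptions made for this example rather than details from the text above.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Advantage of each completion relative to its own sampling group.

    rewards: reward-model scores for completions of the same prompt.
    The group mean plays the role of the baseline a critic would provide.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; variants exist
    return [(r - mu) / (sigma + eps) for r in rewards]
```

If every completion in a group receives the same reward, all advantages come out as zero and that prompt contributes no learning signal.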