GRPO — Group Relative Policy Optimization
GRPO is a reinforcement learning algorithm introduced by DeepSeek (in DeepSeekMath, and later used for DeepSeek-R1) as a more efficient alternative to PPO for fine-tuning LLMs with RL.

Core Idea

PPO needs a separate value model (critic), typically of comparable size to the policy, to estimate the baseline for advantage calculation. That roughly doubles memory and compute.

GRPO drops the critic entirely. Instead, for each prompt it samples a group of G outputs from the current policy, scores each with the reward model, and uses the group's mean reward as the baseline, with the group's standard deviation as the normalizer: ...
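
As a rough illustration of the group baseline described above, the sketch below computes one advantage per sampled output as (reward - group mean) / group standard deviation. The function name `group_relative_advantages`, the `eps` term that guards against division by zero, and the use of the population standard deviation are assumptions made for this example, not details taken from the text; real implementations may differ on each point.

```python
from statistics import fmean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Return one advantage per output in a group sampled for a single prompt.

    rewards: list of scalar rewards, one per sampled output (length G).
    eps: small constant (an assumption of this sketch) so that a group
         with identical rewards does not divide by zero.
    """
    mean = fmean(rewards)   # group mean acts as the baseline
    std = pstdev(rewards)   # population std; sample std is another possible choice
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical rewards for a group of G = 4 outputs to the same prompt.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)

# Identical rewards carry no learning signal: every advantage is zero.
flat = group_relative_advantages([0.5, 0.5, 0.5])
```

Outputs that score above the group mean get a positive advantage and those below get a negative one, so no learned value model is needed to tell them apart.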