RL | knowledged.to

RLVR vs. the Agent Loop: Training-Time vs. Inference-Time

Distinguishes RLVR as training-time weight updates from inference-time agent verification loops.

Explains that RL for LLMs is a training-time alignment stage, not something that runs at inference, and shows where it sits in the training pipeline.

Teen-friendly explainer of reinforcement learning agents, rewards, exploration, delayed rewards, and applications.

Overview of PPO, the clipped policy-gradient RL algorithm used in RLHF for InstructGPT and the original ChatGPT.

Critic-free RL algorithm that replaces PPO's value model with group-relative rewards for LLM fine-tuning.