RLHF | knowledged.to

The Modern LLM Training Pipeline

Explains the four-stage modern LLM training pipeline from pre-training through verifiable-reward RL.

Where RL Fits: Training vs. Inference in the LLM Pipeline

Explains that RL in LLMs is a training/alignment stage, not inference, with pipeline context.

Model Drift

Overview of model drift, detection, mitigation, and LLM-specific issues like knowledge staleness and provider drift.

PPO — Proximal Policy Optimization

Overview of PPO, the clipped policy-gradient RL algorithm used in RLHF for InstructGPT and the original ChatGPT.

GRPO — Group Relative Policy Optimization

Critic-free RL algorithm that replaces PPO's value model with group-relative rewards for LLM fine-tuning.

LLM as Judge

Using a language model to evaluate another model's outputs as a scalable proxy for human preference judgments.

Fine-Tuning Techniques for LLMs

Comprehensive guide to LLM fine-tuning methods, including full, parameter-efficient, and preference-based approaches, with modern recipes and tools such as LoRA and DPO.