RLHF and DPO: Aligning AI to Human Preferences
Both techniques address the same core problem: after pre-training on raw text, a language model needs to be steered toward responses that are helpful, safe, and aligned with what humans actually want. They’re two different approaches to the same goal.

RLHF (Reinforcement Learning from Human Feedback)

The idea: Train a separate model to predict what humans prefer, then use that model as a reward signal to fine-tune the LLM via RL. ...
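
To make the reward-model idea concrete, here is a minimal sketch of how such a model is commonly trained on a single preference pair. It assumes a pairwise (Bradley-Terry style) objective, which the text above does not specify. The function name and the scalar reward inputs are hypothetical stand-ins for the outputs of a real reward model.

```python
import math


def pairwise_preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Loss for one preference pair: -log(sigmoid(r_chosen - r_rejected)).

    Assumption: a Bradley-Terry style objective. The loss is small when the
    reward model scores the human-preferred response above the rejected one.
    """
    margin = reward_chosen - reward_rejected
    # -log(sigmoid(m)) equals log(1 + exp(-m)); branch on the sign of the
    # margin so that exp() never receives a large positive argument.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Hypothetical scores a reward model might assign to two responses.
agrees_with_human = pairwise_preference_loss(reward_chosen=2.0, reward_rejected=0.0)
disagrees_with_human = pairwise_preference_loss(reward_chosen=0.0, reward_rejected=2.0)
print(agrees_with_human < disagrees_with_human)  # agreeing with the label costs less
```

In a full pipeline this loss would be averaged over many labeled pairs to fit the reward model, and the fitted model's scores would then serve as the reward signal for the RL fine-tuning step described above.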