PPO — Proximal Policy Optimization

PPO is a reinforcement learning algorithm from OpenAI (Schulman et al., 2017) that became the default workhorse for RLHF; it’s what trained InstructGPT and the original ChatGPT. Core Idea: Policy gradient methods are unstable because a single large update can collapse the policy. PPO fixes this by staying close to the previous policy on each update (the “proximal” part). It does this with a clipped surrogate objective: ...
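The excerpt is cut off before the objective itself, so the following is only a minimal sketch of the clipped surrogate in its standard form from Schulman et al. (2017), for a single sample. The function name `clipped_surrogate`, its parameter names, and the `clip_eps=0.2` default are illustrative assumptions, not taken from the post.

```python
import math


def clipped_surrogate(logp_new, logp_old, advantage, clip_eps=0.2):
    """Per-sample PPO clipped surrogate (a quantity to maximize).

    logp_new / logp_old are log-probabilities of the same action under the
    updated and previous policy. clip_eps=0.2 is an assumed default.
    """
    # Probability ratio between the new and the previous policy.
    ratio = math.exp(logp_new - logp_old)
    # Keep the ratio inside [1 - clip_eps, 1 + clip_eps].
    clipped_ratio = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
    # Take the more pessimistic of the unclipped and clipped terms.
    return min(ratio * advantage, clipped_ratio * advantage)


# A ratio of 1.5 with a positive advantage is capped at 1 + clip_eps.
capped = clipped_surrogate(math.log(1.5), 0.0, advantage=1.0)
# With a negative advantage the same ratio is not capped, so the penalty stays.
uncapped = clipped_surrogate(math.log(1.5), 0.0, advantage=-1.0)
```

Taking the minimum removes the incentive to push the ratio far from 1 in the direction that would improve the objective, which is how the update stays close to the previous policy.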

May 19, 2026 · 2 min

GRPO — Group Relative Policy Optimization

GRPO is a reinforcement learning algorithm introduced by DeepSeek (in DeepSeekMath, later used for DeepSeek-R1) as a more efficient alternative to PPO for fine-tuning LLMs with RL. Core Idea: PPO needs a separate value model (critic), typically of comparable size to the policy, to estimate the baseline for advantage calculation. That roughly doubles memory and compute. GRPO drops the critic entirely. Instead, for each prompt it samples a group of G outputs from the current policy, scores each with the reward model, and uses the group’s mean and standard deviation as the baseline: ...
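The group baseline described above can be sketched in a few lines. This is a minimal illustration, not the post's own code: the function name `group_relative_advantages`, the `eps` stabilizer, and the choice of population standard deviation are assumptions, and the rewards are made-up values for one prompt.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own group's mean and std.

    rewards: scores for the G outputs sampled for a single prompt.
    eps: assumed small constant to avoid dividing by zero when all
         rewards in the group are equal.
    """
    mean = statistics.fmean(rewards)
    # Population std is an assumption; some implementations use sample std.
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical rewards for a group of G = 4 outputs to the same prompt.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

Outputs that score above the group mean get a positive advantage and those below get a negative one, so no learned value model is needed to provide the baseline.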

May 19, 2026 · 2 min

Fine-Tuning Techniques for LLMs

Fine-tuning techniques can be grouped along a few axes: what you optimize (full weights vs. small additions), what signal you train on (labels, instructions, preferences, rewards), and how the data is generated (human, synthetic, AI-judged). Full Fine-Tuning (FFT): Update every parameter in the model on a target dataset. Highest capacity, but expensive in memory and prone to catastrophic forgetting. Mostly reserved for smaller models or when you have lots of high-quality data and compute. ...

April 25, 2026 · 4 min

Unsloth Studio — Fine-tuning Dataset Formats

Unsloth Studio supports several dataset formats depending on your fine-tuning goal. Files can be uploaded directly as JSONL, JSON, CSV, Parquet, PDF, or DOCX. Format Overview: 1. Raw Text (Continued Pretraining): Used to inject domain knowledge without any structure. The model learns from continuous prose. Example: “The mitochondria is the powerhouse of the cell. ATP synthesis occurs via oxidative phosphorylation...” Best for: books, articles, documentation dumps, codebases. ...

April 23, 2026 · 5 min