The Modern LLM Training Pipeline

The modern LLM training pipeline has four main stages, each producing a more capable and aligned model from the one before.

1. Pre-training

The model learns language and world knowledge by predicting the next token over a massive corpus (trillions of tokens of web text, code, books). This is self-supervised (the training signal comes from the text itself, with no human labels) and accounts for the vast majority of compute. The output is a “base model” that has broad knowledge but isn’t good at following instructions; it just continues text. Key levers here are data quality/filtering, scale (parameters × tokens, which together set the compute budget and are balanced using scaling laws), and architecture (typically a decoder-only Transformer).
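The next-token objective described above can be sketched as an average cross-entropy over positions. This is a minimal illustration in plain Python, not a training loop; the function name `next_token_loss` and the list-of-lists `logits` layout are assumptions made for the example.

```python
import math

def next_token_loss(logits, token_ids):
    """Average next-token cross-entropy for one sequence.

    Assumed layout (hypothetical): logits[t] is the model's score vector over
    the vocabulary after reading token_ids[:t + 1], so the target for
    position t is token_ids[t + 1].
    """
    if len(token_ids) < 2:
        raise ValueError("need at least two tokens to form one prediction")
    total = 0.0
    for t in range(len(token_ids) - 1):
        scores = logits[t]
        # Log of the softmax normalizer, with max subtraction for stability.
        m = max(scores)
        log_z = m + math.log(sum(math.exp(s - m) for s in scores))
        # Negative log-probability assigned to the token that actually came next.
        total += log_z - scores[token_ids[t + 1]]
    return total / (len(token_ids) - 1)
```

With uniform scores over a vocabulary of size V, the loss equals log(V), which is a useful sanity check: an untrained model should start near that value and improve from there.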

2. Supervised fine-tuning (SFT) / instruction tuning

The base model is fine-tuned on a smaller, curated set of high-quality prompt–response pairs demonstrating desired behavior (answering questions, following instructions, formatting, refusing harmful requests). This teaches the model the form of being a helpful assistant. Data is often human-written or carefully filtered, and increasingly includes synthetic data generated and vetted by other models.
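One way to turn a prompt–response pair into a training example is sketched below. It assumes a common convention that the text does not specify: the loss is computed only on response tokens, with prompt positions marked by a sentinel label. The sentinel value `-100` and the function name `build_sft_example` are illustrative choices.

```python
# Assumption: a sentinel label meaning "do not compute loss at this position".
IGNORE_INDEX = -100

def build_sft_example(prompt_ids, response_ids):
    """Concatenate prompt and response token ids and build per-token labels.

    Prompt positions are masked so the model is trained to produce the
    response given the prompt, not to reproduce the prompt itself.
    """
    input_ids = list(prompt_ids) + list(response_ids)
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(response_ids)
    return input_ids, labels

def count_supervised_tokens(labels):
    """Number of positions that contribute to the loss."""
    return sum(1 for label in labels if label != IGNORE_INDEX)
```

Masking the prompt is a design choice rather than a requirement; some recipes train on all tokens instead.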

3. Preference optimization / alignment (RLHF and successors)

The model is tuned to match human preferences about which response is better. The classic recipe is RLHF: humans rank model outputs, a reward model is trained on those rankings, and the LLM is optimized against that reward with reinforcement learning (classically PPO, usually with a KL penalty that keeps the model close to its SFT starting point). Lighter-weight alternatives like DPO (Direct Preference Optimization) skip the separate reward model and optimize directly on preference pairs. This stage shapes helpfulness, tone, safety, and honesty. Constitutional AI / RLAIF variants replace some human feedback with AI-generated feedback against a written set of principles.
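The two preference objectives mentioned above share the same pairwise form, which the sketch below tries to make visible. It operates on scalar rewards and sequence log-probabilities for a single preference pair; the function names and the default `beta=0.1` are assumptions for illustration, not values taken from the text.

```python
import math

def neg_log_sigmoid(z):
    """Numerically stable -log(sigmoid(z))."""
    if z >= 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))

def reward_model_loss(reward_chosen, reward_rejected):
    """Pairwise (Bradley-Terry style) loss for training a reward model:
    small when the chosen response scores higher than the rejected one."""
    return neg_log_sigmoid(reward_chosen - reward_rejected)

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one preference pair.

    Each argument is the summed log-probability of a full response under the
    policy being trained or under a frozen reference model. The log-ratio
    against the reference plays the role of an implicit reward.
    """
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    return neg_log_sigmoid(beta * (chosen_margin - rejected_margin))
```

When the policy equals the reference, both margins are zero and the DPO loss is log(2); it drops as the policy raises the chosen response's likelihood relative to the rejected one.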

4. Reasoning / RL on verifiable rewards (the newer stage)

The most recent addition trains models to reason through long chains of thought, optimized with reinforcement learning against automatically checkable rewards: math problems with known answers, code that must pass unit tests, and other tasks whose outcome can be verified. This is what produces “reasoning models” that think before answering. It scales differently because correctness can be checked programmatically rather than by humans, so the reward signal is not limited by how much human labeling is available.
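A programmatic reward of the kind described above might look like the following sketch for math problems. It assumes, hypothetically, that completions end with a line of the form `Answer: <number>`; the marker, the function names, and the binary 0/1 reward are all illustrative choices.

```python
from fractions import Fraction

def extract_final_answer(completion, marker="Answer:"):
    """Return the text after the last occurrence of the marker, or None."""
    idx = completion.rfind(marker)
    if idx == -1:
        return None
    tail = completion[idx + len(marker):].strip()
    if not tail:
        return None
    return tail.splitlines()[0].strip()

def math_reward(completion, reference):
    """Return 1.0 if the final answer equals the reference numerically, else 0.0.

    Fractions are used so that "1/2" and "0.5" compare equal exactly.
    """
    answer = extract_final_answer(completion)
    if answer is None:
        return 0.0
    try:
        return 1.0 if Fraction(answer) == Fraction(reference) else 0.0
    except (ValueError, ZeroDivisionError):
        # Unparseable answers earn no reward.
        return 0.0
```

Only the final answer is checked here; the chain of thought before the marker is not scored, which is what lets the check run without a human in the loop.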

Surrounding all stages

Data curation is arguably the highest-leverage work at every stage — deduplication, quality filtering, decontamination, and mixing ratios.
Evaluation and red-teaming run continuously to measure capability and catch safety failures.
Post-training also includes tool-use/function-calling training, long-context extension, and distillation (training a smaller model to mimic a larger one).
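Two of the curation steps listed above, deduplication and decontamination, can be sketched with standard-library tools. This is a simplified illustration that assumes exact-match deduplication after whitespace and case normalization, and word n-gram overlap as the contamination test; the function names and the default `n=8` are hypothetical choices.

```python
import hashlib

def normalize(text):
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return " ".join(text.lower().split())

def deduplicate(docs):
    """Keep the first occurrence of each document, comparing normalized text."""
    seen = set()
    kept = []
    for doc in docs:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(doc)
    return kept

def word_ngrams(text, n):
    """Set of word n-grams in the normalized text."""
    words = normalize(text).split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def is_contaminated(doc, benchmark_items, n=8):
    """True if the document shares any word n-gram with a benchmark item."""
    doc_grams = word_ngrams(doc, n)
    return any(doc_grams & word_ngrams(item, n) for item in benchmark_items)
```

Exact hashing misses near-duplicates such as pages that differ by a footer; handling those needs fuzzy matching, which is outside this sketch.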

Mental model

Pre-training builds raw knowledge and capability, while everything after it (“post-training”) is about eliciting and shaping that capability into a safe, useful assistant. Most of the cost is in stage 1, but most of the felt quality of a modern model comes from stages 2–4.
