Fine-Tuning Techniques for LLMs

Fine-tuning techniques can be grouped along a few axes: what you optimize (full weights vs. small additions), what signal you train on (labels, instructions, preferences, rewards), and how the data is generated (human, synthetic, AI-judged).

Full Fine-Tuning (FFT)

Update every parameter in the model on a target dataset. Highest capacity, but expensive in memory and prone to catastrophic forgetting. Mostly reserved for smaller models or when you have lots of high-quality data and compute.
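
To make "expensive in memory" concrete, here is a back-of-the-envelope estimate. It assumes mixed-precision training with Adam at roughly 16 bytes per parameter and ignores activations entirely; the function name and the 7B example are illustrative, not taken from any particular framework.

```python
def fft_training_memory_gb(n_params: float, bytes_per_param: int = 16) -> float:
    """Rough memory for weights, gradients, and Adam state (no activations).

    Assumed per-parameter breakdown for mixed-precision Adam:
      2 bytes fp16 weights + 2 bytes fp16 gradients
      + 4 bytes fp32 master weights + 4 + 4 bytes fp32 Adam moments = 16 bytes.
    """
    return n_params * bytes_per_param / 1e9  # decimal gigabytes


print(fft_training_memory_gb(7e9))  # 112.0
```

Under these assumptions a 7B model needs on the order of 112 GB before any activations are stored, which is why FFT usually means multi-GPU sharding.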

Supervised Fine-Tuning (SFT) / Instruction Tuning

Train on (prompt, ideal_response) pairs with standard cross-entropy, usually computed on the response tokens only. “Instruction tuning” is just SFT where the dataset is a mix of tasks phrased as instructions (FLAN, Alpaca, etc.). This is almost always step one in the post-training pipeline, before any preference work.
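
A minimal sketch of the SFT objective, assuming the common (but not universal) choice of masking out prompt tokens so only response tokens contribute to the loss. The function names and the toy 4-token vocabulary are hypothetical.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one position's vocabulary logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def sft_loss(logits, targets, loss_mask):
    """Mean cross-entropy over positions where loss_mask is 1.

    logits:    one list of vocabulary logits per position
    targets:   the correct next-token id per position
    loss_mask: 0 for prompt positions, 1 for response positions (assumption)
    """
    total, count = 0.0, 0
    for row, target, keep in zip(logits, targets, loss_mask):
        if keep:
            total -= log_softmax(row)[target]
            count += 1
    return total / count if count else 0.0


# Uniform logits over a 4-token vocabulary: loss is ln(4), about 1.386.
uniform = [[0.0, 0.0, 0.0, 0.0]] * 3
print(sft_loss(uniform, targets=[1, 2, 3], loss_mask=[0, 1, 1]))
```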

Continued Pretraining / Domain-Adaptive Pretraining

Same objective as pretraining (next-token prediction) but on a domain corpus — code, legal, medical, your company’s docs. Done before SFT when you need to inject knowledge rather than behavior.
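
As an illustration of the next-token objective on a raw corpus, the sketch below builds (input, target) pairs by shifting a token stream by one position. Packing documents into fixed-length blocks with an end-of-sequence separator is an assumption here (a common practice, not something this recipe requires), and the token ids are made up.

```python
def pack_into_blocks(docs, block_size, eos_id):
    """Concatenate tokenized docs and cut them into next-token training pairs.

    Each example is (inputs, targets) where targets is inputs shifted left by
    one token, so position i is trained to predict token i + 1.
    """
    stream = []
    for doc in docs:
        stream.extend(doc)
        stream.append(eos_id)  # assumed document separator

    examples = []
    # Each example needs block_size + 1 tokens to form the shifted pair.
    for start in range(0, len(stream) - block_size, block_size):
        chunk = stream[start:start + block_size + 1]
        examples.append((chunk[:-1], chunk[1:]))
    return examples


print(pack_into_blocks([[1, 2, 3], [4, 5]], block_size=3, eos_id=0))
# [([1, 2, 3], [2, 3, 0]), ([0, 4, 5], [4, 5, 0])]
```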

Parameter-Efficient Fine-Tuning (PEFT)

Freeze the base model and train only a small number of new or selected parameters. The dominant family in practice:

  • LoRA — inject trainable low-rank matrices A and B such that ΔW = BA is added to the frozen weights. Typically trains well under 1% of the parameters, often with quality close to FFT.
  • QLoRA — load the frozen base in 4-bit (NF4) and train LoRA adapters on top. Lets you fine-tune a 65B model on a single 48 GB GPU.
  • DoRA (Weight-Decomposed LoRA) — decomposes weights into magnitude and direction, trains the magnitude directly, and applies LoRA only to the direction. Closes more of the gap to FFT.
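  • Adapters (Houlsby, Pfeiffer) — small bottleneck MLPs inserted inside each transformer layer, after the attention and/or feed-forward sublayers. Predates LoRA; LoRA has largely replaced them because its update can be merged into the base weights, so it adds no inference latency.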
  • Prefix Tuning — prepend trainable vectors to the keys/values at every layer.
  • Prompt Tuning / Soft Prompts — only learn embeddings prepended to the input; cheapest, weakest.
  • P-Tuning v2 — deep prompt tuning: trainable prompt vectors at every layer, essentially prefix tuning adapted to NLU tasks. Reported to be competitive with FFT on many tasks.
  • IA³ — learn three vectors per layer that rescale keys, values, and FFN activations. Even fewer parameters than LoRA.
  • BitFit — train only the bias terms. Surprisingly decent baseline.
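
The LoRA update ΔW = BA from the list above can be sketched in plain Python. The alpha / r scaling and the initialization (random A, zero B, so ΔW starts at zero) follow the convention in the LoRA paper; the helper names and dimensions are illustrative only, and real implementations use tensor libraries rather than nested lists.

```python
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def init_lora(d_out, d_in, r, seed=0):
    """B is d_out x r (zeros), A is r x d_in (small Gaussian), so BA starts at 0."""
    rng = random.Random(seed)
    A = [[rng.gauss(0.0, 0.01) for _ in range(d_in)] for _ in range(r)]
    B = [[0.0] * r for _ in range(d_out)]
    return B, A


def lora_delta(B, A, alpha, r):
    """Low-rank update delta_W = (alpha / r) * B @ A, shape d_out x d_in."""
    scale = alpha / r
    return [[scale * v for v in row] for row in matmul(B, A)]


def merge(W, delta):
    """Fold the update into the frozen weight, so inference uses one matrix."""
    return [[w + d for w, d in zip(w_row, d_row)] for w_row, d_row in zip(W, delta)]


def lora_param_fraction(d_out, d_in, r):
    """Trainable LoRA parameters as a fraction of the full weight matrix."""
    return r * (d_out + d_in) / (d_out * d_in)


print(lora_param_fraction(4096, 4096, 8))  # 0.00390625
```

For a 4096 x 4096 projection at rank 8 the adapter holds about 0.4% of that matrix's parameters, which is where the "under 1%" figure comes from. Merging is also why LoRA adds no inference latency.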

Preference / Alignment Fine-Tuning

After SFT, you align the model to preferences over responses:

  • RLHF (PPO) — train a reward model on human-ranked pairs, then use PPO to maximize that reward with a KL penalty against the SFT model. The original ChatGPT recipe. Complex, unstable, expensive.
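  • DPO (Direct Preference Optimization) — skips the explicit reward model and the RL loop; optimizes a closed-form, classification-style loss directly on preference pairs, with a frozen reference model (usually the SFT checkpoint) acting as the KL anchor. Much simpler, and now a common default for open-source alignment.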
  • IPO, KTO, ORPO, SimPO — variants that fix specific DPO failure modes (overfitting, needing paired data, requiring a separate SFT stage, etc.). ORPO is notable for combining SFT and preference learning into one stage.
  • GRPO (Group Relative Policy Optimization) — drops the value/critic network of PPO; instead samples a group of completions per prompt and uses their relative rewards as advantages. Used in DeepSeek-R1. Memory-efficient and works well when you have a verifiable reward signal.
  • RLAIF / Constitutional AI — same loop as RLHF, but preferences come from an AI judge guided by a written constitution rather than human labelers.
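
As a concrete instance of a preference loss, here is a sketch of the DPO objective for a single (chosen, rejected) pair. It assumes you already have summed sequence log-probabilities from the policy and from a frozen reference model; the argument names and the beta default are illustrative.

```python
import math


def dpo_loss(policy_chosen_lp, policy_rejected_lp,
             ref_chosen_lp, ref_rejected_lp, beta=0.1):
    """-log sigmoid(beta * margin) for one preference pair.

    margin = (log pi(y_w|x) - log pi_ref(y_w|x))
           - (log pi(y_l|x) - log pi_ref(y_l|x))
    """
    margin = ((policy_chosen_lp - ref_chosen_lp)
              - (policy_rejected_lp - ref_rejected_lp))
    z = beta * margin
    # -log(sigmoid(z)) == log(1 + exp(-z)), written in a stable form.
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))


# Policy identical to the reference: margin is 0, loss is ln(2), about 0.693.
print(dpo_loss(-5.0, -7.0, -5.0, -7.0))
```

The loss falls as the policy raises the chosen response's log-probability relative to the reference and lowers the rejected one, with no sampled rollouts or separate reward model involved.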

Reasoning / Verifiable-Reward Fine-Tuning

The newer branch (DeepSeek-R1, OpenAI o-series style). Use RL (often GRPO) with rule-based, verifiable rewards — does the math answer match? does the code pass the tests? — to elicit long chain-of-thought without needing a learned reward model. Can be combined with distilling the resulting reasoning traces into smaller models.
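
A minimal sketch of the two pieces described here: a rule-based reward and GRPO-style group-relative advantages. The exact-match checker is a hypothetical stand-in for a real verifier, and normalizing by the population standard deviation plus a small epsilon is an assumption (implementations differ on this detail).

```python
import statistics


def exact_match_reward(completion_answer: str, reference_answer: str) -> float:
    """Verifiable reward: 1.0 if the final answer matches, else 0.0."""
    return 1.0 if completion_answer.strip() == reference_answer.strip() else 0.0


def group_relative_advantages(rewards, eps=1e-6):
    """Advantage of each completion relative to its own sampling group.

    A_i = (r_i - mean(r)) / (std(r) + eps), with no learned value network.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Four sampled answers to one prompt whose reference answer is "42".
samples = ["42", "41", " 42 ", "7"]
rewards = [exact_match_reward(s, "42") for s in samples]
print(rewards)  # [1.0, 0.0, 1.0, 0.0]
print(group_relative_advantages(rewards))  # roughly [1, -1, 1, -1]
```

If every completion in a group gets the same reward, all advantages are zero and that prompt contributes no gradient, which is one reason prompt difficulty matters in this setup.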

Knowledge Distillation

Train a student to match a teacher’s outputs. In modern LLM practice this usually means SFT on the teacher’s generated responses (and increasingly its reasoning traces), rather than the original logit-matching formulation. The DeepSeek-R1 distilled variants are the canonical recent example.
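
For contrast with SFT on teacher outputs, the original logit-matching formulation can be sketched as a temperature-softened KL divergence between teacher and student distributions at one token position. The T² factor follows the usual convention from the classic distillation setup; the function names and example logits are illustrative.

```python
import math


def log_softmax(logits, temperature=1.0):
    """Stable log-softmax of logits divided by a temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    log_z = m + math.log(sum(math.exp(x - m) for x in scaled))
    return [x - log_z for x in scaled]


def distillation_kl(teacher_logits, student_logits, temperature=2.0):
    """T^2 * KL(teacher || student) on temperature-softened distributions."""
    log_p = log_softmax(teacher_logits, temperature)
    log_q = log_softmax(student_logits, temperature)
    kl = sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))
    return temperature ** 2 * kl


teacher = [4.0, 1.0, 0.5]
print(distillation_kl(teacher, teacher))          # 0.0 (identical distributions)
print(distillation_kl(teacher, [0.0, 0.0, 0.0]))  # positive
```

This variant needs access to the teacher's logits and a shared vocabulary, which is part of why response-level distillation via SFT is more common with API-only or differently tokenized teachers.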

Multi-task and Mixture Fine-Tuning

Train on a mixture of tasks/datasets simultaneously, often with task-specific prompt templates. T0, FLAN-T5, and most modern instruction-tuned models do this. Helps generalization but requires careful data balancing.
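
One common balancing heuristic is examples-proportional mixing with a size cap, so that very large datasets cannot drown out small ones. The sketch below assumes that scheme; the dataset names, sizes, and cap value are made up for illustration.

```python
import random


def mixing_rates(dataset_sizes, cap):
    """Sampling rate per dataset, proportional to min(size, cap)."""
    capped = {name: min(size, cap) for name, size in dataset_sizes.items()}
    total = sum(capped.values())
    return {name: c / total for name, c in capped.items()}


def sample_task(rates, rng):
    """Pick which dataset the next training example is drawn from."""
    names = list(rates)
    return rng.choices(names, weights=[rates[n] for n in names], k=1)[0]


sizes = {"small_task": 1_000, "large_task": 100_000}  # hypothetical datasets
rates = mixing_rates(sizes, cap=3_000)
print(rates)  # {'small_task': 0.25, 'large_task': 0.75}

rng = random.Random(0)
print([sample_task(rates, rng) for _ in range(5)])
```

Without the cap the small task would be sampled about 1% of the time; with it, 25%. Lowering the cap pushes the mixture toward uniform, raising it pushes toward purely size-proportional sampling.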

Typical Modern Recipe

For a chat model: pretraining → continued pretraining (optional) → SFT → DPO (or RLHF, or GRPO if reasoning-focused), with LoRA/QLoRA used at any stage where you want to save compute.