Distillation in AI (also called knowledge distillation) is a model compression technique where a smaller “student” model is trained to mimic the behavior of a larger, more capable “teacher” model.

How it works

Rather than training the student only on hard labels (e.g., “this image is a cat”), distillation also trains it on the teacher’s soft outputs: the probability distribution the teacher assigns across all classes. These soft outputs carry richer information. For example, knowing that the teacher rates an image as 70% cat, 20% leopard, and 10% tiger tells the student which classes the teacher considers similar, something the bare label “cat” cannot convey.
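A minimal sketch of the classic soft-target loss may make this concrete. It assumes the common formulation in which the student's and teacher's logits are both softened by a temperature, compared with KL divergence, and blended with ordinary cross-entropy on the hard label. The function names, the temperature of 2.0, and the mixing weight alpha are illustrative choices, not taken from any particular library.

```python
import math

def softmax(logits, temperature=1.0):
    """Turn raw scores into probabilities; a higher temperature flattens them."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same classes."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def distillation_loss(teacher_logits, student_logits, hard_label,
                      temperature=2.0, alpha=0.5):
    """Blend a soft-target term with a hard-label term (illustrative weights)."""
    soft_teacher = softmax(teacher_logits, temperature)
    soft_student = softmax(student_logits, temperature)
    # Multiplying by T^2 keeps the soft term's gradient scale comparable
    # across temperatures.
    soft_loss = kl_divergence(soft_teacher, soft_student) * temperature ** 2
    # Standard cross-entropy against the ground-truth class at temperature 1.
    hard_loss = -math.log(softmax(student_logits)[hard_label])
    return alpha * soft_loss + (1 - alpha) * hard_loss

# Teacher logits chosen so that softmax gives 70% cat, 20% leopard, 10% tiger.
teacher_logits = [math.log(0.7), math.log(0.2), math.log(0.1)]
student_logits = [0.0, 0.0, 0.0]  # an untrained student: uniform guess
loss = distillation_loss(teacher_logits, student_logits, hard_label=0)
```

In a real training loop the same quantity would be computed on tensors and minimized by gradient descent on the student's parameters; the teacher's logits are treated as fixed.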

Why it matters

Large models are expensive to run. Distillation lets you compress their “knowledge” into a smaller model that:

  • Is faster and cheaper to serve
  • Uses less memory
  • Often comes close to the teacher’s accuracy on the target task

Common applications

  • Edge deployment: running models on phones or IoT devices
  • Speculative decoding: a small draft model, often distilled from the large model, proposes tokens that the large model then verifies, which speeds up inference
  • LLM training: newer, smaller models trained on outputs from larger frontier models (e.g., DeepSeek’s R1 distilled variants were trained on reasoning traces from a larger model)
  • Task-specific compression: training a small specialist model on a large general model’s outputs for one task
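To illustrate the draft-and-verify item in the list above, here is a toy sketch of the greedy variant: the draft proposes a few tokens, the large model checks them, and the longest agreeing prefix is kept. Both "models" are hypothetical stand-in functions over integer tokens, and the draft is made to disagree with the target at every fourth position so that rejection actually occurs. Real systems verify all drafted positions in one batched forward pass and usually use a probabilistic acceptance rule; neither is shown here.

```python
def target_next(context):
    """Hypothetical large model: deterministic next token for a context."""
    return (sum(context) * 31 + 7) % 50

def draft_next(context):
    """Hypothetical small draft model: agrees with the target most of the time."""
    guess = target_next(context)
    return guess if len(context) % 4 else (guess + 1) % 50

def greedy_decode(prompt, num_new_tokens):
    """Baseline: call the large model once per generated token."""
    tokens = list(prompt)
    for _ in range(num_new_tokens):
        tokens.append(target_next(tokens))
    return tokens

def speculative_decode(prompt, num_new_tokens, k=4):
    """Draft k tokens, keep the prefix the target agrees with, then repeat."""
    tokens = list(prompt)
    goal = len(prompt) + num_new_tokens
    while len(tokens) < goal:
        # 1. The draft model proposes k tokens autoregressively.
        proposal = []
        for _ in range(k):
            proposal.append(draft_next(tokens + proposal))
        # 2. The target checks each proposed position in order.
        for i, token in enumerate(proposal):
            expected = target_next(tokens + proposal[:i])
            if token != expected:
                # Keep the agreed prefix, then the target's own token.
                tokens.extend(proposal[:i])
                tokens.append(expected)
                break
        else:
            tokens.extend(proposal)  # every drafted token was accepted
    return tokens[:goal]

prompt = [1, 2, 3]
fast = speculative_decode(prompt, 10)
slow = greedy_decode(prompt, 10)
```

Under this greedy acceptance rule the output matches what the large model would have produced alone; the speedup comes from accepting several drafted tokens per verification step. The better the draft imitates the target, which is where distillation helps, the more tokens are accepted.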

A nuance worth knowing

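There’s a distinction between distillation from logits (the teacher’s raw, pre-softmax scores, which determine its output probabilities) and distillation from reasoning traces or chain-of-thought. The latter is more common in modern LLM work: the student is fine-tuned on text the teacher generates, so it learns to replicate the teacher’s step-by-step reasoning rather than just match its token probabilities. It also requires only the teacher’s text outputs, not access to its internals.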
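As a rough sketch of the trace-based approach, the code below builds a small supervised fine-tuning dataset from teacher outputs. Everything in it is an assumption made for illustration: `teacher_generate` is a hypothetical stand-in for a call to a large model, the arithmetic problems are toy data, and filtering traces against a known reference answer is one common curation choice rather than a required step.

```python
def teacher_generate(question):
    """Hypothetical teacher call returning a reasoning trace and an answer."""
    a, b = question
    total = a + b
    if a == 3:
        total += 1  # simulate an occasional teacher mistake
    return {"reasoning": f"{a} plus {b} equals {total}.", "answer": total}

def build_trace_dataset(problems):
    """Keep only traces whose final answer matches the reference answer."""
    dataset = []
    for question, reference in problems:
        output = teacher_generate(question)
        if output["answer"] != reference:
            continue  # discard traces that end in a wrong answer
        a, b = question
        dataset.append({
            "prompt": f"What is {a} + {b}?",
            # The student is trained to reproduce the reasoning, not just the answer.
            "completion": f"{output['reasoning']} Answer: {output['answer']}",
        })
    return dataset

# Toy problems as ((a, b), reference_answer) pairs.
problems = [((1, 2), 3), ((3, 4), 7), ((5, 6), 11)]
dataset = build_trace_dataset(problems)
```

The resulting prompt and completion pairs would then be used for ordinary next-token fine-tuning of the student, with no logits from the teacher involved.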

In short: distillation is how the AI field takes big expensive models and squeezes their capabilities into small, deployable ones.