Diffusion models are a class of generative AI models that create data (images, audio, video, etc.) by learning to reverse a gradual noising process.

The Core Idea

The approach rests on two processes:

Forward process (destroying data): Take a real image and progressively add Gaussian noise over many steps (say, 1000 steps) until it becomes pure random noise. This is fixed and requires no learning.
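As a rough illustration of the forward process, the sketch below uses the closed-form shortcut from the DDPM formulation, which jumps straight to any noise level t instead of looping over every step. The linear schedule values (1e-4 to 0.02), the function names, and the four-value toy "image" are assumptions chosen for the example, not details from the text above.

```python
import math
import random

def make_alpha_bars(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Linear beta schedule; returns alpha_bar_t = product of (1 - beta_s) for s <= t."""
    alpha_bars, running = [], 1.0
    for i in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * i / (num_steps - 1)
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars

def q_sample(x0, t, alpha_bars, rng):
    """Draw x_t ~ q(x_t | x_0) in one shot; returns (x_t, noise)."""
    signal_scale = math.sqrt(alpha_bars[t])        # how much of the data survives
    noise_scale = math.sqrt(1.0 - alpha_bars[t])   # how much noise is mixed in
    noise = [rng.gauss(0.0, 1.0) for _ in x0]
    x_t = [signal_scale * x + noise_scale * e for x, e in zip(x0, noise)]
    return x_t, noise

rng = random.Random(0)
alpha_bars = make_alpha_bars()
x0 = [0.5, -0.25, 1.0, 0.0]                      # toy "image" with four pixel values
x_early, _ = q_sample(x0, 10, alpha_bars, rng)   # still close to x0
x_late, _ = q_sample(x0, 999, alpha_bars, rng)   # almost pure noise
```

Because alpha_bar shrinks toward zero as t grows, the data term fades and the noise term takes over, which is why nothing here needs to be learned.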

Reverse process (learning to reconstruct): Train a neural network (usually a U-Net or Transformer) to predict and remove the noise at each step — essentially learning to “denoise.” At inference time, you start from pure noise and repeatedly apply this denoising to generate a new sample.
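In the DDPM formulation (Ho et al., 2020), this noise-prediction training objective is usually written as a simple squared error. The notation is an assumption introduced here for illustration: x_0 is a clean sample, \epsilon is the Gaussian noise that was added, \beta_s is the noise added at step s, and \epsilon_\theta is the denoising network.

```latex
\mathcal{L}_{\text{simple}}(\theta)
  = \mathbb{E}_{x_0,\; t,\; \epsilon \sim \mathcal{N}(0, I)}
    \left[ \left\| \epsilon
      - \epsilon_\theta\!\left( \sqrt{\bar\alpha_t}\, x_0
        + \sqrt{1 - \bar\alpha_t}\, \epsilon,\; t \right) \right\|^2 \right],
\qquad
\bar\alpha_t = \prod_{s=1}^{t} (1 - \beta_s)
```

In words: pick a random step t, noise a real sample to that level, and penalize the network for mispredicting the noise that was mixed in.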

Why It Works

The model never learns to go from noise → image in one shot (too hard). Instead it learns a much simpler local question at each step: “given this slightly noisy image, what’s the noise I should subtract?” Chaining 1000 such small steps produces a coherent sample.
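The chaining of small steps can be sketched as a DDPM-style sampling loop. This is a minimal scalar sketch under stated assumptions: the trained network is replaced by a hypothetical `oracle_noise` function that knows the exact noise for a toy dataset containing the single value 0.7, and the linear schedule and step variance (beta_t) are one common choice rather than the only one.

```python
import math
import random

def make_schedule(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Linear betas plus their cumulative products alpha_bar_t."""
    betas = [beta_start + (beta_end - beta_start) * i / (num_steps - 1)
             for i in range(num_steps)]
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return betas, alpha_bars

def ddpm_sample(predict_noise, betas, alpha_bars, rng):
    """Ancestral sampling for one scalar: start from noise, denoise step by step."""
    x = rng.gauss(0.0, 1.0)                        # x_T ~ N(0, 1)
    for t in reversed(range(len(betas))):
        eps_hat = predict_noise(x, t)
        # Subtract the predicted noise, then undo this step's shrinkage.
        mean = (x - betas[t] / math.sqrt(1.0 - alpha_bars[t]) * eps_hat) \
            / math.sqrt(1.0 - betas[t])
        # Re-inject a little fresh noise on every step except the last.
        x = mean + (math.sqrt(betas[t]) * rng.gauss(0.0, 1.0) if t > 0 else 0.0)
    return x

betas, alpha_bars = make_schedule()
DATA_POINT = 0.7   # toy "dataset" containing a single value

def oracle_noise(x_t, t):
    """Hypothetical stand-in for the trained network (exact for the one-point dataset)."""
    return (x_t - math.sqrt(alpha_bars[t]) * DATA_POINT) / math.sqrt(1.0 - alpha_bars[t])

sample = ddpm_sample(oracle_noise, betas, alpha_bars, random.Random(0))
```

With a perfect noise predictor the loop lands on the only point in the toy dataset; a real network is only approximately right at each step, which is one reason many small corrections tend to work better than a single large one.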

Key Variants

  • DDPM (Denoising Diffusion Probabilistic Models): the foundational formulation (Ho et al., 2020)
  • DDIM (Denoising Diffusion Implicit Models): a sampler that reuses a DDPM-trained network but skips steps, cutting inference from 1000 to roughly 50 steps
  • Latent Diffusion Models (LDM): run the diffusion process in a compressed latent space rather than pixel space, dramatically cutting compute. This is what Stable Diffusion uses.
  • Classifier-Free Guidance (CFG): a technique to steer generation toward a text prompt. A single network is trained with the prompt randomly dropped, so it can make both conditional and unconditional predictions; at sampling time the two are combined, pushing the result further toward the prompt.
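The two halves of classifier-free guidance can be sketched in a few lines. The function names, the "null" embedding placeholder, and the made-up prediction values are assumptions for illustration; in a real system both predictions come from the same network called with and without the prompt.

```python
import random

def maybe_drop_prompt(prompt_embedding, null_embedding, drop_prob, rng):
    """Training side: occasionally swap the prompt for a 'null' embedding."""
    return null_embedding if rng.random() < drop_prob else prompt_embedding

def guided_noise(eps_uncond, eps_cond, guidance_scale):
    """Sampling side: extrapolate from the unconditional toward the conditional prediction."""
    return [u + guidance_scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]

eps_uncond = [0.0, 1.0]   # made-up network outputs for illustration
eps_cond = [1.0, 1.0]
guided = guided_noise(eps_uncond, eps_cond, guidance_scale=2.0)
```

A scale of 0 ignores the prompt, 1 reproduces the plain conditional prediction, and values above 1 exaggerate whatever the prompt changed, typically trading some diversity for prompt adherence.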

How Text-to-Image Works

Models like Stable Diffusion, DALL-E 3, and Flux add conditioning: the denoising network also takes a text embedding (from a CLIP or T5 encoder) as input at every step. The network learns to denoise toward an image that matches the prompt, not just any coherent image.

Comparison to Other Generative Models

Model         | Mechanism                   | Tradeoffs
Diffusion     | Reverse noising             | High quality, slow sampling
GAN           | Generator vs. discriminator | Fast, but training instability
VAE           | Encode → latent → decode    | Fast, but blurry outputs
Flow Matching | Learn a vector field (ODE)  | Cleaner math, increasingly dominant

Why They Became Dominant

Diffusion models generally match or beat GANs on sample quality and cover more of the data distribution (better diversity), while avoiding GANs' notorious training instability. The latent diffusion trick made them practical at scale, leading to the current generation of image/video/audio models (Stable Diffusion, Sora, Udio, etc.).

Flow matching (used in Flux, Stable Diffusion 3, and Meta’s models) is now emerging as a cleaner successor: same intuition, but the network learns a velocity field that carries noise to data along near-straight paths, rather than retracing a noisy diffusion trajectory.
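A minimal scalar sketch of the flow matching idea follows. It assumes the convention that t=0 is noise and t=1 is data (some implementations reverse this), and it replaces the trained network with a hypothetical `oracle_velocity` that is exact for a toy dataset containing the single value 0.7.

```python
import random

def flow_matching_pair(noise, data, t):
    """Training side: a point on the straight line from noise to data, plus its velocity target."""
    x_t = (1.0 - t) * noise + t * data
    target_velocity = data - noise          # constant along the straight path
    return x_t, target_velocity

def euler_sample(velocity, x_start, num_steps=10):
    """Sampling side: integrate dx/dt = velocity(x, t) from t=0 (noise) to t=1 (data)."""
    x, step = x_start, 1.0 / num_steps
    for k in range(num_steps):
        x += step * velocity(x, k * step)
    return x

DATA_POINT = 0.7   # toy "dataset" containing a single value

def oracle_velocity(x, t):
    """Hypothetical stand-in for the trained network (exact for the one-point dataset)."""
    return (DATA_POINT - x) / (1.0 - t)

sample = euler_sample(oracle_velocity, random.Random(0).gauss(0.0, 1.0))
```

Because the path in this toy case is a straight line, ten Euler steps are enough to reach the data point. With real data the learned paths are only approximately straight, so the step count remains a quality versus speed tradeoff.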