Denoising Diffusion Probabilistic Models

This paper presents high-quality image synthesis using denoising diffusion probabilistic models, achieving state-of-the-art results on CIFAR10 and LSUN datasets through a novel training objective connecting diffusion models to denoising score matching with Langevin dynamics.

Jonathan Ho, Ajay Jain, Pieter Abbeel

Published 2020-06-19

Imagine you are trying to teach a computer to draw a picture of a cat. Most AI models try to learn by looking at thousands of photos and trying to guess the next pixel, or by fighting against another AI in a game of "fake vs. real."

This paper introduces a different approach called Denoising Diffusion Probabilistic Models (or just "diffusion models"). Here is a simple, everyday explanation of how it works, using a few creative analogies.

1. The Two-Step Dance: The Messy Room and the Tidy Room

Think of the training process as a two-part dance between two states: Chaos and Order.

Step A: The Messy Room (The Forward Process)
Imagine you have a pristine, beautiful painting of a cat. Now, imagine a mischievous child (the "diffusion process") who slowly, step-by-step, throws dust, smudges, and random noise onto the painting.

  • At first, the cat is still clear.
  • After a few steps, it's a bit blurry.
  • After 1,000 steps, the painting is completely covered in static noise. You can't see the cat at all; it looks like pure TV static.

The AI watches this happen. It doesn't need to learn how to make the mess; the rules of the mess are fixed. It just observes how the image turns into noise.
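The "fixed rules of the mess" can be sketched in a few lines of NumPy. This is a toy illustration of the paper's forward process, using a 1-D signal in place of an image; the schedule numbers below are common illustrative choices, not necessarily the paper's exact settings.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "painting": a 1-D signal standing in for an image.
x0 = np.sin(np.linspace(0, 2 * np.pi, 64))

# Variance schedule (beta values): how much noise each step throws on.
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)  # cumulative product: enables the one-shot shortcut

def noisy_at_step(x0, t):
    """Jump straight to step t of the forward process in one shot:
    x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * noise."""
    noise = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * noise

# Early on the painting dominates; by step 1,000 it is almost pure static.
print(alpha_bars[0])    # close to 1: the cat is still clearly visible
print(alpha_bars[-1])   # close to 0: nothing left but TV static
```

A handy consequence of the fixed rules: you never have to simulate all 1,000 smudges one by one, because the shortcut formula lands you at any step directly.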

Step B: The Tidy Room (The Reverse Process)
Now, the AI has to learn the magic trick: How to un-mess the room.
The AI starts with a blank canvas full of static noise (pure chaos). It has to figure out how to remove the noise, step-by-step, to reveal the cat underneath.

  • It looks at the noise and asks, "If I remove a little bit of this static, what does the underlying image look like?"
  • It peels away the noise layer by layer.
  • Eventually, after 1,000 steps of cleaning, the static is gone, and a perfect, high-quality cat appears.

The paper's big breakthrough is that the AI gets really, really good at this "cleaning" job.
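A single "cleaning" step can be sketched like this. Here `predict_noise` is a stand-in for the trained neural network (in the paper it is a U-Net); this toy version returns zeros so the example stays self-contained, but the update formula around it matches the paper's reverse step.

```python
import numpy as np

rng = np.random.default_rng(1)

T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

def predict_noise(x_t, t):
    """Stand-in for the trained network (a U-Net in the paper).
    A real model would estimate the static hiding in x_t; this toy
    just returns zeros so the sketch runs on its own."""
    return np.zeros_like(x_t)

def denoise_step(x_t, t):
    """One step of the reverse process: subtract the predicted noise,
    rescale, and (for every step but the last) add back a small amount
    of fresh noise to keep the process probabilistic."""
    eps = predict_noise(x_t, t)
    mean = (x_t - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps) / np.sqrt(alphas[t])
    if t > 0:
        mean = mean + np.sqrt(betas[t]) * rng.standard_normal(x_t.shape)
    return mean

x = rng.standard_normal(64)   # start from pure static
x = denoise_step(x, T - 1)    # peel away one layer
```

Run 1,000 times with a real trained `predict_noise`, this loop is exactly the "tidying" that turns static into a cat.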

2. The Secret Sauce: Predicting the "Noise"

How does the AI know what to clean?

In the past, people tried to teach the AI to predict the final image directly from the noise. But the authors found a smarter way. Instead of asking, "What is the cat?", they ask, "What is the noise?"

Imagine you are looking at a muddy window.

  • Old way: Try to guess the exact shape of the tree outside.
  • New way (This paper): Look at the mud on the glass and guess exactly where the mud is and how thick it is.

Once the AI knows exactly where the "mud" (noise) is, it can simply subtract it. If you know the noise, you can easily find the picture. The paper shows that training the AI to predict the noise is much easier and leads to much better pictures.
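The "guess the mud" training recipe boils down to a simple mean-squared error, and the "subtract the mud" trick is a one-line formula. This sketch assumes the same toy schedule as before; `model` is any function standing in for the network.

```python
import numpy as np

rng = np.random.default_rng(2)

T = 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bars = np.cumprod(1.0 - betas)

def training_loss(model, x0):
    """The paper's simplified objective: smudge a clean image to a random
    step t, then score the model on how well it guesses that exact noise."""
    t = rng.integers(T)
    noise = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * noise
    predicted = model(x_t, t)
    return np.mean((noise - predicted) ** 2)  # mean squared error on the noise

def recover_x0(x_t, noise, t):
    """If you know the mud exactly, the clean window follows in closed form."""
    return (x_t - np.sqrt(1.0 - alpha_bars[t]) * noise) / np.sqrt(alpha_bars[t])
```

Note how `recover_x0` is just the one-shot noising formula solved backwards for the clean image: knowing the noise really is enough to find the picture.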

3. The "Progressive Reveal" (Like a Slow-Motion Movie)

One of the coolest things about this model is how it generates images. It's not like a printer that spits out a finished page instantly. It's more like a time-lapse video of a sculpture being carved.

  • Start: A blob of random static.
  • Middle: You start to see vague shapes. Maybe a round head, two ears. It's like looking at a cloud that vaguely looks like a face.
  • End: The details sharpen. You see the whiskers, the fur texture, the eyes.

The paper calls this "progressive lossy decompression." It's like a progressive JPEG: a blurry preview arrives first and sharpens as more data streams in. The AI builds the image from the "big picture" down to the tiny details.
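The time-lapse above is just the cleaning step run in a loop, saving snapshots along the way. As before, `predict_noise` is a zero-returning stand-in for the trained U-Net, so this runs on its own but produces static rather than cats.

```python
import numpy as np

rng = np.random.default_rng(3)

T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

def predict_noise(x_t, t):
    # Stand-in for the trained U-Net; returns zeros to stay self-contained.
    return np.zeros_like(x_t)

def generate(shape=(64,), snapshot_every=250):
    """Run the full reverse process from static to 'image', keeping
    intermediate frames for the time-lapse."""
    x = rng.standard_normal(shape)        # start: a blob of random static
    frames = []
    for t in reversed(range(T)):
        eps = predict_noise(x, t)
        x = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps) / np.sqrt(alphas[t])
        if t > 0:
            x = x + np.sqrt(betas[t]) * rng.standard_normal(shape)
        if t % snapshot_every == 0:
            frames.append(x.copy())       # middle frames: the vague shapes
    return x, frames

final, frames = generate()
print(len(frames))  # → 4 (snapshots at t = 750, 500, 250, 0)
```

The early frames in `frames` correspond to the "cloud that vaguely looks like a face" stage; the last frame is the finished image.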

4. Why is this a Big Deal?

Before this paper, other AI models (like GANs) were great at making pictures, but they were hard to train and sometimes unstable (training could collapse, or the model would get stuck producing the same few images over and over).

This paper showed that:

  1. The pictures are stunning: On the standard CIFAR10 benchmark, their model produced images that were sharper and more realistic than almost anything else at the time, beating the previous champions on widely used quality scores.
  2. It's stable: The training process is much more reliable. It doesn't "fight" like other models; it just learns to clean up noise.
  3. It's versatile: Because the math is so clean, it can be used for many things, not just cats.

The Bottom Line

Think of this paper as teaching a computer to be a master restorer. Instead of trying to paint a masterpiece from scratch, the computer learns to look at a ruined, noisy canvas and perfectly reconstruct the original masterpiece by knowing exactly what the noise looks like and removing it.

The result? A computer that can generate incredibly realistic images, one tiny step of "cleaning" at a time.