Improving Denoising Diffusion Models via Simultaneous Estimation of Image and Noise

Imagine you are trying to teach a robot to draw a picture of a horse.

The Old Way (Traditional Diffusion Models):
Currently, most AI art generators work like a game of "Telephone" played in reverse.

The Mess: You start with a canvas completely covered in static, like the "snow" on an old TV.
The Guess: The AI has to guess what the picture looks like underneath the noise. But here's the catch: in the beginning, the noise is so loud that the AI is essentially guessing in the dark. It has to take hundreds or even thousands of tiny, cautious steps to slowly peel away the static until a horse appears.
The Problem: This process is slow. It's like trying to find a needle in a haystack by moving one grain of hay at a time. Also, the math the AI uses to "peel" the noise gets very messy and unstable at the very start and very end of the process, forcing it to take even more steps to get it right.

The New Way (This Paper's Solution):
The authors, Zhenkai Zhang and colleagues, came up with a smarter way to teach the robot. They introduced two main tricks:

Trick 1: The "Smooth Slide" (Better Math)

Imagine the old method was like walking down a staircase where the first and last steps are missing. You have to jump or stumble to get on or off, which is clumsy and slow.

The authors redesigned the "stairs" into a smooth, curved slide.

Instead of using a standard math formula that gets messy at the start and finish, they used a special angle-based formula (like moving along a quarter-circle arc).
Why it helps: This removes the "stumbling blocks" (singularities). Now, the AI can slide smoothly from pure noise to a clear image. Because the path is so smooth, the AI can take bigger, faster steps (using advanced math tools called Runge-Kutta solvers) without falling off the track. It's the difference between walking carefully on a rocky path and gliding down a smooth slide.

Trick 2: The "Two-Eyed Detective" (Simultaneous Estimation)

In the old method, the AI had to choose: "Do I guess what the noise is, or do I guess what the picture is?"

If it guesses only the noise, it's great near the end of generation, when the picture is mostly clear, but terrible at the beginning, when there is almost nothing but static.
If it guesses only the picture, it does well while the image is still visible through light noise, but gets confused once the noise takes over.

The authors' new model is like a detective who keeps one eye on each clue.

It looks at the messy image and simultaneously guesses: "Okay, I think the noise is this, and the underlying picture is that."
By doing both at the same time, the AI gets a much better "map" of where it needs to go. It knows exactly how much to subtract (the noise) and how much to keep (the image) at every single moment. This makes the process much more stable and accurate.

The Result: Faster and Sharper

Because of these two tricks:

Speed: The AI generates high-quality images much faster. In the paper, they showed that their model could turn pure noise into a recognizable horse in about 150 steps, whereas the old models needed 400 to 500 steps to get the same result. That is roughly three times fewer steps.
Quality: The images are clearer and more detailed, even when the AI is forced to take fewer steps.
Efficiency: The model learns faster during training, needing fewer "practice runs" to become an expert.

In a Nutshell:
The authors took a slow, clunky process of "cleaning up noise" and turned it into a smooth, high-speed slide where the AI acts like a super-smart detective, estimating the noise and the image at the same time. The result? You get beautiful, realistic art in a fraction of the time it used to take.

1. Problem Statement

Diffusion models have achieved state-of-the-art results in image generation but face three primary limitations:

Inference Efficiency: Traditional noise-based models (e.g., DDPM) require a large number of sampling steps to transition from pure noise to high-quality images. The initial stages of sampling are particularly inefficient due to the difficulty of learning from noise-dominated data.
Training/Estimation Trade-offs:
- Noise-based models (predicting $\epsilon$) struggle in early sampling stages where the signal is weak.
- Image-based models (predicting $x_0$ directly) struggle at large $t$, late in the forward process, where the input is dominated by noise, making direct image estimation difficult and unstable.
Mathematical Singularities: The standard parameterization of the diffusion process ($\sqrt{\bar{\alpha}_t}x_0 + \sqrt{1-\bar{\alpha}_t}\epsilon$) creates singularities at the boundaries ($t=0$ and $t=T$) when the trajectory is differentiated, limiting the effectiveness of higher-order Ordinary Differential Equation (ODE) solvers.
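
One way to see where such boundary singularities can come from is to differentiate the standard parameterization with respect to $\bar{\alpha}_t$. This is an illustrative sketch under that assumption, not a derivation taken from the paper:

```latex
\frac{\partial x_t}{\partial \bar{\alpha}_t}
  = \frac{1}{2\sqrt{\bar{\alpha}_t}}\, x_0
  \;-\; \frac{1}{2\sqrt{1-\bar{\alpha}_t}}\, \epsilon
```

The first coefficient diverges as $\bar{\alpha}_t \to 0$ (near $t=T$) and the second as $\bar{\alpha}_t \to 1$ (near $t=0$), so an ODE solver that relies on this derivative is poorly behaved at both ends.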

2. Methodology

The authors propose a novel framework that combines the strengths of noise-targeted and image-targeted training while reformulating the mathematical underpinnings of the diffusion process.

A. Reparameterization via Angular Mapping

The authors replace the standard square-root parameterization with an angular parameterization based on a quarter-circular arc.

Original: $x_t = \sqrt{\bar{\alpha}_t}x_0 + \sqrt{1-\bar{\alpha}_t}\epsilon$
Proposed: $x_t = \cos(\eta_t)x_0 + \sin(\eta_t)\epsilon$, where $\eta_t = \frac{t}{T}\cdot\frac{\pi}{2}$.
Benefit: This mapping eliminates the singularities found in the derivative of the standard formula at $t=0$ and $t=T$. Consequently, the reverse diffusion process can be expressed as a well-behaved ODE, enabling the use of high-order solvers like Runge-Kutta (RK2, RK4) instead of simple Euler steps.
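
As a minimal sketch of the angular mapping (not the authors' code), the snippet below noises an image at step $t$ and evaluates the derivative of the trajectory with respect to $\eta$. The function names `eta`, `forward_noise`, and `trajectory_derivative`, the random stand-in image, and the choice of NumPy are all assumptions made for illustration.

```python
import numpy as np

def eta(t, T):
    """Angle eta_t = (t / T) * (pi / 2): 0 at t=0 (clean image), pi/2 at t=T (pure noise)."""
    return (t / T) * (np.pi / 2.0)

def forward_noise(x0, eps, t, T):
    """Angular forward process: x_t = cos(eta_t) * x0 + sin(eta_t) * eps."""
    e = eta(t, T)
    return np.cos(e) * x0 + np.sin(e) * eps

def trajectory_derivative(x0, eps, t, T):
    """dx_t / d(eta) = -sin(eta_t) * x0 + cos(eta_t) * eps; bounded for every t in [0, T]."""
    e = eta(t, T)
    return -np.sin(e) * x0 + np.cos(e) * eps

rng = np.random.default_rng(0)
x0 = rng.uniform(-1.0, 1.0, size=(3, 8, 8))   # hypothetical image scaled to [-1, 1]
eps = rng.standard_normal(x0.shape)           # Gaussian noise of the same shape
T = 1000                                      # assumed number of diffusion steps

x_start = forward_noise(x0, eps, 0, T)        # endpoint t=0: the clean image
x_end = forward_noise(x0, eps, T, T)          # endpoint t=T: pure noise (up to rounding)
d_start = trajectory_derivative(x0, eps, 0, T)
d_end = trajectory_derivative(x0, eps, T, T)
```

Because the coefficients are a sine and a cosine, both they and their derivatives stay within $[-1, 1]$ along the whole path, which is what makes larger solver steps plausible.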

B. Simultaneous Estimation of Image and Noise

Instead of training the network to predict either the noise ($\epsilon$) or the clean image ($x_0$), the model is trained to predict both simultaneously.

Loss Function: The objective combines the reconstruction errors for both targets, where $R_\theta$ denotes the network's image estimate and $\epsilon_\theta$ its noise estimate:
$\min_\theta \mathbb{E} [\|R_\theta(x_t, t) - x_0\| + \|\epsilon_\theta(x_t, t) - \epsilon\|]$
Rationale:
- In early stages (high noise), the image prediction provides meaningful structural guidance.
- In later stages (low noise), the noise prediction refines the details.
- Joint estimation stabilizes the gradient calculations across all timesteps.
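
The joint objective can be sketched numerically as follows. This is an illustration under stated assumptions rather than the paper's implementation: the summary does not say which norm is used or whether it is squared, so the sketch assumes an unsquared per-sample L2 norm averaged over the batch, and the arrays passed as `x0_hat` and `eps_hat` stand in for the outputs of $R_\theta$ and $\epsilon_\theta$. The function name `joint_loss` is hypothetical.

```python
import numpy as np

def joint_loss(x0_hat, eps_hat, x0, eps):
    """Batch mean of ||x0_hat - x0|| + ||eps_hat - eps||, using per-sample L2 norms (assumed)."""
    batch = x0.shape[0]
    image_err = np.linalg.norm((x0_hat - x0).reshape(batch, -1), axis=1)
    noise_err = np.linalg.norm((eps_hat - eps).reshape(batch, -1), axis=1)
    return float(np.mean(image_err + noise_err))

rng = np.random.default_rng(0)
x0 = rng.uniform(-1.0, 1.0, size=(4, 3, 8, 8))   # hypothetical batch of clean images
eps = rng.standard_normal(x0.shape)              # noise drawn for the same batch

# Both heads exact: the loss vanishes.
loss_perfect = joint_loss(x0, eps, x0, eps)

# Image head exact, noise head predicting zeros: only the noise term contributes.
loss_noise_wrong = joint_loss(x0, np.zeros_like(eps), x0, eps)
```

Both heads receive a training signal from every sample, so an error in either estimate shows up in the loss regardless of how noisy the input is.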

C. Gradient-Based Sampling

The authors conceptualize the reverse diffusion process as an iterative optimization problem using gradient descent.

They derive the ground-truth gradient of the trajectory and the estimated gradient based on the network's predictions.
A gradient loss term, weighted by $\gamma$, is added to the training objective to align the estimated trajectory derivative $\hat{\dot{x}}$ with the true derivative $\dot{x}$ of the ODE flow:
$\min_\theta \mathbb{E} [\|R_\theta - x_0\| + \|\epsilon_\theta - \epsilon\| + \gamma\|\hat{\dot{x}} - \dot{x}\|]$
During inference, the update step utilizes these gradients (potentially with Runge-Kutta methods) to move more accurately and stably from noise to image.
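
A rough sketch of such a sampler is shown below, using the midpoint (RK2) rule. Several things here are assumptions and not confirmed by the summary: the estimated derivative is taken to be $-\sin(\eta)\hat{x}_0 + \cos(\eta)\hat{\epsilon}$, the integration variable is $\eta$ and not $t$, and the names `estimated_derivative`, `sample_rk2`, and `oracle` are hypothetical. The `oracle` returns the true image and noise so the example runs without a trained network; a real model would predict both from `x` and `eta`.

```python
import numpy as np

def estimated_derivative(predict, x, eta):
    """Assumed estimate of dx/d(eta): -sin(eta) * x0_hat + cos(eta) * eps_hat."""
    x0_hat, eps_hat = predict(x, eta)
    return -np.sin(eta) * x0_hat + np.cos(eta) * eps_hat

def sample_rk2(predict, x_T, num_steps):
    """Integrate from eta = pi/2 (noise) down to eta = 0 (image) with the midpoint rule."""
    etas = np.linspace(np.pi / 2.0, 0.0, num_steps + 1)
    x = x_T
    for eta_now, eta_next in zip(etas[:-1], etas[1:]):
        h = eta_next - eta_now                        # negative: stepping toward the image
        k1 = estimated_derivative(predict, x, eta_now)
        x_mid = x + 0.5 * h * k1                      # half step using the slope at the start
        k2 = estimated_derivative(predict, x_mid, eta_now + 0.5 * h)
        x = x + h * k2                                # full step using the midpoint slope
    return x

rng = np.random.default_rng(0)
x0_true = rng.uniform(-1.0, 1.0, size=(3, 8, 8))      # hypothetical target image
eps_true = rng.standard_normal(x0_true.shape)

def oracle(x, eta):
    """Stand-in for a trained network: returns the exact image and noise."""
    return x0_true, eps_true

x_T = eps_true                                        # at eta = pi/2 the sample is pure noise
x_rec = sample_rk2(oracle, x_T, num_steps=20)
max_err = float(np.max(np.abs(x_rec - x0_true)))      # small when the predictions are exact
```

With exact predictions the only remaining error is the discretization error of the solver, which is why a smooth trajectory lets a higher-order method get away with few steps.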

3. Key Contributions

Novel Noise Scheduler & Parameterization: The introduction of the $\cos(\eta)x_0 + \sin(\eta)\epsilon$ parameterization removes mathematical singularities, allowing for smoother diffusion evolution and the application of higher-order ODE solvers.
Dual-Target Training: A new training paradigm that simultaneously estimates the clean image and the noise. This overcomes the limitations of single-target models, providing better stability and control throughout the entire sampling trajectory.
Gradient-Enhanced Sampling: The integration of gradient information into the loss function and sampling steps improves the convergence speed and stability of the generation process.

4. Experimental Results

The model was evaluated on the CIFAR-10, CelebA, and LSUN (Church) datasets against the baselines DDPM, DDIM, and Cold Diffusion.

Quality Metrics: The proposed model outperformed baselines in Fréchet Inception Distance (FID), spatial FID (sFID), Precision, and Recall.
- Example: On CIFAR-10 with 50 steps, the proposed model achieved an FID of 4.57 (lower is better), compared to DDIM's 7.08 and DDPM's 5.62.
Convergence Speed:
- The model converges to high-quality images significantly faster. Visual analysis showed that the proposed model could identify a clear object (e.g., a "horse") in ~150 steps, whereas DDPM/DDIM required ~400–500 steps for the same clarity.
- Training Efficiency: On the LSUN dataset, the proposed model achieved performance comparable to DDPM/DDIM with only 1.13 million iterations, whereas the baselines required over 4.4 million iterations, roughly four times as many.
Ablation Studies:
- Combining the new noise schedule ($\beta^*$ or $\sin(\cdot)$) with simultaneous estimation ($\hat{x}_0, \hat{\epsilon}$) yielded the best results.
- Simultaneous estimation alone improved performance at low step counts (<20), while the new noise schedule was crucial for maintaining performance at higher step counts by balancing the loss contributions.

5. Significance

This work addresses the critical bottleneck of inference time in diffusion models without sacrificing quality. By reformulating the diffusion process as a singularity-free ODE and leveraging dual-target learning, the authors enable:

Faster Generation: High-quality images can be generated with fewer sampling steps.
Reduced Training Costs: The model converges faster during training, reducing the computational resources required for large-scale datasets.
Improved Stability: The simultaneous estimation approach provides a more robust gradient signal, making the generation process less prone to artifacts and instability, particularly in the early and late stages of sampling.

The code is publicly available, facilitating further research into efficient diffusion sampling strategies.

