Aligning Few-Step Diffusion Models with Dense Reward Difference Learning

This paper proposes Stepwise Diffusion Policy Optimization (SDPO), a reinforcement learning framework that aligns few-step diffusion models with downstream objectives. SDPO combines dual-state trajectory sampling with latent-similarity-based dense reward prediction, enabling low-variance, step-by-step policy updates and stronger reward-aligned image synthesis.

Ziyi Zhang, Li Shen, Sen Zhang, Deheng Ye, Yong Luo, Miaojing Shi, Dongjing Shan, Bo Du, Dacheng Tao

Published 2026-03-02

Imagine you are teaching a talented but impatient artist to paint a masterpiece.

The Artist: This is a Few-Step Diffusion Model. Normally, these artists are trained to take 50 or 100 tiny, careful strokes to turn a blank canvas into a beautiful picture. But we want them to be faster. We want them to finish the painting in just 1, 2, or 4 strokes.

The Problem: When you force a painter to work that fast, they get messy. They might produce a blurry mess or a picture that doesn't look like what you asked for. And if you try to teach them with standard reinforcement-learning methods, it's like having a teacher who only judges the final painting after the artist is done.

  • If the artist makes a mistake in the very first stroke, the teacher doesn't notice until the end. By then, the whole painting is ruined, and the artist doesn't know which stroke caused the problem.
  • With only 1, 2, or 4 steps, the reward arrives just once, at the very end, and must be credited across a handful of large strokes. Each training signal is therefore sparse and high-variance: the teacher's feedback is noisy, the artist gets frustrated, and neither can learn effectively.

The Solution: SDPO (Stepwise Diffusion Policy Optimization)
The authors of this paper invented a new teaching method called SDPO. Think of it as a "Super-Teacher" with a special set of tools.

1. The "Dual-State" Glasses (Seeing Two Things at Once)

Normally, a teacher only sees the messy, half-finished canvas (the "noisy state").
SDPO gives the teacher a pair of special glasses. Through these glasses, the teacher can see two things simultaneously:

  1. The messy canvas right now.
  2. A crystal-clear preview of what the final painting would look like if the artist stopped right there.

This is huge. Even if the artist only made one stroke, the teacher can see a "ghost image" of the final result. This allows the teacher to give feedback immediately after that first stroke, rather than waiting until the end.
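The "crystal-clear preview" has a concrete counterpart in diffusion models: from any noisy latent, the model's own noise prediction yields a one-step estimate of the final clean sample. Here is a minimal NumPy sketch of that standard identity (function and variable names are mine, not the paper's):

```python
import numpy as np

def predict_clean_preview(x_t, eps_pred, alpha_bar_t):
    """One-step estimate of the clean sample x0 from a noisy latent x_t,
    given the model's noise prediction eps_pred.

    Uses the standard diffusion identity
        x0_hat = (x_t - sqrt(1 - alpha_bar_t) * eps) / sqrt(alpha_bar_t),
    which is what lets a "ghost image" of the final result be previewed
    at any intermediate step.
    """
    return (x_t - np.sqrt(1.0 - alpha_bar_t) * eps_pred) / np.sqrt(alpha_bar_t)

# Toy check: build x_t from a known x0 and noise eps; if eps_pred equals
# the true noise, the preview recovers x0 exactly.
rng = np.random.default_rng(0)
x0 = rng.normal(size=4)
eps = rng.normal(size=4)
alpha_bar = 0.6
x_t = np.sqrt(alpha_bar) * x0 + np.sqrt(1 - alpha_bar) * eps
preview = predict_clean_preview(x_t, eps, alpha_bar)
print(np.allclose(preview, x0))  # True
```

In practice the preview is only as good as the model's noise prediction, but even an imperfect preview gives the reward model something meaningful to score at every step.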

2. The "Smart Guessing" Strategy (Dense Reward Prediction)

Giving feedback on every single stroke is expensive and slow (like hiring a critic to look at the painting 50 times).
SDPO uses a clever trick:

  • The teacher only looks closely at three specific moments: the very first stroke, the very last stroke, and one randomly chosen moment in the middle.
  • For all the other moments in between, the teacher uses Latent Similarity. Imagine the teacher knows that if the painting looks 80% like the first stroke and 20% like the last, the quality must be somewhere in between. They "guess" the score for the middle steps based on how similar they look to the ones they actually checked.
  • This saves time and money while still giving the artist a constant stream of feedback.
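The interpolation idea above can be sketched in a few lines: score a few anchor steps with the real reward model, then give every other step a similarity-weighted blend of the anchor rewards. This is an illustrative sketch only; the paper's exact similarity measure and weighting may differ, and all names here are mine.

```python
import numpy as np

def interpolate_dense_rewards(latents, anchor_idx, anchor_rewards):
    """Assign a reward to every step via similarity to a few evaluated anchors.

    latents:        (T, D) array of per-step latents (flattened).
    anchor_idx:     indices of steps whose rewards were actually computed.
    anchor_rewards: rewards at those steps.

    Each unevaluated step gets a softmax-over-cosine-similarity weighted
    average of the anchor rewards, so steps that "look like" a good anchor
    inherit a good score.
    """
    normed = latents / np.linalg.norm(latents, axis=1, keepdims=True)
    anchors = normed[anchor_idx]                   # (K, D)
    sims = normed @ anchors.T                      # (T, K) cosine similarities
    weights = np.exp(10.0 * sims)                  # temperature sharpens weights
    weights /= weights.sum(axis=1, keepdims=True)
    dense = weights @ np.asarray(anchor_rewards)   # (T,) interpolated rewards
    dense[anchor_idx] = anchor_rewards             # keep exact values at anchors
    return dense

# Toy usage: 6 steps, rewards computed only at steps 0, 3, and 5.
rng = np.random.default_rng(1)
latents = rng.normal(size=(6, 8))
dense = interpolate_dense_rewards(latents, [0, 3, 5], [0.2, 0.6, 0.9])
```

Because each interpolated reward is a convex combination of the anchors, it always falls between the lowest and highest anchor scores, which keeps the guessed feedback conservative.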

3. The "Difference" Lesson (Reward Difference Learning)

Instead of just saying "This painting is a 7 out of 10," SDPO teaches the artist by comparison.

  • The teacher shows the artist two paintings side-by-side: one that turned out great and one that turned out bad.
  • The teacher says, "Look at the difference in quality between these two. Now, look at the difference in your brushstrokes. Change your strokes to match the difference in quality."
  • This "difference learning" is much more stable and less confusing for the artist than trying to hit a perfect score from scratch.

4. The "Shuffled" Practice (Step-Shuffled Updates)

Usually, artists practice in order: Stroke 1, then Stroke 2, then Stroke 3.
SDPO mixes this up: each practice round, the artist works on the strokes in a fresh random order, for example Stroke 3, then Stroke 1, then Stroke 2.

  • Why? This prevents the artist from getting "stuck" in a routine where they only learn to fix mistakes in a specific sequence. It makes the learning more robust and adaptable.
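In code, step shuffling is as simple as permuting the denoising-step indices before each round of updates. A minimal sketch (the real training loop would interleave this with sampling and gradient steps):

```python
import random

def step_shuffled_order(num_steps, seed=None):
    """Return a random permutation of denoising-step indices for one update.

    Instead of always optimizing steps 0, 1, 2, ... in order, each training
    iteration visits them in a fresh random order, so no single step's
    update depends on a fixed sequence of earlier updates.
    """
    order = list(range(num_steps))
    random.Random(seed).shuffle(order)
    return order

# Toy usage: three iterations over a 4-step model, each with its own order.
for it in range(3):
    print(f"iteration {it}: update steps in order {step_shuffled_order(4, seed=it)}")
```

Every permutation still visits each step exactly once per round, so no step is starved of updates; only the order varies.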

The Result

When you put all these tools together, the "impatient artist" (the few-step model) learns incredibly fast.

  • Before: The artist made blurry, low-quality images when forced to work in 1 or 2 steps.
  • After (with SDPO): The artist produces high-quality, sharp images even in just 1 or 2 steps, closely matching what the user asked for.

In a nutshell: SDPO is a smarter way to train fast AI artists. Instead of waiting until the end to critique their work, it gives them instant, detailed, and comparative feedback on every single step, helping them master the art of "speed painting" without losing quality.