Imagine you are teaching a talented but impatient artist to paint.
The Artist: This is a "few-step" AI model. Unlike traditional AI that takes 100 slow, careful strokes to paint a picture, this artist can create a masterpiece in just 4 quick strokes. They are incredibly fast and efficient, which is great for real-world apps.
The Problem: The artist is fast, but they aren't always perfect. They might draw a cat with six legs, or write "Hello" as "Helo." To fix this, we want to give them feedback.
The Old Way (The Broken Feedback Loop):
Previously, to teach an AI, you needed a teacher who could explain exactly how to fix a mistake mathematically. If the artist drew a cat with six legs, the teacher had to say, "Move the leg 2 pixels to the left."
But in the real world, feedback isn't always that precise. Sometimes the feedback is just: "I like this picture" or "I hate this picture." Or, "There are too many dogs."
The old AI methods couldn't learn from these simple "Yes/No" or "Good/Bad" signals because they couldn't mathematically trace the error back through the painting process. In AI terms, the reward is non-differentiable: there is no gradient to follow, so the models were stuck waiting for a perfect mathematical explanation that didn't exist.
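The core obstacle is easy to demonstrate outside of any particular paper: a bare Yes/No reward has zero gradient almost everywhere, so gradient-based learning gets no signal from it. A minimal toy sketch (the function names are made up for illustration):

```python
def binary_reward(x):
    # The coach's verdict: 1.0 ("I like it") or 0.0 ("I don't").
    # This is a step function -- flat everywhere except one jump.
    return 1.0 if x > 0.5 else 0.0

def numeric_gradient(f, x, eps=1e-6):
    # Central finite difference: an estimate of df/dx at x.
    return (f(x + eps) - f(x - eps)) / (2 * eps)

# Away from the jump, the gradient of a Yes/No reward is exactly zero,
# so gradient descent receives no hint about which way to move.
assert numeric_gradient(binary_reward, 0.2) == 0.0
assert numeric_gradient(binary_reward, 0.9) == 0.0
```

This is exactly the "stuck waiting for a mathematical explanation" problem: the feedback exists, but it carries no usable slope.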
The New Solution: TDM-R1 (The "Smart Coach")
The authors of this paper created a new training method called TDM-R1. Think of it as a revolutionary coaching system that solves the "imprecise feedback" problem. Here is how it works, using a simple analogy:
1. The "Deterministic Path" (The Fixed Blueprint)
Most fast painters work in a chaotic way; if you ask them to paint a dog, they might start with a different random sketch every time. This makes it hard to know which specific stroke caused the mistake.
TDM-R1 forces the artist to follow a strict, predictable blueprint. Every time they paint a dog, they start from the exact same messy sketch and follow the exact same path to the final image.
- Why this matters: Because the path is fixed, the coach can look at the painting at every single step (not just the end) and say, "Ah, at step 2, you started drawing the ear wrong." This turns a vague "Bad picture" into specific, actionable advice for every step of the process.
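The "fixed blueprint" idea can be sketched as any sampler that injects no fresh randomness: the starting state is derived from the prompt itself, and every later step is a fixed function of the previous one, so the same prompt always replays the identical step-by-step trajectory. This toy sketch is an illustrative invention (hash-based seeding, a linear-congruential update), not the paper's actual sampler:

```python
import hashlib

def seed_from_prompt(prompt: str) -> int:
    # Derive a fixed starting "messy sketch" from the prompt, so every
    # run of the same prompt begins from the identical state.
    return int(hashlib.sha256(prompt.encode()).hexdigest(), 16) % (2**32)

def few_step_sample(prompt: str, steps: int = 4) -> list:
    # Toy deterministic 4-step "painting": each step is a fixed
    # function of the previous state -- no new randomness injected.
    state = seed_from_prompt(prompt)
    trajectory = [state]
    for _ in range(steps):
        state = (state * 1664525 + 1013904223) % (2**32)  # LCG update
        trajectory.append(state)
    return trajectory

# The same prompt always yields the same trajectory, so a coach can
# point at one specific step and say "this is where it went wrong".
assert few_step_sample("a dog") == few_step_sample("a dog")
```

The payoff is inspectability: because step 2 is always the same step 2, per-step credit assignment becomes possible.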
2. The "Surrogate Reward" (The Translator)
The coach still only speaks in simple "Good/Bad" signals (non-differentiable rewards). But the artist speaks in complex math.
TDM-R1 introduces a Translator (called the Surrogate Reward).
- The Coach says: "I like the dog with the blue collar."
- The Translator watches the artist's 4-step process and learns to say: "Hey artist, when you are at step 2, if you make the collar blue, you get a 'Good' score. If you make it red, you get a 'Bad' score."
- The Translator learns this by watching groups of paintings, figuring out which steps lead to the "Good" outcomes and which lead to "Bad" ones.
3. The "Dynamic Loop" (The Evolving Partnership)
Here is the magic trick: The Translator and the Artist learn together.
- The Artist tries to paint better to please the Translator.
- As the Artist gets better, the Translator gets smarter at spotting tiny details.
- They keep pushing each other. It's like a dance where the music gets faster and the steps get more complex, but they never lose the rhythm.
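The co-training loop above can be compressed into a deliberately tiny one-dimensional caricature: the artist samples attempts, the coach returns only Good/Bad verdicts, the translator refits its smooth reward to the Good examples, and the artist then moves toward where the translator points. This is a cartoon of alternating optimization, not the paper's algorithm:

```python
import random

random.seed(1)

def coach(x):
    # The coach's true (non-differentiable) taste:
    # Good only when the "style" x is close to 0.7.
    return 1 if abs(x - 0.7) < 0.1 else 0

artist = 0.2   # the artist's current one-number "style"
center = 0.5   # the translator's current guess at what the coach likes

for _ in range(50):
    # Artist explores around its current style; coach judges each try.
    tries = [min(1.0, max(0.0, artist + random.gauss(0, 0.1)))
             for _ in range(20)]
    verdicts = [coach(x) for x in tries]
    # Translator step: refit its smooth reward to the Good examples.
    good = [x for x, v in zip(tries, verdicts) if v]
    if good:
        center = sum(good) / len(good)
    # Artist step: move toward where the translator says rewards are high.
    artist += 0.5 * (center - artist)
```

Neither party ever sees a gradient from the coach, yet the pair still converges on what the coach likes: that is the alternating "dance" in miniature.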
The Results: Why is this a big deal?
The paper shows that this method is a game-changer:
- Speed vs. Quality: Usually, you have to choose between speed (4 steps) and quality (100 steps). TDM-R1 proved you can have both. Their 4-step model became better than the slow, 100-step models.
- Real-World Skills: They tested it on hard tasks like:
  - Counting: "Draw 5 dogs." (The old models often drew 3 or 7). The new model got it right almost every time.
  - Text: "Write 'TDM-R1' on a sign." The new model spelled it perfectly.
  - Positioning: "Put a cat to the right of a dog." The new model understood the spatial relationship perfectly.
The Bottom Line
Before this paper, if you wanted an AI to learn from simple human feedback (like "I like this" or "Count the apples"), you had to use slow, expensive AI models.
TDM-R1 is like giving a super-fast, 4-step artist a "superpower" to understand simple human feedback. It allows them to learn from real-world preferences without needing a math genius to explain every mistake. The result is an AI that is fast, cheap, and incredibly smart, capable of following complex instructions better than even the slowest, most expensive models.