Diffusion Fine-Tuning via Reparameterized Policy Gradient of the Soft Q-Function

Imagine you have a master chef (the Diffusion Model) who can cook incredibly delicious, realistic-looking meals from scratch. This chef has been trained on millions of recipes and knows exactly how to make a perfect steak or a beautiful cake.

However, you have a specific goal: you want the chef to make a dish that scores a 10/10 on a specific "Taste Test" (the Reward), like "most colorful" or "most appetizing."

The Problem: The "Over-Optimized" Chef

If you just tell the chef, "Make me the highest-scoring dish possible," and let them try to game the system, something weird happens.

Instead of making a beautiful, delicious steak, the chef might start serving you a plate of neon-colored, glowing rocks.

Why? Because the rocks technically score a 10/10 on "colorfulness," but they aren't food anymore.
The Result: The chef loses their ability to make real food. The dishes become weird, repetitive, and lose their natural charm. In the paper's world, this is called "Reward Over-Optimization," and it shows up as "Semantic Collapse" (the dish no longer matches the order) and "Diversity Collapse" (every dish looks the same). The model chases the score so hard it forgets how to make anything real.

The Solution: SQDF (Soft Q Diffusion Finetuning)

The authors of this paper propose a new way to teach the chef, called SQDF. Think of it as a smart, gentle coach who guides the chef without forcing them to break the rules of reality.

Here is how SQDF works, broken down into three simple concepts:

1. The "One-Step Crystal Ball" (The Soft Q-Function)

Usually, to know if a dish will taste good, you have to cook the whole thing, taste it, and then try to figure out which ingredient you added wrong. This is slow and confusing.

SQDF uses a special trick called a Consistency Model (think of it as a Crystal Ball).

Instead of cooking the whole meal, the chef takes a half-cooked pot (a noisy image) and uses the Crystal Ball to instantly guess what the final dish would look like if they finished it right now.
The coach looks at this "guess," checks the Taste Score, and says, "Hey, if you tweak this one ingredient right now, the final dish will be better."
The Magic: This allows the chef to get feedback at every step without having to finish the whole meal first. It's like getting a cheat sheet that tells you exactly how to improve your next move.

2. The "Discounted Score" (The Discount Factor)

Imagine the cooking process has 50 steps.

Step 1: You add salt to a raw, unrecognizable blob.
Step 50: You plate the final steak.

If you tell the chef, "Every step matters equally," the chef might get confused about Step 1. "Does adding salt to the raw blob really matter?"

SQDF introduces a Discount Factor. It tells the chef:

"The steps near the end matter the most for the final taste. The steps you took way back at the beginning matter a little less."

This stops the chef from wasting energy trying to perfect the very first step of the process, focusing their effort where it actually counts. It's like a coach saying, "Don't worry about the warm-up; focus on the final sprint."

3. The "Tasting Menu" (The Replay Buffer)

Sometimes, a chef makes a masterpiece by accident. If you only let them cook that one specific dish over and over, they might forget how to make anything else.

SQDF uses a Replay Buffer, which is like a Tasting Menu of past successes.

The coach saves the best dishes the chef has ever made (high scores) and the most different dishes (high diversity).
When training, the chef practices on this menu, mixing the "best" with the "most unique."
The Result: The chef gets better at scoring high points but doesn't forget how to make a variety of different, natural-looking dishes. They don't just become a "Rock Maker"; they become a "Master of Colorful, Delicious Food."

The Big Picture

In the real world of AI, this method helps computers generate images (like pictures of cats or landscapes) that:

Look exactly like what you asked for (High Alignment).
Look beautiful and natural (High Quality).
Don't all look the same (High Diversity).

Without SQDF, AI models often get "greedy" and start making weird, repetitive garbage just to get a high score. With SQDF, the AI learns to be a smart optimizer that improves its skills without losing its soul.

In short: SQDF is the art of teaching an AI to be excellent at a specific task without turning it into a robot that only knows how to do that one thing in a weird, broken way. It keeps the AI's output creative, diverse, and natural.

1. Problem Statement

Diffusion models have achieved state-of-the-art performance in generative tasks (e.g., text-to-image synthesis). However, aligning these pre-trained models with specific downstream objectives (such as aesthetic quality or human preference) remains challenging.

Reward Over-Optimization: Existing fine-tuning methods, including Reinforcement Learning (RL) approaches (e.g., PPO/DDPO) and direct backpropagation methods (e.g., DRaFT, ReFL), often suffer from "reward over-optimization." This leads to semantic collapse (loss of prompt alignment) and diversity collapse (samples converging to similar, unnatural patterns) as the model chases high reward scores.
Limitations of Current Solutions:
- KL-Regularized RL: While adding a KL-divergence penalty helps, many existing approaches require training a separate value function (Q-network), which is notoriously unstable in diffusion models.
- Gradient Estimation: Methods that rely on Monte Carlo estimators suffer from high variance, while those using direct backpropagation through the entire denoising chain can be computationally expensive and unstable.
- Credit Assignment: Standard formulations often treat all denoising steps equally, ignoring that early steps have less influence on the final sample quality compared to later steps.
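
The variance gap between gradient estimators can be seen on a toy problem. The sketch below is not from the paper: it assumes a one-dimensional Gaussian "policy" $x \sim \mathcal{N}(\theta, \sigma^2)$, a made-up reward $f(x) = x^2$, and uses the score-function (REINFORCE) estimator as a stand-in for the Monte Carlo estimators mentioned above. Both estimators target the same gradient, $2\theta$, but the reparameterized (pathwise) one has much lower variance in this setting.

```python
import random
import statistics


def gradient_samples(theta=1.0, sigma=1.0, n=20000, seed=0):
    """Per-sample estimates of d/dtheta E[f(x)] for the toy reward f(x) = x**2."""
    rng = random.Random(seed)
    score_function, pathwise = [], []
    for _ in range(n):
        eps = rng.gauss(0.0, 1.0)
        x = theta + sigma * eps  # reparameterized sample from N(theta, sigma^2)
        # Score-function estimator: f(x) * d/dtheta log N(x; theta, sigma^2)
        score_function.append(x ** 2 * (x - theta) / sigma ** 2)
        # Pathwise estimator: f'(x) * dx/dtheta, with dx/dtheta = 1
        pathwise.append(2.0 * x)
    return score_function, pathwise


sf, pw = gradient_samples()
sf_mean, pw_mean = statistics.fmean(sf), statistics.fmean(pw)
sf_var, pw_var = statistics.variance(sf), statistics.variance(pw)
print(f"score-function: mean={sf_mean:.3f} var={sf_var:.2f}")
print(f"pathwise:       mean={pw_mean:.3f} var={pw_var:.2f}")
```

For these toy values ($\theta = 1$, $\sigma = 1$) the analytic per-sample variances are 30 for the score-function estimator and 4 for the pathwise one; the gap in a real diffusion model will differ.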

2. Methodology: SQDF

The authors propose Soft Q-based Diffusion Finetuning (SQDF), a novel KL-regularized RL framework that optimizes diffusion models using a reparameterized policy gradient guided by a training-free, differentiable estimation of the soft Q-function.

Core Mechanism

SQDF formulates the diffusion reverse process as a Markov Decision Process (MDP). Instead of training a separate Q-network, it approximates the Soft Q-function ( $Q^*_{soft}$ ) using a single-step posterior mean approximation derived from Tweedie's formula.

Approximation: $Q^*_{soft}(x_t, x_{t-1}) \approx r(\hat{x}_0(x_{t-1}))$ , where $\hat{x}_0$ is the estimated clean image.
Reparameterized Policy Gradient: By using the reparameterization trick ( $x_{t-1} = \mu_\theta(x_t, t) + \sigma_t \epsilon$ ), the gradient of the reward with respect to the model parameters can be computed directly and efficiently, avoiding high-variance Monte Carlo estimators.
Objective: The loss function minimizes the negative soft Q-value while penalizing deviation from the pre-trained model (reference policy $p'$ ) via KL-divergence:
$\mathcal{L}(\theta) \approx \mathbb{E} \left[ -r(\hat{x}_0(x_{t-1})) + \alpha D_{KL}(p_\theta \,\|\, p') \right]$
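
To make the objective concrete, here is a one-dimensional toy sketch, not the authors' implementation. It assumes a scalar policy mean $\mu_\theta = \theta$, a fixed reference mean `mu_ref`, an identity function standing in for $\hat{x}_0$, and a hypothetical quadratic reward $r(x) = -(x - \text{target})^2$. Because both policies are then Gaussians sharing the variance $\sigma_t^2$, the KL term has the closed form $(\mu_\theta - \mu')^2 / (2\sigma_t^2)$, and the gradient can be written out by hand along the reparameterized path.

```python
import random


def finetune_toy(target=2.0, mu_ref=0.0, sigma=0.5, alpha=0.5,
                 lr=0.05, steps=500, batch=64, seed=0):
    """Gradient descent on E[-r(x0_hat(x_prev))] + alpha * KL for a scalar mean."""
    rng = random.Random(seed)
    theta = mu_ref  # start fine-tuning from the reference policy
    for _ in range(steps):
        grad = 0.0
        for _ in range(batch):
            eps = rng.gauss(0.0, 1.0)
            x_prev = theta + sigma * eps  # x_{t-1} = mu_theta + sigma_t * eps
            x0_hat = x_prev               # assumption: identity stand-in for x0_hat
            # d(-r)/dtheta by the chain rule; dx0_hat/dx_prev = dx_prev/dtheta = 1
            grad_reward = 2.0 * (x0_hat - target)
            # d/dtheta of alpha * (theta - mu_ref)^2 / (2 * sigma^2)
            grad_kl = alpha * (theta - mu_ref) / sigma ** 2
            grad += grad_reward + grad_kl
        theta -= lr * grad / batch
    return theta


theta_star = finetune_toy()
# Closed-form minimizer of the expected toy loss, for comparison
kl_weight = 0.5 / 0.5 ** 2
expected = (2.0 * 2.0 + kl_weight * 0.0) / (2.0 + kl_weight)
print(f"learned mean: {theta_star:.3f}, closed form: {expected:.3f}")
```

In this toy the learned mean settles between the reference mean (0) and the reward optimum (2). Raising `alpha` pulls it back toward the reference, which is the role the KL penalty plays against over-optimization.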

Three Key Innovations

To stabilize training and improve performance, SQDF introduces three specific components:

Discount Factor ( $\gamma$ ) for Credit Assignment:
- In standard diffusion MDPs, all steps are often weighted equally ( $\gamma=1$ ). SQDF introduces a discount factor $\gamma \in [0, 1)$ to exponentially down-weight early denoising steps.
- Rationale: Early steps have a lower signal-to-noise ratio and less influence on the final sample quality. Discounting them reduces the impact of approximation errors in the early stages, leading to more stable credit assignment.
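
A minimal sketch of such a weighting, assuming each step is weighted by $\gamma$ raised to the number of denoising steps that remain after it (the exact exponent convention in the paper may differ). The function name and the value $\gamma = 0.9$ are illustrative only.

```python
def discount_weights(num_steps, gamma):
    """Weight per denoising step, listed from the first (noisiest) to the last."""
    if not 0.0 <= gamma <= 1.0:
        raise ValueError("gamma must lie in [0, 1]")
    # Step i has (num_steps - 1 - i) steps remaining after it.
    return [gamma ** (num_steps - 1 - i) for i in range(num_steps)]


weights = discount_weights(num_steps=50, gamma=0.9)
print(f"first step: {weights[0]:.4f}, last step: {weights[-1]:.4f}")
```

With $\gamma = 1$ every weight equals 1, which recovers the undiscounted case.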
Consistency Models for Q-Function Estimation:
- Tweedie's formula (used for $\hat{x}_0$ estimation) is highly inaccurate at high noise levels (the early denoising steps, i.e. large $t$), leading to poor Q-function estimates.
- Solution: SQDF integrates a Consistency Model ( $f_\psi$ ) to predict the clean sample $\hat{x}_0$ from noisy inputs. Consistency models are trained to map any noisy state directly to a clean sample, providing a more accurate and more uniform posterior mean approximation across all timesteps than standard diffusion sampling or Tweedie's formula.
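
One way to see why a posterior-mean estimate degrades at high noise is a toy case, sketched here under assumptions that are not from the paper: one-dimensional data that is either $-1$ or $+1$ with equal probability, and the usual forward process $x_t = \sqrt{\bar\alpha_t}\,x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon$. In this case the exact posterior mean that Tweedie's formula targets is $\tanh\big(\sqrt{\bar\alpha_t}\,x_t / (1-\bar\alpha_t)\big)$. At low noise it snaps to a valid data point, but at high noise it sits near $0$, a value that never occurs in the data, so a reward evaluated there says little about the final sample.

```python
import math


def posterior_mean_x0(x_t, alpha_bar):
    """E[x_0 | x_t] for toy data x_0 in {-1, +1} with equal probability."""
    return math.tanh(math.sqrt(alpha_bar) * x_t / (1.0 - alpha_bar))


# Same noisy observation, two different noise levels (alpha_bar is hypothetical).
low_noise = posterior_mean_x0(x_t=0.9, alpha_bar=0.99)   # late denoising step
high_noise = posterior_mean_x0(x_t=0.9, alpha_bar=0.01)  # early denoising step
print(f"low noise:  {low_noise:.3f}")
print(f"high noise: {high_noise:.3f}")
```

A predictor that outputs a plausible clean sample instead of an average, which is the role the consistency model plays in SQDF, avoids this particular failure in the toy setting.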
Off-Policy Replay Buffer:
- Unlike on-policy methods, SQDF utilizes an experience replay buffer to store past trajectories.
- Benefit: This allows the model to reuse rare, high-reward, and diverse samples. It helps manage the reward-diversity trade-off by preventing the model from forgetting diverse modes (catastrophic forgetting) and improving mode coverage.
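
A bare-bones sketch of a reward-prioritized buffer follows. The class name, the "evict the lowest reward when full" rule, and uniform sampling are all assumptions made for illustration; the source only states that past trajectories are stored and reused, and the diversity-aware selection it mentions is omitted here.

```python
import heapq
import random


class RewardReplayBuffer:
    """Keeps the `capacity` highest-reward trajectories seen so far (hypothetical design)."""

    def __init__(self, capacity, seed=0):
        self.capacity = capacity
        self._heap = []    # min-heap of (reward, insertion_id, trajectory)
        self._next_id = 0  # unique tie-breaker so trajectories are never compared
        self._rng = random.Random(seed)

    def add(self, trajectory, reward):
        item = (reward, self._next_id, trajectory)
        self._next_id += 1
        if len(self._heap) < self.capacity:
            heapq.heappush(self._heap, item)
        elif reward > self._heap[0][0]:
            # Buffer is full: replace the current lowest-reward entry.
            heapq.heapreplace(self._heap, item)

    def sample(self, batch_size):
        """Draw up to `batch_size` stored trajectories uniformly, without replacement."""
        k = min(batch_size, len(self._heap))
        return [traj for _, _, traj in self._rng.sample(self._heap, k)]

    def __len__(self):
        return len(self._heap)


buffer = RewardReplayBuffer(capacity=3)
for i in range(10):
    buffer.add(f"traj-{i}", reward=float(i))
print(sorted(buffer.sample(3)))
```

Sampling from stored trajectories rather than only from the current policy is what makes the update off-policy.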

3. Key Contributions

Training-Free Soft Q-Approximation: SQDF eliminates the need for unstable value function training by using a differentiable, one-step posterior mean approximation of the soft Q-function.
Stabilization Techniques: The introduction of a discount factor and consistency models significantly improves the reliability of the reward gradient signal, particularly in early denoising steps.
Off-Policy Efficiency: The use of a replay buffer enables efficient sample usage and better diversity preservation compared to on-policy baselines.
Comprehensive Evaluation: The method is validated on both differentiable reward tasks (Aesthetic score, HPS) and online black-box optimization scenarios.

4. Experimental Results

The authors evaluated SQDF on Stable Diffusion 1.5 and Stable Diffusion XL against baselines like DDPO, DRaFT, ReFL, and SEIKO.

Text-to-Image Fine-Tuning:
- SQDF achieved superior target rewards (Aesthetic and HPS scores) while maintaining significantly higher alignment and diversity scores compared to baselines.
- Baselines like DRaFT and ReFL achieved high rewards but suffered from severe semantic and diversity collapse. DDPO failed to reach comparable reward levels.
- SQDF occupied the Pareto frontier, offering the best trade-off between reward maximization and sample quality.
Online Black-Box Optimization:
- In a setting with limited oracle queries (where the true reward is a black box), SQDF demonstrated high sample efficiency.
- It maintained naturalness and diversity where other methods (like SEIKO and PPO+KL) degraded significantly, suggesting robustness to imperfect reward proxies.
Ablation Studies:
- Removing the discount factor led to slower convergence and lower alignment/diversity.
- Removing the consistency model reduced target reward performance due to inaccurate Q-estimates.
- Removing the replay buffer resulted in lower diversity scores.

5. Significance

SQDF advances the alignment of diffusion models by connecting KL-regularized reinforcement learning with the mechanics of diffusion sampling. The result is a stable, sample-efficient fine-tuning approach whose soft Q-estimate needs no separately trained value network.

It addresses the critical issue of reward over-optimization without sacrificing sample diversity or naturalness.
It provides a practical framework for deploying diffusion models in real-world scenarios where reward functions are complex, black-box, or expensive to query.
The methodology applies across diffusion backbones (validated on SD 1.5 and SDXL) and may extend to other generative tasks beyond image synthesis.

Code Availability: The authors have released their code at https://github.com/Shin-woocheol/SQDF.