Fine-Tuning Diffusion Models via Intermediate Distribution Shaping

This paper proposes a framework for fine-tuning diffusion and flow models by shaping intermediate noise distributions, introducing P-GRAFT for reward-based alignment and inverse noise correction for error correction without explicit rewards, both of which demonstrate superior performance across diverse generative tasks compared to existing methods.

Gautham Govind Anil, Shaan Ul Haque, Nithish Kannen, Dheeraj Nagaraj, Sanjay Shakkottai, Karthikeyan Shanmugam

Published 2026-03-04

Imagine you have a master chef (a Diffusion Model) who is incredibly talented at cooking. They've been trained on millions of recipes and can create a delicious meal from scratch. However, sometimes the chef makes mistakes: maybe the soup is too salty, or the presentation isn't quite what the customer asked for.

In the world of AI, we call this fine-tuning. We want to teach the chef to make better meals without firing them and hiring a new one from scratch.

This paper introduces two clever new ways to teach this chef, using a concept called "Intermediate Distribution Shaping." Here is the breakdown in simple terms:

1. The Problem: The Chef's "Black Box"

Usually, to teach a chef, you might say, "If the soup tastes good, give them a gold star; if it tastes bad, give them a red card." This is called Reinforcement Learning.

But with diffusion models (the AI chef), there's a catch. The chef doesn't just spit out a finished dish instantly. They start with a bowl of random noise (like a pile of unidentifiable ingredients) and slowly, step-by-step, turn it into a soup.

  • The Issue: By the time the soup is finished, it's too late to tell exactly which step went wrong. Worse, computing the "perfect" teaching signal for the chef is computationally intractable for these complex models.
  • The Old Way: Previous methods tried to estimate that signal at every single denoising step (for example, with policy gradients), which was unstable and often made the chef worse at cooking.

2. The Solution: "GRAFT" (The Rejection Sampling Chef)

The authors first propose a method called GRAFT (Generalized Rejection sAmpling Fine-Tuning).

The Analogy: Imagine the chef makes 100 bowls of soup. You taste all 100. You keep the top 10 best ones and throw away the rest. You then say to the chef, "Next time, try to make soup that looks and tastes like these 10 winners."

  • Why it works: Instead of trying to calculate a complex math score for every single step, you just let the chef make many attempts, pick the winners, and learn from them. It's like a "Best of N" approach.
  • The Magic: The paper proves that this simple "pick the winners" method is mathematically equivalent to the complex, hard-to-calculate "perfect" teaching method.
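The "pick the winners" loop can be sketched in a few lines. Everything here is a toy stand-in: `reward`, `generate`, and `graft_round` are hypothetical 1-D placeholders for the real reward model and diffusion sampler, not the paper's actual code.

```python
import random

def reward(sample):
    # Hypothetical reward: how close the 1-D "sample" is to a target value.
    return -abs(sample - 1.0)

def generate(n):
    # Stand-in for the diffusion model's sampler: draws n rough samples.
    return [random.gauss(0.0, 1.0) for _ in range(n)]

def graft_round(n=100, keep=10):
    # 1. Let the model make n attempts.
    samples = generate(n)
    # 2. Score every attempt and keep only the top `keep` winners.
    winners = sorted(samples, key=reward, reverse=True)[:keep]
    # 3. The winners become the fine-tuning dataset: the model is then
    #    trained (with its ordinary diffusion loss) to imitate them.
    return winners
```

Repeating this round with the freshly fine-tuned model is the "Best of N" loop the paper analyzes.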

3. The Big Innovation: "P-GRAFT" (The Mid-Course Correction)

This is the paper's main breakthrough. The authors realized that waiting until the soup is completely finished to judge it is actually inefficient.

The Analogy:
Imagine the chef is painting a picture.

  • Old Way: Wait until the painting is 100% done. If it's bad, throw it away. If it's good, keep it.
  • P-GRAFT Way: Stop the chef when the painting is 75% done. Look at the canvas. If the colors are already looking promising, keep that canvas and tell the chef, "Great job so far! Now, finish the rest of the painting using your original, pre-trained skills."

Why is this better? (The Bias-Variance Tradeoff)

  • The "Noise" Problem: When the chef is just starting (0% done), the image is just random static. It's very hard to tell if the chef is doing a good job or not because the "signal" is weak.
  • The "Complexity" Problem: When the chef is almost done (99% done), the image is very detailed. It's easy to see if it's good, but it's also very hard for the chef to change their style at the last second without ruining the whole thing.
  • The Sweet Spot: By stopping at an intermediate step (say, 25% or 50% done), you get the best of both worlds. The image has enough detail to judge quality, but it's still early enough that the chef can easily learn to steer the process in the right direction.

The Result: P-GRAFT acts like a coach who steps in halfway through the game to give a quick, high-impact correction, rather than waiting until the final whistle to critique the whole match.
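Under the same toy assumptions (a 1-D "sample", a made-up `denoise_step` standing in for one real reverse-diffusion step, a `partial_reward` that can score an unfinished sample), the mid-course correction might look like this sketch:

```python
import random

STEPS = 100  # total denoising steps in this toy process

def denoise_step(x, step):
    # Stand-in for one reverse-diffusion step: nudge the sample toward
    # a target value while injecting a little fresh noise.
    return 0.95 * x + 0.05 * 1.0 + random.gauss(0.0, 0.05)

def partial_reward(x):
    # Hypothetical reward that can already be evaluated on a
    # partially denoised sample.
    return -abs(x - 1.0)

def p_graft_sample(n=100, keep=10, stop_frac=0.75):
    stop = int(STEPS * stop_frac)
    # 1. Run the reverse process only up to the intermediate step.
    partials = []
    for _ in range(n):
        x = random.gauss(0.0, 1.0)          # start from pure noise
        for step in range(stop):
            x = denoise_step(x, step)
        partials.append(x)
    # 2. Select the most promising intermediate states; these define the
    #    shaped *intermediate* distribution the model is fine-tuned toward.
    winners = sorted(partials, key=partial_reward, reverse=True)[:keep]
    # 3. Finish each winner with the base (pre-trained) model's steps.
    finished = []
    for x in winners:
        for step in range(stop, STEPS):
            x = denoise_step(x, step)
        finished.append(x)
    return finished
```

The only change from plain GRAFT is *where* the selection happens: at `stop_frac` of the way through, rather than at the very end.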

4. The Bonus Trick: "Inverse Noise Correction"

The second part of the paper deals with a specific type of AI model called a Flow Model. These models work like a straight line from "Noise" to "Image."

The Analogy: Imagine the chef is trying to walk from the kitchen (Noise) to the dining room (Image) in a straight line. But the floor is slippery, so they keep drifting off course.

  • The Old Way: Try to teach the chef to walk straighter.
  • The New Way (Inverse Noise Correction): Instead of teaching the chef to walk straighter, you change the starting point. The chef drifts because they start from the wrong spot in the kitchen, so you learn a corrected starting spot such that their usual "slippery" walk lands exactly where it should in the dining room.

This lets the model produce much higher-quality images without retraining the chef at all, and at a fraction of the compute cost.
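A minimal 1-D caricature of the idea, with a hand-picked drift and a closed-form correction instead of learned ones (the paper learns the correction with a small additional flow; all names here are illustrative):

```python
def base_flow(z):
    # Stand-in for the frozen flow model: maps noise z to a sample,
    # but with a systematic drift (it scales and shifts its input).
    return 1.2 * z + 0.3

def noise_correction(z):
    # Correction applied to the *starting noise*, chosen so that
    # base_flow(noise_correction(z)) matches the ideal map z -> z.
    # Here we simply invert the known drift.
    return (z - 0.3) / 1.2

# Running the unchanged flow from the corrected starting point removes
# the drift without touching the flow's weights.
corrected_output = base_flow(noise_correction(0.5))  # ~ 0.5
```

The frozen `base_flow` is never modified; only its input distribution is reshaped, which is why the method is cheap.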

Summary of Results

The authors tested these ideas on:

  • Text-to-Image: Making pictures from descriptions (e.g., "a cat wearing a hat"). P-GRAFT made the images match the text much better than previous methods.
  • Layouts: Arranging elements on a page (like a newspaper).
  • Molecules: Designing chemical structures.
  • Unconditional Images: Just making pretty pictures.

The Takeaway:
Instead of trying to fix the AI at the very end or the very beginning, this paper suggests meeting the AI halfway. By shaping the distribution of the "middle" steps, we can teach these complex models to be smarter, more accurate, and more efficient, all without needing massive amounts of extra computing power.

It's like realizing that to fix a wobbly table, you don't need to replace the whole table or the floor; you just need to put a shim under the right leg at the right time.
