Rethinking Preference Alignment for Diffusion Models with Classifier-Free Guidance

Imagine you have an incredibly talented artist who has spent years painting millions of pictures from the internet. This artist, let's call them "The Base Model," is amazing at creating images. They can draw a cat, a sunset, or a spaceship. However, because they learned from the whole internet, their taste is a bit... chaotic. Sometimes they draw a cat with six legs, or a sunset that looks like a bowl of soup. They don't quite understand what humans actually prefer.

To fix this, we need to teach the artist what we like. This paper introduces a new, smarter way to do that teaching, comparing it to the old, clunky methods.

The Problem: The "Over-Correcting" Student

The old way to teach the artist (called DPO) is like a strict teacher who says: "Show me a picture of a cat you like (Positive) and a picture of a cat you hate (Negative). Now, change your brain to make the 'like' picture and delete the 'hate' picture."

The problem? If you do this too much, the student gets confused. They start memorizing the specific examples instead of learning the concept.

The Analogy: Imagine a student studying for a math test by only memorizing the answers to three specific practice questions. When they see a slightly different question on the real test, they fail because they memorized the answers instead of learning the logic.
The Result: The artist starts making weird, broken images that look nothing like a real cat, just because they are trying too hard to avoid the "hate" examples. This is called overfitting.

The Solution: The "Guide" vs. The "Artist"

The authors of this paper had a brilliant idea. Instead of forcing the artist to completely rewrite their brain, what if we just gave them a guide during the painting process?

They call their method PGD (Preference-Guided Diffusion).

The Creative Metaphor: The Sculptor and the GPS

Imagine the Base Model is a Sculptor chiseling a block of marble. They know how to chip away stone, but they don't know exactly what shape you want.

The Old Way (DPO): You tell the sculptor, "Forget everything you know about stone. Only carve the shape I want." The sculptor gets nervous, forgets how to hold the chisel, and ends up making a weird lump.
The New Way (PGD): You keep the sculptor's original skills (the "Base Model"). But, you hire a GPS Guide (the "Preference Model").
- The Sculptor starts chiseling based on their natural talent.
- The GPS Guide whispers: "Hey, you're drifting left! The human wants a nose there, not an ear. Pull it back a bit."
- The Sculptor listens, adjusts the chisel, and keeps going.

The magic is that the Sculptor still remembers how to be a sculptor (preserving quality and diversity), but the GPS ensures the final statue looks exactly like the human's vision.

The "Contrastive" Upgrade (cPGD)

The paper goes one step further with cPGD.

Imagine the GPS Guide isn't just one person, but a Team of Two:

The "Yes" Coach: Someone who only looks at pictures humans loved.
The "No" Coach: Someone who only looks at pictures humans hated.

At every step of the painting, the system asks: "What would the 'Yes' Coach do? What would the 'No' Coach do?"
Then, it calculates the difference: "Do what the 'Yes' Coach says, but subtract what the 'No' Coach says."

Analogy: It's like navigating a maze. Instead of just being told "Go Left," you are told "Go Left (because that's the exit) AND avoid going Right (because that's a dead end)." This creates a much sharper, clearer path to the goal.

Why is this better?

No "Brain Damage": Because the original artist (Base Model) isn't being forced to forget their skills, they don't suffer from "catastrophic forgetting." They still make high-quality, diverse images.
Plug-and-Play: You can train this "GPS Guide" separately. Once it's ready, you can plug it into any version of the artist. You don't have to retrain the whole artist from scratch.
Adjustable Strength: A "guidance weight" sets how hard the GPS pushes. If the GPS gets too aggressive (too high a weight), the image might get weird. But the paper shows that with the right settings, you get a good balance: a beautiful image that also follows your instructions.

The Bottom Line

This paper is about teaching AI art tools to listen better without breaking them.

Instead of forcing the AI to completely change its personality to please us (which makes it act weird), we simply give it a real-time coach that nudges it in the right direction while it works. The result is art that is not only beautiful and diverse but also exactly what the human asked for.

1. Problem Statement

Aligning large-scale text-to-image (T2I) diffusion models with nuanced human preferences is a critical challenge. While Direct Preference Optimization (DPO) has emerged as a popular alternative to Reinforcement Learning from Human Feedback (RLHF) due to its simplicity, it suffers from significant limitations when applied to diffusion models:

Overfitting and Mode Collapse: DPO often overfits to the preference dataset, leading to a loss of diversity and "mode collapse" where the model generates repetitive or degenerate samples.
Generalization Gap: DPO-tuned models struggle to generalize to out-of-distribution prompts and can exhibit catastrophic forgetting of the base model's prior capabilities.
Training Instability: The DPO objective, which treats alignment as a binary classification problem on preference pairs, can lead to unstable training dynamics, particularly when the preference dataset is small or noisy.

The authors hypothesize that the root cause of these issues is the reliance on full fine-tuning of the base model, which disrupts the underlying data distribution.

2. Methodology

The paper proposes a paradigm shift: instead of viewing alignment as a retraining problem, it frames it as an inference-time guidance problem, inspired by Classifier-Free Guidance (CFG).

Core Concept: Preference-Guided Diffusion (PGD)

The authors reinterpret the alignment process through the lens of CFG. In standard CFG, sampling is guided by a linear combination of unconditional and conditional predictions:
$\hat{\epsilon} = \epsilon_u + w(\epsilon_c - \epsilon_u)$
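In PGD, the authors treat the base model as the unconditional prior ($\pi_{ref}$) and a lightly fine-tuned model (trained on preference data) as the conditional signal ($\pi_{DPO}$). Because a noise prediction is a rescaled negative score, the same combination can be written in score form, and alignment is achieved at inference time by sampling with: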
$\nabla \log \pi_{PGD}(x) = \nabla \log \pi_{ref}(x) + w \left( \nabla \log \pi_{DPO}(x) - \nabla \log \pi_{ref}(x) \right)$

Key Insight: The "control signal" does not need to be a fully converged, overfitted model. It only needs a few training steps. The guidance weight $w$ amplifies the preference signal while the base model preserves the prior distribution, preventing mode collapse.
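
As a minimal sketch of the PGD combination above, the snippet below applies the same linear rule to plain Python lists standing in for model outputs. The function name `pgd_guided_prediction` and the toy numbers are illustrative assumptions, not code from the paper; in a real sampler the two inputs would be the base and lightly fine-tuned networks' predictions at the current denoising step.

```python
from typing import List, Sequence


def pgd_guided_prediction(
    pred_ref: Sequence[float],
    pred_tuned: Sequence[float],
    w: float,
) -> List[float]:
    """Return pred_ref + w * (pred_tuned - pred_ref), element by element.

    Hypothetical helper: pred_ref stands in for the base model's output and
    pred_tuned for the lightly fine-tuned model's output at one step.
    """
    if len(pred_ref) != len(pred_tuned):
        raise ValueError("both predictions must have the same length")
    return [r + w * (t - r) for r, t in zip(pred_ref, pred_tuned)]


# Toy values chosen only for illustration.
pred_ref = [0.5, -1.0, 2.0]
pred_tuned = [1.0, -0.5, 1.0]

base_like = pgd_guided_prediction(pred_ref, pred_tuned, w=0.0)   # same as pred_ref
tuned_like = pgd_guided_prediction(pred_ref, pred_tuned, w=1.0)  # same as pred_tuned
amplified = pgd_guided_prediction(pred_ref, pred_tuned, w=4.0)   # pushed past pred_tuned
```

Setting $w = 0$ recovers the base model and $w = 1$ recovers the fine-tuned model, so weights above 1 extrapolate beyond the fine-tuned model along the preference direction. This is why a weakly trained guidance model can still produce a strong effect.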

Extension: Contrastive PGD (cPGD)

To further enhance generalization and decouple the learning of positive and negative preferences, the authors propose cPGD. Instead of training a single model on preference pairs (which requires complex loss balancing), they train two independent models:

Positive Model ( $\theta_+$ ): Fine-tuned only on positive (preferred) samples using standard diffusion loss.
Negative Model ( $\theta_-$ ): Fine-tuned only on negative (dispreferred) samples using standard diffusion loss.

At inference, the guidance vector is formed by the difference between these two models:
$\nabla \log \pi_{cPGD}(x) = \nabla \log \pi_{ref}(x) + w \left( \nabla \log \pi(x; \theta_+) - \nabla \log \pi(x; \theta_-) \right)$

Theoretical Justification: The authors show that cPGD is equivalent to dynamically reweighting the DPO loss gradients. By separating the learning of "attracting" (positive) and "repelling" (negative) forces, cPGD avoids the unconstrained likelihood shrinkage often seen in DPO, leading to more stable training.
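
The cPGD rule can be sketched in the same way. The snippet below is an illustrative toy, not the authors' implementation: `cpgd_guided_prediction` is a hypothetical helper, and the lists stand in for the outputs of the base, positive, and negative models at one denoising step.

```python
from typing import List, Sequence


def cpgd_guided_prediction(
    pred_ref: Sequence[float],
    pred_pos: Sequence[float],
    pred_neg: Sequence[float],
    w: float,
) -> List[float]:
    """Return pred_ref + w * (pred_pos - pred_neg), element by element.

    Hypothetical helper: pred_pos and pred_neg stand in for the models
    fine-tuned on preferred and dispreferred samples respectively.
    """
    if not (len(pred_ref) == len(pred_pos) == len(pred_neg)):
        raise ValueError("all predictions must have the same length")
    return [r + w * (p - n) for r, p, n in zip(pred_ref, pred_pos, pred_neg)]


# Toy values chosen only for illustration.
pred_ref = [1.0, 0.0]
pred_pos = [1.5, 0.5]
pred_neg = [0.5, 0.5]

guided = cpgd_guided_prediction(pred_ref, pred_pos, pred_neg, w=2.0)
```

In this toy case the second coordinate is left at the base value because the positive and negative models agree there, so the contrastive term only acts where the two models differ.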

3. Key Contributions

Reframing Alignment as Inference: The paper establishes that diffusion model alignment can be treated as a special case of CFG-style inference, eliminating the need for full model retraining to achieve preference alignment.
PGD and cPGD Algorithms:
- PGD: A simple method using a lightly fine-tuned model as a guidance signal.
- cPGD: A contrastive approach using two independently trained models (positive/negative) to form a robust guidance vector.
Pareto Improvements: The methods improve reward scores (alignment), FID (fidelity/prior preservation), and diversity at the same time, whereas DPO typically gains on one at the expense of the others.
Plug-and-Play Capability: The trained guidance modules are transferable. They can be applied to different base models (e.g., applying an SDXL-trained module to a KOALA model) provided they share the same latent space, without retraining the base model.
Distillation: The authors demonstrate that the multi-model inference process can be distilled into a single checkpoint, reducing inference latency while retaining most performance gains.

4. Experimental Results

The methods were evaluated on Stable Diffusion 1.5 (SD1.5) and Stable Diffusion XL (SDXL) using Pick-a-Pic v2 and HPDv3 datasets.

Quantitative Performance:
- Win Rates: PGD and cPGD consistently outperformed baselines (DPO, MaPO, NPO, KTO, SFT) on most reported metrics. On SDXL with Pick-a-Pic v2, cPGD achieved an average win rate of 70.8% against the base model, compared with 66.3% for DPO.
- Metrics: The methods showed superior performance in PickScore, HPSv2/v3, and ImageReward. While Aesthetic scores were sometimes slightly lower (due to the focus on text-image alignment rather than pure aesthetics), the overall alignment with human preference was markedly improved.
- Robustness: The methods performed well even when trained on high-variance datasets (Full HPDv3) and showed better generalization on out-of-distribution prompts (Parti-Prompts).
Qualitative Analysis:
- Visual comparisons showed that DPO often suffered from mode collapse or artifacts, whereas PGD/cPGD retained the structural integrity of the base model while adhering strictly to the prompt.
- Human Preference Study: In a blind human evaluation (1,848 votes), PGD was selected 45.5% of the time, significantly outperforming DPO (29.5%) and the raw base model (18.9%).
Ablation Studies:
- Guidance Weight ( $w$ ): Performance peaks at moderate weights (around 6). Too low yields base-like results; too high causes chaotic predictions.
- Training Steps: cPGD models trained with fewer steps (500) often performed better than heavily overfitted DPO models, confirming the benefit of early stopping and the CFG perspective.
- Partial Guidance: Applying guidance only to the first 30-40 steps of the 50-step diffusion process recovered ~95% of the reward gain with significantly reduced compute cost.
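
To make the partial-guidance saving concrete, the sketch below counts network evaluations under a simple cost model. The cost model is an assumption, not a figure from the paper: one forward pass per unguided step, two per PGD-guided step (base plus fine-tuned model), and any extra passes for text-conditioning CFG are ignored. The helper names are hypothetical.

```python
def use_guidance(step_index: int, guided_steps: int) -> bool:
    """Guide only the first `guided_steps` denoising steps (0-indexed)."""
    return step_index < guided_steps


def count_forward_passes(
    total_steps: int,
    guided_steps: int,
    passes_when_guided: int,
) -> int:
    """Count network evaluations, assuming one pass per unguided step."""
    if not 0 <= guided_steps <= total_steps:
        raise ValueError("guided_steps must lie between 0 and total_steps")
    return sum(
        passes_when_guided if use_guidance(i, guided_steps) else 1
        for i in range(total_steps)
    )


TOTAL_STEPS = 50
PGD_PASSES = 2  # assumption: base model + fine-tuned model per guided step

full = count_forward_passes(TOTAL_STEPS, TOTAL_STEPS, PGD_PASSES)  # guide every step
partial = count_forward_passes(TOTAL_STEPS, 30, PGD_PASSES)        # guide first 30 only
saving = 1 - partial / full                                        # fraction of passes saved
```

Under these assumptions, guiding the first 30 of 50 steps needs 80 forward passes instead of 100. The same function with `passes_when_guided=3` gives a rough count for cPGD, which evaluates the base, positive, and negative models on each guided step.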

5. Significance

This paper offers a fundamental shift in how preference alignment is approached for diffusion models. By moving away from the "train-to-convergence" paradigm of DPO and adopting an inference-time guidance strategy, the authors address the critical issues of overfitting and diversity loss.

Practical Impact: The approach is computationally efficient (requires only light fine-tuning or separate positive/negative training) and highly flexible (plug-and-play across base models that share the same latent space).
Theoretical Insight: It bridges the gap between preference optimization and classifier-free guidance, suggesting that the "reward" in diffusion models can be effectively modeled as a difference in score functions rather than a complex optimization landscape.
Future Direction: The work opens avenues for modular alignment, where preference modules can be swapped or combined without retraining the massive base generative models, making high-quality, aligned T2I generation more accessible and robust.