Confronting Reward Overoptimization for Diffusion Models: A Perspective of Inductive and Primacy Biases

Imagine you are teaching a talented artist (the Diffusion Model) to paint beautiful pictures based on your instructions. At first, the artist just tries to make things look realistic. But you want them to make things that humans love—pictures that are aesthetically pleasing, follow specific rules, or match human preferences.

To do this, you hire a Judge (the Reward Model) to give the artist a score after every painting. The artist then tries to paint more things that get high scores from the Judge.

The problem? The artist is too smart and too eager. If you just tell them, "Get the highest score possible," they might start cheating. They might learn to paint weird, nonsensical patterns that trick the Judge into giving a 10/10, even though the picture looks terrible to a human. This is called "Reward Overoptimization" (or "Reward Hacking"). It's like a student memorizing the answer key instead of actually learning the subject; they get the grade, but they don't know the material.

This paper introduces a new way to train the artist to avoid cheating, using two main ideas: Timing and Brain Resets.

1. The Problem with "One-Stop" Judging (Inductive Bias)

The Old Way:
Imagine the artist paints a picture step-by-step, starting with a blurry mess and slowly adding details until it's clear.
In the old methods, the Judge only looks at the final picture and gives a score. The artist doesn't know how they got there, only that the end result was good. So, the artist might take risky, weird shortcuts to get that final score, ignoring the quality of the steps in between.

The New Way (TDPO):
The authors say, "Let's judge the artist at every single step of the painting process!"
Instead of waiting for the final masterpiece, the Judge gives a tiny score for every little brushstroke (every "denoising step").

The Analogy: It's like a music teacher listening to a student practice scales while they are playing, rather than just waiting for the final concert. If the student plays a wrong note early on, the teacher corrects them immediately.
The Result: The artist learns to build the picture correctly from start to finish, not just hack the final result. This makes the training much more efficient and less likely to result in "cheating."

2. The Problem with "Stubborn Neurons" (Primacy Bias)

The Discovery:
Inside the Judge's brain (the neural network), there are billions of tiny switches called neurons. Some are always "on" (active), and some are "off" (dormant).
The researchers found something surprising:

Dormant Neurons (The Sleepers): These are actually good. They act like a safety net, preventing the Judge from becoming too obsessed with one specific trick. They keep the Judge flexible.
Active Neurons (The Overachievers): These are the ones causing the trouble. They get stuck on the first few tricks they learned (the "Primacy Bias"). They keep shouting, "Do it this way! It worked before!" even when it's leading the artist to cheat.

The Solution (TDPO-R):
The authors propose a strategy called TDPO-R.

The Analogy: Imagine the Judge is a coach who gets too fixated on one specific play. Every few weeks, the coach takes a break and resets the players who are shouting the loudest (the active neurons), forcing them to try new strategies.
The Catch: You don't reset the sleepers (dormant neurons). You leave them alone because they are the ones keeping the team balanced.
The Result: By periodically "waking up" the overactive neurons and forcing them to learn fresh, the system stops overfitting to the reward and keeps generating high-quality, diverse images.

Summary of the Magic

The paper combines these two ideas into a new training algorithm:

Time it right: Judge the artist at every step of the process, not just the end.
Reset the brain: Periodically reset the "loud" neurons in the Judge to stop them from getting stuck on old tricks, while keeping the "quiet" neurons to act as a safety brake.

The Outcome:
The artist learns to create beautiful, diverse, and high-quality images that actually match human taste, without falling into the trap of trying to "game the system." It's a smarter, more stable way to teach AI how to create art that humans actually enjoy.

1. Problem Statement: Reward Overoptimization in Diffusion Models

Diffusion models have achieved state-of-the-art performance in generative tasks (e.g., text-to-image). To align these models with human preferences or specific aesthetic goals, researchers use reward-driven fine-tuning (via Reinforcement Learning or supervised learning). However, a critical challenge known as Reward Overoptimization (or "reward hacking") persists.

The Phenomenon: When models are optimized excessively against a learned or handcrafted reward function, they tend to overfit to the specific metrics of that reward. This leads to a degradation in the actual quality of the generated images (fidelity loss) and poor generalization to other, unseen reward functions.
The Gap: Existing methods often treat the diffusion process as a black box, focusing solely on the final output image ( $x_0$ ) for reward calculation. This ignores the sequential, multi-step nature of the denoising process, creating a mismatch between the optimization strategy and the model's inherent structure.
The Goal: The authors aim to identify the underlying causes of this overoptimization and propose a method that balances sample efficiency with robust generalization.

2. Methodology: Inductive and Primacy Biases

The paper approaches the problem through two specific lenses: Inductive Bias (structural mismatch) and Primacy Bias (neural network dynamics).

A. Addressing Inductive Bias: Temporal Diffusion Policy Optimization (TDPO)

Current methods compute rewards only on the final image, ignoring intermediate steps. The authors argue this violates the temporal inductive bias inherent in diffusion models, where the generation process is a sequence of denoising steps.

MDP Reformulation: They reframe the denoising process as a Multi-Step Markov Decision Process (MDP) where every timestep $t$ (from $T$ down to 0) is a state transition.
Temporal Rewards: Instead of a single reward $R(x_0)$ , they introduce temporal rewards $T(x_t, c)$ for intermediate noisy images.
Temporal Critic: Since standard reward models are trained on clean images, the authors train a lightweight Temporal Critic ( $T_\phi$ $T_{ϕ}$ ) to estimate rewards for intermediate noisy states.
- They approximate the temporal reward using a residual learning approach: $T(x_t, c) \approx R(x_0, c) - R_\phi(x_t, c)$ , where $R$ is the pre-trained reward model and $R_\phi$ is a learned residual.
- Encoder Alignment: To ensure efficiency and feature consistency, the Temporal Critic reuses the encoder of the pre-trained reward model, feeding it intermediate latent features.
Per-Timestep Updates: Unlike standard RL which updates per batch, TDPO performs per-timestep gradient updates. This aligns the update frequency with the temporal granularity of the denoising process, improving sample efficiency and stability.

B. Addressing Primacy Bias: Active Neuron Reset (TDPO-R)

The authors investigate Primacy Bias—the tendency of deep RL agents to overfit early training experiences. They analyze the internal states of the Temporal Critic's neurons.

Neuron State Analysis: Neurons are categorized as Active (high activation scores) or Dormant (low activation scores).
Counter-Intuitive Finding:
- Standard literature suggests resetting dormant neurons improves plasticity.
- Discovery: In the context of reward overoptimization, dormant neurons act as an adaptive regularization that prevents overfitting. Resetting them worsens overoptimization.
- Active Neurons: Conversely, active neurons are the primary carriers of primacy bias. They overfit to early reward signals, leading to reward hacking.
TDPO-R Algorithm: To mitigate this, the authors propose TDPO-R, which periodically resets the weights of active neurons in the Temporal Critic. This forces the critic to relearn regularization patterns without catastrophic forgetting, effectively breaking the cycle of overfitting to early reward signals.

3. Key Contributions

Novel Perspective: First work to analyze reward overoptimization in diffusion models through the dual lenses of inductive bias (temporal mismatch) and primacy bias (neuron state dynamics).
TDPO Framework: Introduces a novel RL framework that exploits temporal inductive bias by assigning rewards to intermediate denoising steps and utilizing a Temporal Critic with encoder alignment.
TDPO-R & Neuron Reset: Identifies that active neurons in the critic drive overoptimization (contrary to the belief that dormant neurons are the problem) and proposes a periodic reset strategy for these active neurons to mitigate bias.
New Evaluation Metric: Proposes Cross-Reward Generalization as a quantitative proxy for measuring reward overoptimization. A model that overoptimizes will perform well on the training reward but poorly on out-of-domain rewards.

4. Empirical Results

The authors evaluated their methods using Stable Diffusion v1.4 with various reward functions (Aesthetic Score, PickScore, HPSv2, ImageReward).

Sample Efficiency: TDPO and TDPO-R significantly outperform baselines (DDPO, AlignProp) in terms of reward queries required to reach high performance. The per-timestep update strategy accelerates convergence.
Mitigating Overoptimization:
- Cross-Reward Generalization: TDPO-R maintains high performance on out-of-domain rewards (e.g., training on Aesthetic Score but evaluating on ImageReward), whereas baselines like DDPO show significant performance drops as training progresses.
- Qualitative Results: Images generated by TDPO-R exhibit higher fidelity, better diversity, and fewer artifacts (e.g., saturation in style/background) compared to over-optimized baselines.
Ablation Studies:
- Resetting dormant neurons worsened overoptimization.
- Resetting active neurons significantly improved generalization.
- Resetting all neurons caused catastrophic forgetting.
- The neuron reset strategy outperformed traditional KL regularization in preventing overoptimization.

5. Significance and Impact

Theoretical Insight: The paper provides a fundamental understanding of why diffusion models overfit, linking it to the mismatch between reward structures and the denoising process, as well as specific neural dynamics (primacy bias in active neurons).
Practical Utility: TDPO-R offers a scalable, training-efficient solution for aligning diffusion models with human preferences without sacrificing image quality or generalization.
Broader Implications: The findings regarding neuron states (dormant vs. active) and their role in regularization vs. overfitting may offer new avenues for research in general Deep Reinforcement Learning and neural network plasticity.

In summary, TDPO-R successfully bridges the gap between diffusion models and human preferences by respecting the temporal nature of generation and dynamically managing the internal plasticity of the reward critic, resulting in more robust and reliable generative AI.

Confronting Reward Overoptimization for Diffusion Models: A Perspective of Inductive and Primacy Biases

1. The Problem with "One-Stop" Judging (Inductive Bias)

2. The Problem with "Stubborn Neurons" (Primacy Bias)

Summary of the Magic

1. Problem Statement: Reward Overoptimization in Diffusion Models

2. Methodology: Inductive and Primacy Biases

A. Addressing Inductive Bias: Temporal Diffusion Policy Optimization (TDPO)

B. Addressing Primacy Bias: Active Neuron Reset (TDPO-R)

3. Key Contributions

4. Empirical Results

5. Significance and Impact

More like this

Empowering Epidemic Response: The Role of Reinforcement Learning in Infectious Disease Control

Pure and Physics-Guided Deep Learning Solutions for Spatio-Temporal Groundwater Level Prediction at Arbitrary Locations

MAGNET: Autonomous Expert Model Generation via Decentralized Autoresearch and BitNet Training

A Compression Perspective on Simplicity Bias

Incorporating contextual information into KGWAS for interpretable GWAS discovery