What Makes a Reward Model a Good Teacher? An Optimization Perspective

This paper argues that for a reward model to effectively guide Reinforcement Learning from Human Feedback (RLHF), it must induce sufficient reward variance to ensure a non-flat optimization landscape, revealing that accuracy alone is an insufficient metric for evaluating a reward model's teaching capability.

Noam Razin, Zixuan Wang, Hubert Strauss, Stanley Wei, Jason D. Lee, Sanjeev Arora

Published 2026-03-02

Imagine you are trying to teach a very talented but naive student (the Language Model) how to write great stories. You can't write the story for them, so you hire a Teacher (the Reward Model) to grade their work and tell them what to improve.

For years, the industry has believed that the best teacher is simply the one who is most accurate. If the teacher correctly identifies that Story A is better than Story B 99% of the time, we assume they are the perfect guide.

This paper, however, argues that accuracy isn't everything. In fact, a teacher can be perfectly accurate but still be a terrible guide because they are too "boring" or "flat" in their feedback.

Here is the breakdown of the paper's findings using simple analogies:

1. The Problem: The "Flat Landscape" Trap

Imagine the student is trying to climb a mountain to reach the peak of "Great Writing."

  • The Reward Model is the map and the compass.
  • Accuracy is how correctly the map points to the peak.
  • Reward Variance is how much the map tells the student to move.

The Paper's Discovery:
If your teacher is accurate but gives low variance feedback, it's like a teacher who says, "Story A is slightly better than Story B, and Story C is also slightly better than Story D." The differences are so tiny that the student feels like they are walking on a flat, featureless plain. They don't know which direction to run because every step feels the same. They get stuck, and progress is agonizingly slow.

Conversely, a teacher who is slightly less accurate but gives high variance feedback is like a teacher who shouts, "Story A is AMAZING! Story B is TERRIBLE!" Even if they get a few details wrong, the huge difference in scores gives the student a clear, strong signal to run in the right direction. They climb the mountain much faster.
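The flat-landscape intuition can be made concrete with a tiny numerical sketch (my own illustration, not the paper's code). For a softmax policy over a handful of outputs, the gradient of the expected reward with respect to each logit is p_i · (r_i − E[r]), so the gradient shrinks toward zero as the reward model's scores flatten out, even when the ranking is perfect:

```python
import numpy as np

# Softmax policy over 4 candidate outputs; the reward model scores each one.
# For expected reward E[r] = sum_i p_i * r_i, the gradient w.r.t. logit z_i
# is p_i * (r_i - E[r]): flatter scores mean a vanishing gradient.

def expected_reward_grad(logits, rewards):
    p = np.exp(logits - logits.max())
    p /= p.sum()
    return p * (rewards - p @ rewards)  # d E[r] / d logits

logits = np.zeros(4)  # uniform starting policy

flat_rewards = np.array([0.51, 0.52, 0.53, 0.54])   # perfectly ranked, low variance
spicy_rewards = np.array([0.10, 0.40, 0.70, 1.00])  # same ranking, high variance

g_flat = expected_reward_grad(logits, flat_rewards)
g_spicy = expected_reward_grad(logits, spicy_rewards)

print(np.linalg.norm(g_flat))   # tiny: the "flat plain"
print(np.linalg.norm(g_spicy))  # roughly 30x larger: a clear slope to climb
```

Both teachers rank the outputs identically; only the spread of their scores differs, and that spread is what the gradient actually sees.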

2. The Core Lesson: You Need "Spicy" Feedback

The paper proves mathematically that for the student to learn quickly, the teacher's scores need to have enough "spice" (variance across the student's own outputs).

  • Low Variance (Flat): The teacher gives scores like 0.51, 0.52, 0.53. The student is confused. Optimization is slow.
  • High Variance (Spicy): The teacher gives scores like 0.1, 0.5, 0.9. The student knows exactly what to aim for. Optimization is fast.

The Shocking Finding:
A teacher who is 100% accurate but gives flat scores (low variance) can actually produce a worse student than a teacher who is only 70% accurate but gives spicy, high-variance scores. The "perfect" teacher fails because the student can't figure out how to move.
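To see this trade-off in action, here is a toy simulation (my own illustrative numbers, not the paper's experiments): a "flat" teacher that ranks all four outputs perfectly but with tiny gaps, versus a "spicy" teacher that swaps one pair of outputs but spreads its scores widely. Both students run the same number of policy-gradient steps, and we then grade the final policy against the ground truth:

```python
import numpy as np

# Toy RLHF run: gradient ascent on the teacher's expected reward for a
# softmax policy over 4 outputs, then measure the TRUE quality reached.

true_quality = np.array([0.0, 1.0, 2.0, 3.0])       # ground-truth quality
flat_teacher = np.array([0.50, 0.51, 0.52, 0.53])   # perfect ranking, flat scores
spicy_teacher = np.array([0.0, 2.0, 1.0, 3.0])      # one pair swapped, big gaps

def train(teacher, steps=200, lr=1.0):
    z = np.zeros(4)
    for _ in range(steps):
        p = np.exp(z - z.max())
        p /= p.sum()
        z += lr * p * (teacher - p @ teacher)  # policy-gradient step on the proxy
    p = np.exp(z - z.max())
    p /= p.sum()
    return p @ true_quality  # true expected quality of the final policy

print(train(flat_teacher))   # crawls upward: every gradient step is tiny
print(train(spicy_teacher))  # races toward the best output despite the swap
```

The spicy teacher misranks one pair, yet its student ends up with far higher true quality, because the large score gaps give usable gradients at every step.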

3. The "One Size Does Not Fit All" Rule

The paper also discovered that a teacher who is great for one student might be terrible for another.

  • Student A might be a beginner who needs a teacher who screams "Good job!" and "Bad job!" (High variance) to get motivated.
  • Student B might be an advanced student who needs a teacher who gives very specific, nuanced, and subtle feedback (which might look like low variance to an outsider).

If you take a teacher who works wonders for Student A and hand them to Student B, the student might get confused and stop learning. The "best" teacher depends entirely on the specific student they are teaching.
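The reason is that reward variance is measured under the student's own output distribution, so the same reward model can induce a steep landscape for one policy and a flat one for another. A minimal sketch with invented numbers:

```python
import numpy as np

# The same reward model can look "spicy" to one student and "flat" to
# another: reward variance is taken under the policy's own output
# distribution, so it changes as the student changes.

rewards = np.array([0.0, 0.1, 0.2, 3.0])  # one standout output

def reward_variance(policy_probs, rewards):
    mean = policy_probs @ rewards
    return policy_probs @ (rewards - mean) ** 2

beginner = np.array([0.25, 0.25, 0.25, 0.25])  # spreads mass everywhere
advanced = np.array([0.0, 0.0, 0.02, 0.98])    # already near the best output

print(reward_variance(beginner, rewards))  # large: strong learning signal
print(reward_variance(advanced, rewards))  # small: landscape looks flat
```

For the beginner, the standout score creates high variance and a strong signal; for the advanced student, who already concentrates on the best output, the very same scores yield almost no variance at all.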

4. Real-World Experiments

The researchers tested this with real language models (up to 8 billion parameters).

  • They created "perfect" teachers that were accurate but gave flat scores.
  • They created "imperfect" teachers that gave big, clear score differences.
  • Result: The "imperfect" teachers with high variance helped the language models learn much faster and better than the "perfect" flat teachers. In some cases, using a proxy teacher (the imperfect one) was even better than letting the model try to optimize based on the "ground truth" directly, because the proxy teacher provided a clearer path up the mountain.

Summary: What Makes a Good Teacher?

To summarize the paper in one sentence: A good teacher for AI isn't just the one who is right; it's the one who gives feedback loud and clear enough to be heard.

If you are building AI systems, don't just look for the reward model with the highest accuracy score. Look for the one that creates a steep, clear path (high variance) for the AI to follow, even if it means the teacher isn't 100% perfect. Otherwise, your AI might get stuck on a flat plain, knowing what's right but unable to move toward it.
