What Makes a Reward Model a Good Teacher? An Optimization Perspective

This paper argues that for a reward model to effectively guide Reinforcement Learning from Human Feedback (RLHF), it must induce sufficient reward variance to ensure a non-flat optimization landscape, revealing that accuracy alone is an insufficient metric for evaluating a reward model's teaching capability.

Noam Razin, Zixuan Wang, Hubert Strauss, Stanley Wei, Jason D. Lee, Sanjeev Arora

Published 2026-03-02

Imagine you are trying to teach a very talented but naive student (the Language Model) how to write great stories. You can't write the story for them, so you hire a Teacher (the Reward Model) to grade their work and tell them what to improve.

For years, the industry has believed that the best teacher is simply the one who is most accurate. If the teacher correctly identifies that Story A is better than Story B 99% of the time, we assume they are the perfect guide.

This paper, however, argues that accuracy isn't everything. In fact, a teacher can be perfectly accurate but still be a terrible guide because they are too "boring" or "flat" in their feedback.

Here is the breakdown of the paper's findings using simple analogies:

1. The Problem: The "Flat Landscape" Trap

Imagine the student is trying to climb a mountain to reach the peak of "Great Writing."

  • The Reward Model is the map and the compass.
  • Accuracy is how correctly the map points to the peak.
  • Reward Variance is how much the map tells the student to move.

The Paper's Discovery:
If your teacher is accurate but gives low variance feedback, it's like a teacher who says, "Story A is slightly better than Story B, and Story C is also slightly better than Story D." The differences are so tiny that the student feels like they are walking on a flat, featureless plain. They don't know which direction to run because every step feels the same. They get stuck, and progress is agonizingly slow.

Conversely, a teacher who is slightly less accurate but gives high variance feedback is like a teacher who shouts, "Story A is AMAZING! Story B is TERRIBLE!" Even if they get a few details wrong, the huge difference in scores gives the student a clear, strong signal to run in the right direction. They climb the mountain much faster.
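The flat-landscape intuition can be made concrete with a tiny numerical sketch (my own illustration, not the paper's code). For a softmax policy over a handful of outputs, the gradient of the expected reward with respect to each logit is p_i · (r_i − E[r]), so the gradient shrinks toward zero as the reward model's scores flatten out, even when the ranking is perfect:

```python
import numpy as np

# Softmax policy over 4 candidate outputs; the reward model scores each one.
# For expected reward E[r] = sum_i p_i * r_i, the gradient w.r.t. logit z_i
# is p_i * (r_i - E[r]): flatter scores mean a vanishing gradient.

def expected_reward_grad(logits, rewards):
    p = np.exp(logits - logits.max())
    p /= p.sum()
    return p * (rewards - p @ rewards)  # d E[r] / d logits

logits = np.zeros(4)  # uniform starting policy

flat_rewards = np.array([0.51, 0.52, 0.53, 0.54])   # perfectly ranked, low variance
spicy_rewards = np.array([0.10, 0.40, 0.70, 1.00])  # same ranking, high variance

g_flat = expected_reward_grad(logits, flat_rewards)
g_spicy = expected_reward_grad(logits, spicy_rewards)

print(np.linalg.norm(g_flat))   # tiny: the "flat plain"
print(np.linalg.norm(g_spicy))  # roughly 30x larger: a clear slope to climb
```

Both teachers rank the outputs identically; only the spread of their scores differs, and that spread is what the gradient actually sees.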

2. The Core Lesson: You Need "Spicy" Feedback

The paper proves mathematically that for the student to learn quickly, the teacher's scores need to have enough "spice" (variance across the student's own outputs).

  • Low Variance (Flat): The teacher gives scores like 0.51, 0.52, 0.53. The student is confused. Optimization is slow.
  • High Variance (Spicy): The teacher gives scores like 0.1, 0.5, 0.9. The student knows exactly what to aim for. Optimization is fast.

The Shocking Finding:
A teacher who is 100% accurate but gives flat scores (low variance) can actually produce a worse student than a teacher who is only 70% accurate but gives spicy, high-variance scores. The "perfect" teacher fails because the student can't figure out how to move.
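To see this trade-off in action, here is a toy simulation (my own illustrative numbers, not the paper's experiments): a "flat" teacher that ranks all four outputs perfectly but with tiny gaps, versus a "spicy" teacher that swaps one pair of outputs but spreads its scores widely. Both students run the same number of policy-gradient steps, and we then grade the final policy against the ground truth:

```python
import numpy as np

# Toy RLHF run: gradient ascent on the teacher's expected reward for a
# softmax policy over 4 outputs, then measure the TRUE quality reached.

true_quality = np.array([0.0, 1.0, 2.0, 3.0])       # ground-truth quality
flat_teacher = np.array([0.50, 0.51, 0.52, 0.53])   # perfect ranking, flat scores
spicy_teacher = np.array([0.0, 2.0, 1.0, 3.0])      # one pair swapped, big gaps

def train(teacher, steps=200, lr=1.0):
    z = np.zeros(4)
    for _ in range(steps):
        p = np.exp(z - z.max())
        p /= p.sum()
        z += lr * p * (teacher - p @ teacher)  # policy-gradient step on the proxy
    p = np.exp(z - z.max())
    p /= p.sum()
    return p @ true_quality  # true expected quality of the final policy

print(train(flat_teacher))   # crawls upward: every gradient step is tiny
print(train(spicy_teacher))  # races toward the best output despite the swap
```

The spicy teacher misranks one pair, yet its student ends up with far higher true quality, because the large score gaps give usable gradients at every step.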

3. The "One Size Does Not Fit All" Rule

The paper also discovered that a teacher who is great for one student might be terrible for another.

  • Student A might be a beginner who needs a teacher who screams "Good job!" and "Bad job!" (High variance) to get motivated.
  • Student B might be an advanced student who needs a teacher who gives very specific, nuanced, and subtle feedback (which might look like low variance to an outsider).

If you take a teacher who works wonders for Student A and hand them to Student B, the student might get confused and stop learning. The "best" teacher depends entirely on the specific student they are teaching.
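The reason is that reward variance is measured under the student's own output distribution, so the same reward model can induce a steep landscape for one policy and a flat one for another. A minimal sketch with invented numbers:

```python
import numpy as np

# The same reward model can look "spicy" to one student and "flat" to
# another: reward variance is taken under the policy's own output
# distribution, so it changes as the student changes.

rewards = np.array([0.0, 0.1, 0.2, 3.0])  # one standout output

def reward_variance(policy_probs, rewards):
    mean = policy_probs @ rewards
    return policy_probs @ (rewards - mean) ** 2

beginner = np.array([0.25, 0.25, 0.25, 0.25])  # spreads mass everywhere
advanced = np.array([0.0, 0.0, 0.02, 0.98])    # already near the best output

print(reward_variance(beginner, rewards))  # large: strong learning signal
print(reward_variance(advanced, rewards))  # small: landscape looks flat
```

For the beginner, the standout score creates high variance and a strong signal; for the advanced student, who already concentrates on the best output, the very same scores yield almost no variance at all.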

4. Real-World Experiments

The researchers tested this with real language models (up to 8 billion parameters).

  • They created "perfect" teachers that were accurate but gave flat scores.
  • They created "imperfect" teachers that gave big, clear score differences.
  • Result: The "imperfect" teachers with high variance helped the language models learn much faster and better than the "perfect" flat teachers. In some cases, using a proxy teacher (the imperfect one) was even better than letting the model try to optimize based on the "ground truth" directly, because the proxy teacher provided a clearer path up the mountain.

Summary: What Makes a Good Teacher?

To summarize the paper in one sentence: A good teacher for AI isn't just the one who is right; it's the one who gives feedback loud and clear enough to be heard.

If you are building AI systems, don't just look for the reward model with the highest accuracy score. Look for the one that creates a steep, clear path (high variance) for the AI to follow, even if it means the teacher isn't 100% perfect. Otherwise, your AI might get stuck on a flat plain, knowing what's right but unable to move toward it.
