Rewards as Labels: Revisiting RLVR from a Classification Perspective

Imagine you are teaching a very smart, but slightly confused, robot to solve math problems. You want it to get better, so you use a method called Reinforcement Learning with Verifiable Rewards (RLVR).

Here's how the old way worked, why it was broken, and how this new paper, "Rewards as Labels" (REAL), fixes it.

The Old Way: The "Volume Knob" Problem (GRPO)

In the previous method (called GRPO), the teacher (the computer algorithm) would ask the robot to solve a math problem 8 times.

If the robot got the answer right, the teacher gave it a "Good Job!" (Reward = 1).
If it got it wrong, the teacher gave it a "Try Again" (Reward = 0).

The teacher then looked at the robot's answers and tried to adjust its brain. But here was the problem: The teacher used the reward as a "Volume Knob" for the robot's confidence.

The "Too Loud" Mistake (Gradient Domination):
Imagine the robot is very confident it got a wrong answer (e.g., it thinks "2+2=5" with 99% certainty). Because it was so confident, the old teacher turned the "Volume Knob" all the way up. The robot's brain got a massive, shocking shock to fix it.
- The Result: The robot panicked. It over-corrected, forgetting everything else it knew, just to fix that one loud mistake. It was like a student who gets yelled at for one wrong answer and forgets how to do the rest of the test.
The "Too Quiet" Mistake (Gradient Misassignment):
Now, imagine the robot got the answer right, but it was unsure about it (e.g., it thought "2+2=4" but only had 50% confidence). The old teacher saw the low confidence and turned the "Volume Knob" down very low.
- The Result: The robot barely felt the "Good Job!" It didn't learn enough to become confident next time. It was like a student getting a quiet "meh" for a great answer, so they didn't bother studying that topic again.

The Summary of the Old Way: The teacher was too harsh on confident mistakes and too gentle on unsure successes. This made learning unstable and inefficient.

The New Way: The "Traffic Light" System (REAL)

The authors of this paper, Zepeng Zhai and his team, said: "Wait a minute. Why are we using rewards as volume knobs? Let's treat them like simple labels, like a Traffic Light."

They proposed a new framework called REAL (Rewards as Labels).

Instead of asking, "How loud should I yell at this answer?", they ask, "Is this answer a Green Light (Good) or a Red Light (Bad)?"

How REAL Works:

Categorization, Not Volume:
The teacher looks at all 8 answers.
- The correct ones are put in the Green Pile.
- The wrong ones are put in the Red Pile.
  The goal is simply to push the Green answers up and the Red answers down. It doesn't matter if the robot was 99% sure or 1% sure; the instruction is the same: "Move Green up, Move Red down."
The "Anchor" (The Safety Net):
To make sure the robot doesn't get confused, the teacher adds a Zero Point (an Anchor).
- If an answer is Green, the teacher says, "You must be above zero."
- If an answer is Red, the teacher says, "You must be below zero."
  This creates a clear, safe boundary. The robot never gets a "shock" that is too big, and it never gets a "pat on the back" that is too weak.

The Magic Analogy: The Balanced Seesaw

Think of the old method as a seesaw where the heavy kid (the confident mistake) pushes the light kid (the unsure success) so hard that the light kid flies off the board.

The new REAL method puts a spring under the seesaw.

If the heavy kid pushes too hard, the spring absorbs the shock so the board doesn't fly up.
If the light kid is too light, the spring gently lifts them up so they don't get stuck.
Result: The seesaw stays balanced, and everyone learns at a steady, safe pace.

Why This Matters (The Results)

The authors tested this on some of the hardest math problems (like the AIME and Olympiad competitions).

Stability: The robot didn't go crazy (entropy collapse) or get confused (entropy explosion). It learned steadily.
Performance: Even with a smaller brain (1.5 billion parameters), the new method beat the old "Volume Knob" method by a huge margin (about 6.7% better).
Scalability: When they used a bigger brain (7 billion parameters), it still won.

The Bottom Line

The paper argues that we don't need complex, loud rewards to teach AI. We just need to clearly label things as "Good" or "Bad" and treat the learning process like a classification problem (sorting things into bins) rather than a regression problem (adjusting a dial).

By doing this, they fixed the "shouting at confident mistakes" and "ignoring unsure successes" issues, making AI training for math and logic much more stable and effective.

Here is a detailed technical summary of the paper "Rewards as Labels: Revisiting RLVR from a Classification Perspective" (REAL).

1. Problem Statement

The paper addresses critical inefficiencies in Reinforcement Learning with Verifiable Rewards (RLVR), specifically within the dominant Group Relative Policy Optimization (GRPO) framework and its variants (e.g., DAPO, GSPO). While GRPO has successfully improved Large Language Models (LLMs) on reasoning tasks, the authors identify two fundamental gradient mismatches that lead to suboptimal policy updates:

Gradient Misassignment in Positives: For correct (positive) rollouts, GRPO assigns disproportionately large updates to tokens that already have high probabilities (easy tokens). Conversely, "hard" positive tokens (low probability under the current policy) receive weak gradients. This results in the model focusing on already optimized regions rather than correcting under-trained components.
Gradient Domination in Negatives: For incorrect (negative) rollouts, the gradient magnitudes in GRPO are unbounded. High-confidence negative tokens (tokens the model is very sure about but are wrong) generate exponentially large gradients that dominate the update, overwhelming contributions from other informative negative samples. This leads to unstable training and sensitivity to outliers.

These issues cause inefficient credit assignment, premature convergence to local optima, and training instability (e.g., entropy collapse or explosion).

2. Methodology: The REAL Framework

The authors propose REAL (Rewards as Labels), a novel framework that fundamentally rethinks policy optimization by shifting from a scalar reward weighting perspective to a classification perspective.

Core Conceptual Shift

Instead of treating verifiable rewards ( $r \in \{0, 1\}$ ) as scalar weights to scale gradients, REAL treats them as categorical labels. The objective is reformulated as a classification task: distinguishing between desirable (positive) and undesirable (negative) rollouts.

Technical Components

Logit Definition:
For a rollout $o_k$ , the model computes a length-normalized relative log-probability score (logit) $\bar{s}_k$ :
$\bar{s}_k = \frac{1}{|o_k|} \sum_{t=1}^{|o_k|} \log \frac{\pi_\theta(o_{k,t} | q)}{\pi_{old}(o_{k,t} | q)}$
This score measures the relative change in probability under the new policy compared to the old policy.
Anchor Logits:
To prevent ambiguous update directions (e.g., both positive and negative scores decreasing), REAL introduces a fixed anchor logit at 0.
- For positive samples, 0 acts as a negative logit.
- For negative samples, 0 acts as a positive logit.
  This enforces a clear separation: positive scores must be $>0$ and negative scores $<0$ .
Loss Function:
REAL utilizes a unified Softmax Cross-Entropy loss (specifically a variant of the Circle Loss or similar margin-based losses) to separate the sets of positive logits ( $S_+$ ) and negative logits ( $S_-$ ) relative to the anchor ( $S_0$ ).
$\mathcal{L}_{REAL} = \mathcal{L}_{CE}(S_+, S_0) + \mathcal{L}_{CE}(S_0, S_-)$
Where $\mathcal{L}_{CE}$ is defined as $\log(1 + \sum e^{z_{neg} - z_{pos}})$ . This formulation naturally balances the gradient allocation.
Theoretical Properties (Gradient Analysis):
The paper proves that REAL induces a monotonic and bounded gradient weighting.
- The gradient magnitude is upper-bounded by $1/\tau $(where$ \tau$ is a temperature parameter).
- For positives, the gradient decreases as the relative log-probability increases (preventing over-updating of easy tokens).
- For negatives, the gradient is strictly bounded, preventing outlier domination.
- Crucially: This bounded nature acts as an implicit, adaptive gradient clipping, eliminating the need for explicit KL-divergence penalties often required in GRPO to stabilize training.

3. Key Contributions

Identification of Gradient Mismatches: The paper formally identifies and analyzes "Gradient Misassignment in Positives" and "Gradient Domination in Negatives" as the root causes of instability in GRPO-style methods.
REAL Framework: Proposes a classification-based reformulation of RLVR that treats rewards as labels, utilizing anchor logits and softmax cross-entropy to achieve balanced gradient allocation.
Theoretical Guarantees: Demonstrates that REAL provides bounded, monotonic gradient weighting, ensuring stable training without explicit KL penalties.
Empirical Superiority: Extensive experiments show REAL consistently outperforms GRPO, DAPO, and GSPO across multiple model scales and benchmarks.

4. Experimental Results

The authors evaluated REAL on mathematical reasoning benchmarks (AIME 2024/2025, MATH 500, AMC 2023, Minerva, Olympiad Bench) using DeepSeek-R1-Distill-Qwen models (1.5B and 7B parameters).

1.5B Model Performance:
- REAL improved the average Pass@1 by 6.7% over DAPO and 9.5% over GRPO.
- It outperformed the strong baseline GSPO as well.
7B Model Performance:
- REAL continued to outperform DAPO by 6.2% and GSPO by 1.7%.
- Achieved an average Pass@1 of 63.2%.
Training Stability:
- Unlike GRPO (which suffers from entropy collapse) and DAPO (entropy explosion), REAL maintained stable entropy throughout training.
- It achieved competitive performance without any explicit KL penalty, confirming the efficacy of its implicit gradient clipping.
Ablation Studies:
- Removing Anchor Logits slightly reduced performance but remained stable.
- Using Binary Cross-Entropy (BCE) instead of Softmax resulted in lower performance (50.4% vs 52.6%), highlighting the importance of the group-wise competition inherent in the softmax formulation.
- The temperature parameter $\tau=0.5$ was found optimal; lower values ( $\tau=0.1$ ) caused instability due to unbounded gradients.

5. Significance

Paradigm Shift: The paper challenges the prevailing view of RLVR as a scalar reward optimization problem, successfully reframing it as a classification task. This offers a more principled approach to credit assignment.
Stability and Efficiency: By solving the gradient mismatch issues, REAL enables more stable training dynamics, reducing the risk of model collapse and the need for complex hyperparameter tuning (like KL penalties).
Scalability: The method scales effectively from 1.5B to 7B models, suggesting it is a robust foundation for future large-scale reasoning models.
Simplicity: Despite its theoretical depth, the implementation relies on standard classification losses, making it easy to integrate into existing RLVR pipelines.

In conclusion, REAL provides a stable, efficient, and theoretically grounded alternative to GRPO, significantly advancing the state-of-the-art in training LLMs for complex reasoning tasks via verifiable rewards.