Quantile Advantage Estimation: Stabilizing RLVR for LLM Reasoning

This paper proposes Quantile Advantage Estimation (QAE), a method that replaces the mean baseline in value-free RL with a group-wise K-quantile baseline. By preventing both entropy collapse and entropy explosion, QAE stabilizes training and achieves sustained reasoning improvements on mathematical benchmarks.

Junkang Wu, Kexin Huang, Jiancan Wu, An Zhang, Xiang Wang, Xiangnan He

Published 2026-03-03

The Big Picture: Teaching a Robot to Solve Math

Imagine you are training a robot (a Large Language Model) to solve difficult math problems. You give it a problem, it tries to solve it, and you give it a "thumbs up" (reward) if it gets the answer right, or a "thumbs down" if it fails.

The goal is to teach the robot to think better. However, the researchers found that the current way of teaching these robots is broken. It's like trying to teach a student by only looking at their average test score.

The Problem: The "Average" Trap

In current methods (like GRPO and DAPO), the robot generates several answers for the same question. The system calculates the average score of all those answers and uses it as the baseline: each answer is graded by how far it falls above or below that average.
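As a rough sketch (not the authors' code), the mean-baseline scoring used by GRPO-style methods looks like this; the function name, group size, and normalization constant are illustrative:

```python
import numpy as np

def mean_baseline_advantages(rewards, eps=1e-6):
    """GRPO-style group advantages (illustrative sketch).

    Every response in the group is scored against the group average,
    so on a hard question (mostly failures) every failure falls below
    the baseline and receives a negative advantage.
    """
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)  # GRPO also divides by the group std

# 1-in-10 success: the lone winner is pushed up,
# all nine failures are pushed down (negative advantage).
adv = mean_baseline_advantages([1, 0, 0, 0, 0, 0, 0, 0, 0, 0])
```

Note that every answer in the group receives a non-zero gradient signal here; that dense, average-anchored feedback is exactly what the paper identifies as the source of instability.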

The Analogy: The Classroom of 10 Students
Imagine a classroom where 9 students fail a hard math test, and 1 student gets it right.

  • The Old Way (Mean Baseline): The teacher calculates the class average. It's very low.
    • The 9 students who failed get a "thumbs down" (negative score).
    • The 1 student who succeeded gets a "thumbs up."
    • The Flaw: Because the average is so low, the system thinks everyone did poorly compared to the "average." It punishes the 9 students too harshly, even if some of them were close to the right answer. This causes the robot to get confused, stop exploring new ideas, and eventually give up (Entropy Collapse).

The Other Problem: The "Wild Party"
Sometimes, the opposite happens. If the robot gets a few lucky correct answers, the average goes up. Suddenly, the system thinks everyone is doing great, even the ones who failed. It stops punishing the failures. The robot starts guessing wildly, generating nonsense just to be different (Entropy Explosion).

The Result: The training oscillates between the robot being too scared to try anything new (Collapse) and the robot going crazy with random guesses (Explosion).

The Solution: The "Top 40%" Rule (QAE)

The authors propose a new method called Quantile Advantage Estimation (QAE). Instead of using the average score as the benchmark, they use a threshold based on a specific percentile (like the top 40% or bottom 40%).

The Analogy: The Sports Coach
Imagine a coach who doesn't care about the team's average score. Instead, the coach sets a specific line: "If you are in the top 40% of performance, you get a reward. If you are in the bottom 60%, you get a penalty."

But here is the magic trick: The line moves depending on how hard the question is.

  1. On Hard Questions (The "Struggle" Phase):

    • Imagine a question is so hard that only 1 out of 10 attempts is correct.
    • Old Way: The average is low. The 1 winner is pushed up, and all 9 losers are pushed down; most of the learning signal is punishment.
    • QAE Way: The coach says, "This question is hard! Let's lower the bar. We only care about the rare success."
    • The 1 winner gets a huge "Thumbs Up!" The 9 losers? They get ignored. They get a score of zero. The coach doesn't punish them for failing a hard question; they just don't get a reward.
    • Result: The robot learns to focus on finding that one rare success without being discouraged by the failures.
  2. On Easy Questions (The "Mastery" Phase):

    • Imagine a question is easy, and 9 out of 10 attempts are correct.
    • Old Way: The average is high. The 9 winners get a reward; the 1 loser gets a penalty.
    • QAE Way: The coach says, "This question is easy! We don't need to reward the winners anymore. Let's focus on the remaining failures."
    • The 9 winners? They get ignored. They get a score of zero. The 1 loser gets a "Thumbs Down."
    • Result: The robot learns to stop making silly mistakes on easy questions, rather than getting complacent.
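The two cases above can be captured in a few lines. This is a hedged sketch of the quantile-baseline idea with binary (right/wrong) rewards; the function name and the quantile level `tau` are illustrative choices, not taken from the paper:

```python
import numpy as np

def quantile_baseline_advantages(rewards, tau=0.6):
    """Quantile-baseline advantages (illustrative sketch).

    Instead of the group mean, the baseline is a group-wise quantile.
    With binary rewards this acts as a difficulty-aware gate:
      - hard question (rare success): the quantile is 0, so failures
        get exactly 0 advantage and only the rare winner is reinforced;
      - easy question (rare failure): the quantile is 1, so successes
        get exactly 0 advantage and only the rare loser is penalized.
    """
    r = np.asarray(rewards, dtype=float)
    return r - np.quantile(r, tau)  # group-wise quantile baseline

hard = quantile_baseline_advantages([1, 0, 0, 0, 0, 0, 0, 0, 0, 0])
# lone winner: +1; the nine failures: 0 (ignored, not punished)

easy = quantile_baseline_advantages([1, 1, 1, 1, 1, 1, 1, 1, 1, 0])
# nine winners: 0 (ignored); the lone failure: -1
```

The moving bar falls out of the math for free: the same fixed quantile lands at 0 on hard questions and at 1 on easy ones, so no per-question difficulty tuning is needed.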

Why This is a Game Changer

1. The "80/20" Rule of Learning
The paper found that with this new method, 80% of the robot's attempts get a score of zero. They are neither rewarded nor punished.

  • Why is this good? It's like a teacher who stops grading every single homework problem. Instead, they only give feedback on the specific problems where the student is either struggling to find a solution or making a careless mistake.
  • This saves energy and focuses the robot's learning on the most important moments.
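The sparsity claim is easy to illustrate with a quantile baseline over binary rewards (a sketch with made-up group sizes; `tau` is an illustrative quantile level, not the paper's reported setting):

```python
import numpy as np

def frac_zero_advantage(rewards, tau=0.6):
    """Fraction of a group that receives exactly zero advantage
    under a quantile baseline (illustrative sketch)."""
    r = np.asarray(rewards, dtype=float)
    adv = r - np.quantile(r, tau)
    return float(np.mean(adv == 0.0))

hard = [1] + [0] * 9   # 1-in-10 success: only the winner is graded
easy = [1] * 9 + [0]   # 9-in-10 success: only the loser is graded
print(frac_zero_advantage(hard))  # 0.9
print(frac_zero_advantage(easy))  # 0.9
```

In both toy groups, 90% of the responses receive no gradient at all, in the same spirit as the roughly 80% figure the paper reports across real training prompts.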

2. Preventing the "Panic" and the "Wild Party"
By ignoring the "middle" attempts (the ones that are just okay), the robot stays in a "Goldilocks Zone."

  • It doesn't panic and stop exploring (Collapse) because it's not constantly being punished for failing hard questions.
  • It doesn't go wild (Explosion) because it's not constantly being rewarded for doing easy things.

The Bottom Line

The paper argues that the secret to making AI smarter isn't just tweaking the fine details of how it writes words (token-level). It's about how we grade the answers.

  • Old Method: "Here is the class average. You are above it? Good. Below it? Bad." (This causes chaos).
  • New Method (QAE): "If the question is hard, we only celebrate the winners. If the question is easy, we only correct the losers. Everyone else? Keep practicing, but we aren't grading you today."

This simple change stabilizes the training, prevents the AI from getting confused or lazy, and leads to much better performance on complex math and reasoning tasks.
