Quantile Advantage Estimation: Stabilizing RLVR for LLM Reasoning

This paper proposes Quantile Advantage Estimation (QAE), a method that replaces the mean baseline in value-free RL with a group-wise K-quantile baseline. By preventing both entropy collapse and entropy explosion, QAE stabilizes training and achieves sustained reasoning improvements on mathematical benchmarks.

Junkang Wu, Kexin Huang, Jiancan Wu, An Zhang, Xiang Wang, Xiangnan He

Published 2026-03-03

The Big Picture: Teaching a Robot to Solve Math

Imagine you are training a robot (a Large Language Model) to solve difficult math problems. You give it a problem, it tries to solve it, and you give it a "thumbs up" (reward) if it gets the answer right, or a "thumbs down" if it fails.

The goal is to teach the robot to think better. However, the researchers found that the current way of teaching these robots is broken. It's like trying to teach a student by only looking at their average test score.

The Problem: The "Average" Trap

In current methods (like GRPO and DAPO), the robot generates several answers for the same question. The system calculates the average score of all those answers and uses it as the baseline: each answer is graded by how far it falls above or below that average.
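As a rough sketch (not the authors' code), the mean-baseline scoring used by GRPO-style methods looks like this; the function name, group size, and normalization constant are illustrative:

```python
import numpy as np

def mean_baseline_advantages(rewards, eps=1e-6):
    """GRPO-style group advantages (illustrative sketch).

    Every response in the group is scored against the group average,
    so on a hard question (mostly failures) every failure falls below
    the baseline and receives a negative advantage.
    """
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)  # GRPO also divides by the group std

# 1-in-10 success: the lone winner is pushed up,
# all nine failures are pushed down (negative advantage).
adv = mean_baseline_advantages([1, 0, 0, 0, 0, 0, 0, 0, 0, 0])
```

Note that every answer in the group receives a non-zero gradient signal here; that dense, average-anchored feedback is exactly what the paper identifies as the source of instability.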

The Analogy: The Classroom of 10 Students
Imagine a classroom where 9 students fail a hard math test, and 1 student gets it right.

  • The Old Way (Mean Baseline): The teacher calculates the class average. It's very low.
    • The 9 students who failed get a "thumbs down" (negative score).
    • The 1 student who succeeded gets a "thumbs up."
    • The Flaw: Because the average is so low, the system thinks everyone did poorly compared to the "average." It punishes the 9 students too harshly, even if some of them were close to the right answer. This causes the robot to get confused, stop exploring new ideas, and eventually give up (Entropy Collapse).

The Other Problem: The "Wild Party"
Sometimes, the opposite happens. If the robot gets a few lucky correct answers, the average goes up. Suddenly, the system thinks everyone is doing great, even the ones who failed. It stops punishing the failures. The robot starts guessing wildly, generating nonsense just to be different (Entropy Explosion).

The Result: The training oscillates between the robot being too scared to try anything new (Collapse) and the robot going crazy with random guesses (Explosion).

The Solution: The "Top 40%" Rule (QAE)

The authors propose a new method called Quantile Advantage Estimation (QAE). Instead of using the average score as the benchmark, they use a threshold based on a specific percentile (like the top 40% or bottom 40%).

The Analogy: The Sports Coach
Imagine a coach who doesn't care about the team's average score. Instead, the coach sets a specific line: "If you are in the top 40% of performance, you get a reward. If you are in the bottom 60%, you get a penalty."

But here is the magic trick: The line moves depending on how hard the question is.

  1. On Hard Questions (The "Struggle" Phase):

    • Imagine a question is so hard that only 1 out of 10 attempts is correct.
    • Old Way: The average is low. The 1 winner is pushed up, and all 9 losers are pushed down; most of the learning signal is punishment.
    • QAE Way: The coach says, "This question is hard! Let's lower the bar. We only care about the rare success."
    • The 1 winner gets a huge "Thumbs Up!" The 9 losers? They get ignored. They get a score of zero. The coach doesn't punish them for failing a hard question; they just don't get a reward.
    • Result: The robot learns to focus on finding that one rare success without being discouraged by the failures.
  2. On Easy Questions (The "Mastery" Phase):

    • Imagine a question is easy, and 9 out of 10 attempts are correct.
    • Old Way: The average is high. The 9 winners get a reward; the 1 loser gets a penalty.
    • QAE Way: The coach says, "This question is easy! We don't need to reward the winners anymore. Let's focus on the remaining failures."
    • The 9 winners? They get ignored. They get a score of zero. The 1 loser gets a "Thumbs Down."
    • Result: The robot learns to stop making silly mistakes on easy questions, rather than getting complacent.
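The two cases above can be captured in a few lines. This is a hedged sketch of the quantile-baseline idea with binary (right/wrong) rewards; the function name and the quantile level `tau` are illustrative choices, not taken from the paper:

```python
import numpy as np

def quantile_baseline_advantages(rewards, tau=0.6):
    """Quantile-baseline advantages (illustrative sketch).

    Instead of the group mean, the baseline is a group-wise quantile.
    With binary rewards this acts as a difficulty-aware gate:
      - hard question (rare success): the quantile is 0, so failures
        get exactly 0 advantage and only the rare winner is reinforced;
      - easy question (rare failure): the quantile is 1, so successes
        get exactly 0 advantage and only the rare loser is penalized.
    """
    r = np.asarray(rewards, dtype=float)
    return r - np.quantile(r, tau)  # group-wise quantile baseline

hard = quantile_baseline_advantages([1, 0, 0, 0, 0, 0, 0, 0, 0, 0])
# lone winner: +1; the nine failures: 0 (ignored, not punished)

easy = quantile_baseline_advantages([1, 1, 1, 1, 1, 1, 1, 1, 1, 0])
# nine winners: 0 (ignored); the lone failure: -1
```

The moving bar falls out of the math for free: the same fixed quantile lands at 0 on hard questions and at 1 on easy ones, so no per-question difficulty tuning is needed.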

Why This is a Game Changer

1. The "80/20" Rule of Learning
The paper found that with this new method, 80% of the robot's attempts get a score of zero. They are neither rewarded nor punished.

  • Why is this good? It's like a teacher who stops grading every single homework problem. Instead, they only give feedback on the specific problems where the student is either struggling to find a solution or making a careless mistake.
  • This saves energy and focuses the robot's learning on the most important moments.
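The sparsity claim is easy to illustrate with a quantile baseline over binary rewards (a sketch with made-up group sizes; `tau` is an illustrative quantile level, not the paper's reported setting):

```python
import numpy as np

def frac_zero_advantage(rewards, tau=0.6):
    """Fraction of a group that receives exactly zero advantage
    under a quantile baseline (illustrative sketch)."""
    r = np.asarray(rewards, dtype=float)
    adv = r - np.quantile(r, tau)
    return float(np.mean(adv == 0.0))

hard = [1] + [0] * 9   # 1-in-10 success: only the winner is graded
easy = [1] * 9 + [0]   # 9-in-10 success: only the loser is graded
print(frac_zero_advantage(hard))  # 0.9
print(frac_zero_advantage(easy))  # 0.9
```

In both toy groups, 90% of the responses receive no gradient at all, in the same spirit as the roughly 80% figure the paper reports across real training prompts.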

2. Preventing the "Panic" and the "Wild Party"
By ignoring the "middle" attempts (the ones that are just okay), the robot stays in a "Goldilocks Zone."

  • It doesn't panic and stop exploring (Collapse) because it's not constantly being punished for failing hard questions.
  • It doesn't go wild (Explosion) because it's not constantly being rewarded for doing easy things.

The Bottom Line

The paper argues that the secret to making AI smarter isn't just tweaking the fine details of how it writes words (token-level). It's about how we grade the answers.

  • Old Method: "Here is the class average. You are above it? Good. Below it? Bad." (This causes chaos).
  • New Method (QAE): "If the question is hard, we only celebrate the winners. If the question is easy, we only correct the losers. Everyone else? Keep practicing, but we aren't grading you today."

This simple change stabilizes the training, prevents the AI from getting confused or lazy, and leads to much better performance on complex math and reasoning tasks.
