ExGRPO: Learning to Reason from Experience

ExGRPO is a framework that makes reinforcement learning from verifiable rewards more efficient and stable. It identifies and prioritizes valuable rollout experiences based on correctness and entropy, which improves reasoning performance across a range of large language models compared with standard on-policy methods.

Runzhe Zhan, Yafu Li, Zhi Wang, Xiaoye Qu, Dongrui Liu, Jing Shao, Derek F. Wong, Yu Cheng

Published 2026-03-03

Imagine you are trying to teach a brilliant but inexperienced student how to solve complex math puzzles. You give them a problem, they try to solve it, and you check the answer.

The Old Way (Standard RL):
In the traditional method, the student tries to solve the problem. If they get it right, you say "Good job!" and they learn a tiny bit. If they get it wrong, you say "Try again."
The problem? After that single attempt, you throw the student's entire thought process into the trash. Even if they made a great logical step halfway through before messing up the final calculation, that brilliant thinking is lost forever. You make them start from scratch every time. This is incredibly wasteful, like reading a book, learning one sentence, and then burning the book before reading the next page.

The New Way (ExGRPO):
The paper introduces ExGRPO, which is like giving that student a smart, organized notebook instead of a trash can.

Here is how ExGRPO works, broken down into simple analogies:

1. The "Goldilocks" Filter (Choosing the Right Problems)

Imagine you have a pile of homework problems ranging from "How to tie your shoes" (too easy) to "Quantum Physics" (too hard).

  • Too Easy: The student solves it instantly without thinking. They don't learn anything new.
  • Too Hard: The student gives up immediately or guesses randomly. They get frustrated and learn nothing.
  • Just Right: The student struggles a little, thinks hard, and eventually figures it out. This is where the magic happens.

ExGRPO acts like a strict but fair teacher. It looks at the student's past attempts and only keeps the "Just Right" problems in the notebook. It throws away the ones that were too easy or the ones where the student was completely lost. This ensures the student spends their time practicing the things that will actually make them smarter.
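The "Goldilocks" idea above can be sketched in a few lines of Python. This is only an illustration, not the paper's implementation: the function name `select_medium_difficulty` and the cutoffs `low` and `high` are assumptions made for this example, and difficulty is approximated by the fraction of past attempts that were correct.

```python
from typing import Dict, List


def select_medium_difficulty(
    rollout_results: Dict[str, List[bool]],
    low: float = 0.25,   # hypothetical cutoff: below this, "too hard"
    high: float = 0.75,  # hypothetical cutoff: above this, "too easy"
) -> Dict[str, float]:
    """Keep questions whose observed success rate lies in [low, high]."""
    kept: Dict[str, float] = {}
    for question, outcomes in rollout_results.items():
        if not outcomes:
            continue  # no attempts yet, so nothing to judge
        accuracy = sum(outcomes) / len(outcomes)
        if low <= accuracy <= high:
            kept[question] = accuracy
    return kept


# Toy data: True means the attempt reached the correct answer.
results = {
    "tie_shoes": [True] * 8,    # always solved: too easy
    "quantum": [False] * 8,     # never solved: too hard
    "algebra": [True, False, True, False, True, False, False, True],
}
notebook = select_medium_difficulty(results)
```

With this toy data, only "algebra" (solved in half of its attempts) stays in the notebook.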

2. The "Clear Thinking" Detector (Entropy)

Sometimes, a student might get the right answer by accident, or by writing a messy, confusing paragraph that happens to contain the right number.

  • High Entropy (Messy Thinking): The student is rambling, trying 10 different random paths, and getting confused. Even if they get the right answer, their reasoning is shaky.
  • Low Entropy (Clear Thinking): The student's thoughts are calm, direct, and logical. They know exactly what they are doing.

ExGRPO has a special sensor that detects "messy thinking." If the student's notebook entry is full of rambling and confusion (high entropy), the system says, "Nope, this isn't a good example to learn from." It only saves the entries where the student thought clearly and logically. This prevents the student from learning bad habits or "lucky guesses."
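To make "messy" versus "clear" thinking concrete, here is a minimal sketch that scores an attempt by its average per-token entropy and picks the lowest-scoring one. It assumes each attempt is available as a list of next-token probability distributions. The helper names (`token_entropy`, `mean_entropy`, `pick_clearest`) and the toy numbers are hypothetical, not taken from the paper.

```python
import math
from typing import Dict, List, Sequence


def token_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def mean_entropy(token_dists: List[Sequence[float]]) -> float:
    """Average entropy over all tokens of one attempt."""
    if not token_dists:
        raise ValueError("an attempt needs at least one token")
    return sum(token_entropy(d) for d in token_dists) / len(token_dists)


def pick_clearest(candidates: Dict[str, List[Sequence[float]]]) -> str:
    """Return the name of the attempt with the lowest average entropy."""
    return min(candidates, key=lambda name: mean_entropy(candidates[name]))


# Two correct attempts at the same problem (made-up distributions).
candidates = {
    "rambling": [[0.25, 0.25, 0.25, 0.25], [0.5, 0.5]],
    "direct": [[0.97, 0.01, 0.01, 0.01], [0.9, 0.1]],
}
best = pick_clearest(candidates)
```

A uniform distribution such as `[0.25, 0.25, 0.25, 0.25]` means the model was unsure which token to write next, so the "rambling" attempt scores higher and the "direct" one is kept.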

3. The "Smart Notebook" (Experience Replay)

Instead of throwing away the "Just Right" and "Clear Thinking" attempts, ExGRPO puts them in a Smart Notebook.

  • The Mix: When it's time to study, the teacher doesn't just give the student new problems. They give them a mix: 50% new problems to explore, and 50% problems from the notebook to review.
  • The Benefit: By reviewing the "Golden" past attempts, the student reinforces what they already know and builds on it. They don't have to re-learn the basics every day. This makes the learning process much faster and more stable.
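The 50/50 study mix described above can be sketched as a simple batch builder. This is a rough illustration under stated assumptions: the function `build_mixed_batch`, its `replay_ratio` parameter, and the uniform random sampling are choices made for this example and are not claimed to match the paper's procedure.

```python
import random
from typing import List, Optional, Sequence


def build_mixed_batch(
    new_problems: Sequence[str],
    notebook: Sequence[str],
    batch_size: int,
    replay_ratio: float = 0.5,  # share of the batch drawn from the notebook
    seed: Optional[int] = None,
) -> List[str]:
    """Combine fresh problems with replayed notebook entries."""
    rng = random.Random(seed)
    # Never ask for more replayed items than the notebook holds.
    n_replay = min(int(batch_size * replay_ratio), len(notebook))
    n_new = min(batch_size - n_replay, len(new_problems))
    replayed = rng.sample(list(notebook), n_replay)
    fresh = rng.sample(list(new_problems), n_new)
    return fresh + replayed


batch = build_mixed_batch(
    new_problems=[f"new_{i}" for i in range(10)],
    notebook=[f"old_{i}" for i in range(10)],
    batch_size=8,
    seed=0,
)
```

If the notebook is still small early in training, the sketch simply fills the remaining slots with new problems, so the student is never short of material.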

Why This Matters

The paper reports that this method helps both ends of the class:

  • Weak Students: It stops them from getting stuck in a loop of failure. By reviewing the few correct answers they reached early on, they gain confidence and learn faster.
  • Strong Students: It helps them reach higher levels of reasoning by focusing on the most valuable, challenging problems rather than wasting time on easy ones.

In a Nutshell:
ExGRPO is like upgrading from a "Try, Fail, Forget" system to a "Try, Analyze, Organize, and Review" system. It teaches the AI to be a better student by curating its own best moments of learning, ensuring it only practices the problems that matter and thinks clearly while doing so.
