ExGRPO: Learning to Reason from Experience

Imagine you are trying to teach a brilliant but inexperienced student how to solve complex math puzzles. You give them a problem, they try to solve it, and you check the answer.

The Old Way (Standard RL):
In the traditional method, the student tries to solve the problem. If they get it right, you say "Good job!" and they learn a tiny bit. If they get it wrong, you say "Try again."
The problem? After that single attempt, you throw the student's entire thought process into the trash. Even if they made a great logical step halfway through before messing up the final calculation, that brilliant thinking is lost forever. You make them start from scratch every time. This is incredibly wasteful, like reading a book, learning one sentence, and then burning the book before reading the next page.

The New Way (ExGRPO):
The paper introduces ExGRPO, which is like giving that student a smart, organized notebook instead of a trash can.

Here is how ExGRPO works, broken down into simple analogies:

1. The "Goldilocks" Filter (Choosing the Right Problems)

Imagine you have a pile of homework problems ranging from "How to tie your shoes" (too easy) to "Quantum Physics" (too hard).

Too Easy: The student solves it instantly without thinking. They don't learn anything new.
Too Hard: The student gives up immediately or guesses randomly. They get frustrated and learn nothing.
Just Right: The student struggles a little, thinks hard, and eventually figures it out. This is where the magic happens.

ExGRPO acts like a strict but fair teacher. It looks at the student's past attempts and only keeps the "Just Right" problems in the notebook. It throws away the ones that were too easy or the ones where the student was completely lost. This ensures the student spends their time practicing the things that will actually make them smarter.

2. The "Clear Thinking" Detector (Entropy)

Sometimes, a student might get the right answer by accident, or by writing a messy, confusing paragraph that happens to contain the right number.

High Entropy (Messy Thinking): The student is rambling, trying 10 different random paths, and getting confused. Even if they get the right answer, their reasoning is shaky.
Low Entropy (Clear Thinking): The student's thoughts are calm, direct, and logical. They know exactly what they are doing.

ExGRPO has a special sensor that detects "messy thinking." If the student's notebook entry is full of rambling and confusion (high entropy), the system says, "Nope, this isn't a good example to learn from." It only saves the entries where the student thought clearly and logically. This prevents the student from learning bad habits or "lucky guesses."

3. The "Smart Notebook" (Experience Replay)

Instead of throwing away the "Just Right" and "Clear Thinking" attempts, ExGRPO puts them in a Smart Notebook.

The Mix: When it's time to study, the teacher doesn't just give the student new problems. They give them a mix: 50% new problems to explore, and 50% problems from the notebook to review.
The Benefit: By reviewing the "Golden" past attempts, the student reinforces what they already know and builds on it. They don't have to re-learn the basics every day. This makes the learning process much faster and more stable.

Why This Matters

The paper shows that this method works wonders, especially for:

Weak Students: It stops them from getting stuck in a loop of failure. By reviewing their few "lucky hits" (correct answers they got early on), they gain confidence and learn faster.
Strong Students: It helps them reach higher levels of reasoning by focusing on the most valuable, challenging problems rather than wasting time on easy ones.

In a Nutshell:
ExGRPO is like upgrading from a "Try, Fail, Forget" system to a "Try, Analyze, Organize, and Review" system. It teaches the AI to be a better student by curating its own best moments of learning, ensuring it only practices the problems that matter and thinks clearly while doing so.

1. Problem Statement

The paper addresses a critical inefficiency in Reinforcement Learning from Verifiable Rewards (RLVR) for Large Reasoning Models (LRMs).

The Bottleneck: Standard on-policy RLVR algorithms (like GRPO) generate rollouts (trajectories) to compute gradients but discard them after a single update. This wastes computation and can destabilize training, because the model never gets to learn from its own past successful explorations.
The Gap: While Experience Replay is a standard technique in general RL, its application to RLVR for reasoning models is underexplored. Crucially, existing methods often treat all stored experiences equally, failing to account for the fact that not all past trajectories are equally valuable. Some "lucky hits" or high-entropy trajectories may contain flawed reasoning that, if replayed, could degrade performance (a "snowball effect").
Core Question: How can we identify, manage, and replay valuable reasoning experiences to maximize sample efficiency and stabilize training without introducing bias or instability?

2. Methodology: ExGRPO

The authors propose ExGRPO (Experiential Group Relative Policy Optimization), a framework that integrates structured experience replay with a principled selection mechanism. The method operates in two main phases:

A. Experience Management (Selection & Partitioning)

Instead of random replay, ExGRPO filters experiences based on two key metrics identified through preliminary analysis:

Rollout Correctness (Difficulty): Experiences are partitioned into buckets based on the online correctness rate of the question ($Acc(q)$).
- Insight: The authors found that medium-difficulty questions (where the model has a partial success rate, e.g., 25%–75%) provide the most valuable learning signals. Easy questions offer little new information, while hard questions often yield low-quality "lucky" solutions.
- Mechanism: A Retired Set removes questions where the model has achieved 100% success to prevent overfitting on mastered tasks. The replay buffer is sampled using a Gaussian distribution centered at 0.5 (medium difficulty) to prioritize the "sweet spot."
Trajectory Entropy (Quality): For a selected question, ExGRPO replays the stored trajectory with the lowest entropy.
- Insight: Low-entropy trajectories correlate with logically valid Chain-of-Thought (CoT) reasoning. High-entropy trajectories often indicate uncertainty or "hallucinated" reasoning (e.g., unnecessary code generation) that reaches a correct answer by chance through flawed logic.
- Mechanism: The system selects the lowest-entropy candidate from the stored successful rollouts for that question, ensuring high-quality CoT is prioritized.
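The selection steps above can be sketched in a few lines of Python. This is an illustrative sketch rather than the authors' implementation: the names `Trajectory`, `QuestionRecord`, `gaussian_weight`, and `select_experiences` are hypothetical, the value of `sigma` is an assumption, and questions are weighted directly by their correctness rate instead of being grouped into buckets first.

```python
import math
import random
from dataclasses import dataclass, field


@dataclass
class Trajectory:
    text: str
    entropy: float  # assumed: average token entropy of the rollout


@dataclass
class QuestionRecord:
    question: str
    n_correct: int
    n_rollouts: int
    successes: list = field(default_factory=list)  # stored correct rollouts

    @property
    def acc(self) -> float:
        # Online correctness rate Acc(q) of the question.
        return self.n_correct / self.n_rollouts


def gaussian_weight(acc, mu=0.5, sigma=0.2):
    # Unnormalized Gaussian density centered on medium difficulty.
    # sigma=0.2 is an illustrative value, not taken from the paper.
    return math.exp(-((acc - mu) ** 2) / (2 * sigma ** 2))


def select_experiences(records, k, rng=None):
    if rng is None:
        rng = random.Random(0)
    # Retired set: drop fully solved questions; also skip questions
    # that have no stored successful rollout to replay.
    pool = [r for r in records if r.acc < 1.0 and r.successes]
    if not pool:
        return []
    weights = [gaussian_weight(r.acc) for r in pool]
    # Sampling is with replacement, for simplicity.
    chosen = rng.choices(pool, weights=weights, k=k)
    # For each sampled question, replay its lowest-entropy success.
    return [(r.question, min(r.successes, key=lambda t: t.entropy))
            for r in chosen]
```

With these weights, a question answered correctly half the time is the most likely to be drawn, while questions at 25% and 75% correctness are equally likely to each other and less likely than the midpoint.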

B. Experiential Policy Optimization

ExGRPO unifies on-policy exploration and off-policy replay under a joint objective:

Mixed-Policy Objective: Each mini-batch consists of fresh on-policy samples ($B_{on}$) and selected replayed samples ($B_{exp}$), with the replayed share controlled by a ratio $\rho$ (set to 50%).
Importance Weighting: To correct for the distribution shift between the current policy ($\pi_\theta$) and the policy that generated the replayed trajectory ($\pi_{\theta_{past}}$), the replayed trajectory is reweighted using per-token importance ratios.
Policy Shaping: To prevent the model from collapsing into exploitation (ignoring exploration), the importance weight for replayed trajectories is transformed via a non-linear function $f(w) = \frac{w}{w+\beta}$. Because $f$ saturates toward 1 as $w$ grows, this dampens high-probability signals and relatively amplifies low-probability ones, encouraging the model to learn novel aspects of the experience.
Delayed Start: Replay is activated only after the model reaches a baseline Pass@1 threshold, so that the replay buffer is populated with high-quality data from the outset.
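The pieces above can be made concrete with a small sketch. It is not the authors' code: the function names are hypothetical, `beta=0.1` is an assumed value, and the log-probabilities are taken as already computed. The derivative $f'(w) = \beta/(w+\beta)^2$ follows directly from the formula for $f$ and shows why small weights receive relatively more gradient.

```python
import math


def importance_ratios(logp_current, logp_past):
    # Per-token ratio w_t = pi_theta(o_t) / pi_theta_past(o_t),
    # computed from per-token log-probabilities.
    return [math.exp(c - p) for c, p in zip(logp_current, logp_past)]


def shape(w, beta=0.1):
    # Policy shaping f(w) = w / (w + beta); beta=0.1 is an assumption.
    return w / (w + beta)


def shape_grad(w, beta=0.1):
    # Derivative f'(w) = beta / (w + beta)^2, which shrinks as w grows.
    return beta / (w + beta) ** 2


def split_batch(batch_size, rho=0.5):
    # Number of on-policy and replayed samples in one mini-batch.
    n_exp = int(round(batch_size * rho))
    return batch_size - n_exp, n_exp


if __name__ == "__main__":
    n_on, n_exp = split_batch(8)
    ratios = importance_ratios([-1.2, -0.3], [-0.9, -0.4])
    print(n_on, n_exp, [shape(w) for w in ratios])
```

Since `shape` is bounded above by 1, a replayed token cannot dominate an update through a very large importance ratio.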

3. Key Contributions

First Systematic Analysis of Experience Value: The paper is the first to empirically identify rollout correctness and trajectory entropy as effective proxies for experience value in RLVR, demonstrating that medium-difficulty questions and low-entropy CoTs are optimal for learning.
ExGRPO Framework: A novel algorithm that combines bucketed sampling (based on difficulty) and entropy-based trajectory selection with a mixed-policy objective. It introduces Policy Shaping to balance exploitation of past successes with current exploration.
Stabilization of Weak Models: ExGRPO successfully stabilizes training on models where standard on-policy RLVR fails (e.g., Llama-3.1 8B Base), preventing "entropy explosion" and training collapse.
Theoretical Guarantees: The authors provide theoretical analysis proving the unbiasedness of the experiential gradient under exact importance weighting and derive variance bounds, showing that entropy selection helps control variance.
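The unbiasedness claim rests on the standard importance-sampling identity, sketched here at the trajectory level. This is not necessarily the paper's own proof; the advantage notation $A(\tau)$ and the trajectory-level weight $w(\tau) = \pi_\theta(\tau \mid q) / \pi_{\theta_{past}}(\tau \mid q)$ are assumptions made for illustration, and the identity requires $\pi_{\theta_{past}}(\tau \mid q) > 0$ wherever $\pi_\theta(\tau \mid q) > 0$:

$$
\mathbb{E}_{\tau \sim \pi_{\theta_{past}}}\left[ w(\tau)\, A(\tau)\, \nabla_\theta \log \pi_\theta(\tau \mid q) \right]
= \sum_{\tau} \pi_{\theta_{past}}(\tau \mid q)\, \frac{\pi_\theta(\tau \mid q)}{\pi_{\theta_{past}}(\tau \mid q)}\, A(\tau)\, \nabla_\theta \log \pi_\theta(\tau \mid q)
= \mathbb{E}_{\tau \sim \pi_\theta}\left[ A(\tau)\, \nabla_\theta \log \pi_\theta(\tau \mid q) \right]
$$

The identity holds only for exact weights. Once the weight is transformed by $f(w)$, the estimator trades this exactness for the bounded, exploration-friendly weighting described under Policy Shaping.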

4. Experimental Results

The authors evaluated ExGRPO across five backbone models (ranging from 1.5B to 8B parameters, including Qwen and Llama families) on nine benchmarks.

Performance Gains:
- In-Distribution (Math): ExGRPO achieved an average gain of +3.5 points over on-policy RLVR baselines.
- Out-of-Distribution (General Reasoning): ExGRPO achieved an average gain of +7.6 points, demonstrating superior generalization.
- Specific Benchmarks: Notable improvements were seen on AIME24/25, AMC, and OlympiadBench.
Stability:
- On the Llama-3.1 8B Base model, on-policy RLVR collapsed (performance degraded), whereas ExGRPO enabled successful training and significant improvement.
- On the LUFFY model (a strong model trained with external data), ExGRPO enabled effective continual learning using the model's own experience, outperforming on-policy updates which caused degradation.
Ablation Studies:
- Removing Question Selection or Trajectory Selection significantly reduced performance.
- Using high-entropy trajectories (the opposite of ExGRPO's strategy) led to performance drops, confirming the "snowball effect" of bad reasoning.
- A replay ratio ($\rho$) of 50% was found to be optimal; higher ratios stifled exploration, while lower ratios underutilized the replay buffer.

5. Significance

Efficiency: ExGRPO significantly improves sample efficiency by reusing valuable past computations, reducing the computational cost required to scale reasoning models.
Scalability: By stabilizing training on weaker models and enabling continual learning on stronger ones, ExGRPO provides a pathway for scaling RLVR to larger and more diverse model architectures.
Principled Experience Management: The work shifts the paradigm from "replay everything" to "replay what matters," establishing that the quality and difficulty of experience are as critical as the quantity. This offers a blueprint for future RL-based reasoning systems.

In conclusion, ExGRPO demonstrates that principled experience management, specifically targeting medium-difficulty problems and low-entropy reasoning chains, is a key ingredient for efficient, stable, and scalable reinforcement learning in large reasoning models.
