DyJR: Preserving Diversity in Reinforcement Learning with Verifiable Rewards via Dynamic Jensen-Shannon Replay

The Big Problem: The "Forgetful Genius"

Imagine you are training a brilliant student (an AI) to solve incredibly hard math problems. You use a method called Reinforcement Learning (RL).

The Old Way (GRPO): The student tries to solve a problem. If they get it right, you say "Good job!" and they learn from it. If they get it wrong, you say "Try again."
The Flaw: In this old method, once the student finishes a set of problems, you throw away their scratch paper. You only learn from the current batch of attempts. This is incredibly wasteful. It's like a chef tasting a soup, deciding it needs salt, and then throwing away the entire pot of soup before making the next batch. You are wasting all the "good attempts" the student made in the past.

The Current Fix (and why it fails)

Other researchers tried to fix this by keeping a "Replay Buffer"—a giant notebook where they write down every correct answer the student ever got. They force the student to study this notebook over and over.

The Problem with this approach:

It's too heavy: Storing every single correct answer takes up a massive amount of computer memory (like trying to carry a library in your backpack).
It makes the student rigid: If you force the student to memorize only the specific way they solved a problem yesterday, they stop thinking creatively. They become a "parrot" that repeats the same solution. If the problem changes slightly, they fail because they forgot how to explore new paths. This is called Mode Collapse (getting stuck in one narrow way of thinking).

The New Solution: DyJR (Dynamic Jensen-Shannon Replay)

The authors of the DyJR paper propose a smarter way to use past data. They argue: "Don't just memorize the answers; remember the variety of ways you tried to solve them."

Here is how DyJR works, using three simple concepts:

1. The "Fresh Fruit" Rule (Dynamic Buffer)

Imagine you are running a juice bar. You have a fridge (the buffer) to store fruit (past solutions).

Old Method: You stuff the fridge with fruit from last year, last month, and today. The old fruit rots and tastes bad, confusing the customers.
DyJR Method: You only keep the freshest fruit from the last few hours.
- Why? The AI changes very fast. A solution that was "correct" 100 steps ago might be weird or irrelevant now.
- The Trick: When the AI is just starting (the "chaos phase"), the fridge gets bigger to catch all the wild, creative attempts. As the AI gets better and calmer, the fridge shrinks to keep only the very latest, most relevant attempts. This saves massive amounts of memory.

2. The "Safety Net" (Jensen-Shannon Divergence)

Instead of forcing the AI to copy the old answers (which makes it rigid), DyJR uses a Safety Net.

Old Method: "You must solve this problem exactly like you did yesterday." (This kills creativity).
DyJR Method: "You must solve this problem, but don't wander too far away from the variety of ways you tried recently."
The Analogy: Imagine a dog on a walk.
- Old Method: The dog is on a short leash tied to a specific tree. It can't move.
- DyJR: The dog is on a long, elastic leash. It can run around, sniff new bushes, and explore (diversity), but the leash gently pulls it back if it runs into a ditch (prevents it from forgetting how to solve problems).
- This "pull" is the Jensen-Shannon (JS) Divergence. It's a mathematical way of saying, "Stay close to the group of good ideas, but don't just copy one specific idea."

3. The "High-Five" Strategy (Adaptive Selection)

Not all past answers are equal.

If the AI solved an easy problem 10 times in a row, that's boring.
If the AI solved a hard problem, even if it took 5 tries to get it right, that's gold.
DyJR is smart about what it saves. It prioritizes keeping the "hard-won" victories and the "creative attempts" from the early days, rather than just saving everything.

The Results: Why It Matters

The researchers tested this on two kinds of tasks:

Math Problems: The AI got significantly better at solving complex logic puzzles.
SQL (Database Queries): The AI learned to write better code to ask databases for information.

The Magic:

Better Scores: The AI solved more problems correctly than the previous best methods.
Less Memory: It didn't need a giant computer to store old data. It was efficient.
More Creativity: The AI didn't get stuck in a rut. It kept exploring different ways to solve problems, which is crucial for intelligence.

Summary in One Sentence

DyJR is like a smart teacher who keeps a small, fresh notebook of the student's best recent attempts and uses it to gently guide the student's creativity, ensuring they don't forget how to think outside the box while still learning the right answers.

1. Problem Statement

Reinforcement Learning with Verifiable Rewards (RLVR) has become a cornerstone for enhancing the reasoning capabilities of Large Language Models (LLMs), particularly through algorithms like Group Relative Policy Optimization (GRPO). However, current RLVR approaches face two critical bottlenecks:

Sample Inefficiency: On-policy algorithms discard valuable historical rollout data after a single update, wasting computational resources and preventing the model from learning from past successes.
Mode Collapse & Overfitting: Existing Experience Replay (ER) methods attempt to solve sample inefficiency by reusing historical trajectories as direct positive samples for policy gradient updates. The authors argue this approach is flawed because:
1. Indiscriminate Forward Updates: Directly maximizing the likelihood of historical data forces the model to overfit specific solution paths, eroding its exploratory potential and causing "mode collapse" (where the model relies on a single reasoning path).
2. Resource Inefficiency: Traditional ER methods often require massive buffers to store entire trajectory histories, leading to prohibitive GPU memory overhead.
3. Misguided Objective: The authors posit that historical data should be used to sustain diversity rather than simply reinforce accuracy.

2. Methodology: DyJR (Dynamic Jensen-Shannon Replay)

The authors propose DyJR, a regularization framework that redefines the role of experience replay from accuracy optimization to diversity preservation. It consists of two core innovations:

A. Time-Sensitive Dynamic Buffer (Data Construction)

Instead of storing all historical data, DyJR employs a non-uniform, dynamic buffer strategy:

FIFO & Max Age Constraint: The buffer strictly retains only samples generated within a specific "Max Age" ($M$) window (e.g., the last 8 steps). It uses a First-In-First-Out (FIFO) protocol to evict stale data, ensuring the reference distribution tracks the current model's evolving capability boundary.
Bias-Aware Adaptive Selection: To handle varying task difficulties, the buffer uses a "High-to-Low" confidence admission strategy. It prioritizes high-confidence correct samples ($C_{id} = G$) but relaxes criteria for harder tasks to capture rare solutions, preventing data starvation.
Time-Aware Adaptive Schedule: During the initial "warm-up" phase (first ~20 steps), the buffer fill rate is temporarily increased (from 5% to 20%) to capture high-entropy exploration patterns before the policy collapses, effectively smoothing the optimization trajectory.
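
The three buffer rules above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the class and field names (`DynamicReplayBuffer`, `Sample`, `n_correct`) are hypothetical, "fill rate" is read here as the fraction of a batch admitted per step, and "High-to-Low" admission is approximated by ranking candidates by how many rollouts in their group were correct.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Sample:
    step: int        # training step at which the rollout was generated
    prompt_id: str
    n_correct: int   # correct rollouts in this prompt's group (hypothetical field)


class DynamicReplayBuffer:
    """Toy FIFO buffer with a max-age window and a warm-up fill rate."""

    def __init__(self, max_age=8, warmup_steps=20, warmup_rate=0.20, base_rate=0.05):
        self.max_age = max_age
        self.warmup_steps = warmup_steps
        self.warmup_rate = warmup_rate
        self.base_rate = base_rate
        self.items = deque()  # oldest samples sit on the left

    def fill_rate(self, step):
        # Admit more samples while the policy is still in its high-entropy phase.
        return self.warmup_rate if step < self.warmup_steps else self.base_rate

    def evict_stale(self, step):
        # FIFO eviction: drop everything generated max_age or more steps ago.
        while self.items and step - self.items[0].step >= self.max_age:
            self.items.popleft()

    def add_step(self, step, candidates, batch_size):
        """Evict stale data, then admit the highest-confidence candidates."""
        self.evict_stale(step)
        quota = max(1, round(self.fill_rate(step) * batch_size))
        # High-to-low admission (assumed reading): most-correct groups first,
        # lower-confidence samples only if quota remains.
        ranked = sorted(candidates, key=lambda s: s.n_correct, reverse=True)
        admitted = ranked[:quota]
        self.items.extend(admitted)
        return admitted


buffer = DynamicReplayBuffer()
for step in range(30):
    batch = [Sample(step, f"p{i}", n_correct=i % 9) for i in range(100)]
    buffer.add_step(step, batch, batch_size=100)
```

With these toy settings the buffer admits 20 samples per step during warm-up and 5 per step afterwards, so after 30 steps it holds only the last 8 steps of data (40 samples) rather than the full history.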

B. Jensen-Shannon Divergence Regularization (Data Utilization)

Rather than using historical data for direct gradient updates (which causes overfitting), DyJR uses them as a distributional anchor:

The Mechanism: The algorithm treats the mixture of historical policies in the buffer as a reference distribution. It minimizes the Jensen-Shannon (JS) Divergence between the current policy and this reference mixture.
Why JS Divergence?
- Unlike Forward KL (which is mode-covering and can force the policy to average across diverse samples, leading to over-smoothing), JS divergence is symmetric and bounded.
- It provides a robust regularization signal that prevents the model from drifting too far from diverse successful paths without aggressively altering the optimization direction.
Implementation: The authors use a low-variance generative estimator to compute the JS divergence loss ($L_{JS}$) without re-forwarding the model, making it computationally efficient.
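
To make the "symmetric and bounded" argument concrete, the sketch below computes the exact JS divergence and forward KL on small discrete distributions. It is only a toy illustration of the textbook definitions: it is not the low-variance generative estimator the authors use, and the example distributions (`policy`, `reference`, `disjoint_a`, `disjoint_b`) are made up.

```python
import math


def kl(p, q):
    """KL(p || q) in nats; terms with p_i == 0 contribute nothing."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi > 0:
            if qi == 0:
                return math.inf  # p puts mass where q has none
            total += pi * math.log(pi / qi)
    return total


def js(p, q):
    """Jensen-Shannon divergence: average KL of p and q to their midpoint."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


# A nearly collapsed policy versus a more spread-out reference mixture.
policy = [0.97, 0.01, 0.01, 0.01]
reference = [0.40, 0.30, 0.20, 0.10]

# Two distributions with no overlap at all: the worst case for JS.
disjoint_a = [1.0, 0.0]
disjoint_b = [0.0, 1.0]

print("JS(policy, reference):", js(policy, reference))
print("JS(reference, policy):", js(reference, policy))
print("JS on disjoint supports:", js(disjoint_a, disjoint_b), "ln 2 =", math.log(2))
print("KL on disjoint supports:", kl(disjoint_a, disjoint_b))
```

Swapping the arguments leaves the JS value unchanged, and even on disjoint supports it never exceeds ln 2, whereas KL becomes infinite in that case. That bounded penalty is what lets the term act as a gentle anchor rather than a hard constraint.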

Optimization Objective:
The total loss function combines the standard on-policy GRPO loss with the JS regularization term:
$L_{total}(\theta) = L_{GRPO}(\theta) + \alpha_{JS} \cdot L_{JS}(\theta)$
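
For reference, the regularizer can be written out with the textbook definition of JS divergence. This is a sketch under assumptions: the symbols $\pi_\theta$ (current policy), $\pi_{\text{buf}}$ (the mixture of historical policies represented by the buffer), and $m$ (their midpoint) are introduced here for illustration, and the source does not spell out the exact estimator behind $L_{JS}$.

$$
\mathrm{JS}(\pi_\theta \,\|\, \pi_{\text{buf}}) = \tfrac{1}{2}\,\mathrm{KL}(\pi_\theta \,\|\, m) + \tfrac{1}{2}\,\mathrm{KL}(\pi_{\text{buf}} \,\|\, m), \qquad m = \tfrac{1}{2}\,(\pi_\theta + \pi_{\text{buf}})
$$

$$
0 \le \mathrm{JS}(\pi_\theta \,\|\, \pi_{\text{buf}}) \le \ln 2
$$

If $L_{JS}$ equals this divergence exactly, then with $\alpha_{JS} = 0.05$ the regularizer adds at most $0.05 \cdot \ln 2 \approx 0.035$ to $L_{total}$, which illustrates why the term nudges the policy rather than overriding the GRPO objective.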

3. Key Contributions

Paradigm Shift: Redefines Experience Replay in RLVR from "accuracy optimization" to "diversity regularization," arguing that historical data's primary value is preserving exploration patterns.
Dynamic Data Construction: Introduces a time-sensitive buffer that adapts its capacity based on training stages (expanding during volatile early phases) and employs FIFO to ensure data freshness, significantly reducing memory overhead compared to static large buffers.
JS Divergence Regularization: Proposes using JS divergence as a constraint to prevent mode collapse, demonstrating its superiority over Forward KL and standard gradient updates in maintaining a healthy token probability distribution.
Efficiency: Achieves performance gains with negligible GPU memory overhead and training time comparable to the original GRPO.

4. Experimental Results

The authors evaluated DyJR on Mathematical Reasoning (using Qwen3-4B) and Text-to-SQL (using Llama-3.1-8B) tasks.

Mathematical Reasoning Benchmarks:
- DyJR achieved an average accuracy of 34.1% across six benchmarks, outperforming the GRPO baseline (29.8%) by 4.3 percentage points.
- It significantly outperformed other replay-based methods like RLEP (31.7%) and Ex-GRPO (32.8%), as well as static JS-constraint methods like DPH-RL (31.3%).
- Ablation Studies:
  - JS vs. Forward KL: JS divergence outperformed Forward KL (34.1% vs. 32.5%), which the authors take as evidence that symmetric constraints are better suited to non-stationary replay buffers.
  - Max Age ($M$): Performance peaked at $M=8$ and declined as $M$ increased, supporting the need for temporally proximal data.
  - Coefficient ($\alpha_{JS}$): Optimal performance was found at $\alpha_{JS} = 0.05$; higher values restricted exploration, while lower values failed to prevent drift.
Text-to-SQL Tasks:
- DyJR achieved State-of-the-Art (SOTA) results, improving Pass@1 by +3.3% (BIRD) and +5.0% (Spider) over GRPO, demonstrating strong cross-domain generalization.
Diversity Analysis (Rank-k Token Probabilities):
- GRPO exhibited rapid entropy collapse, with Rank-1 token probabilities surging to >90% early in training.
- DyJR maintained a healthy distribution, keeping Rank-1 probabilities lower and redistributing mass to Rank-2 and Rank-3 tokens, indicating sustained exploration.
- Pass@k Scaling: DyJR showed superior scalability, continuing to improve performance as the sampling budget ($k$) increased up to 1024, whereas GRPO stagnated.
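
The Rank-k token probabilities used in the diversity analysis can be illustrated with a small sketch. It assumes the metric is simply the k-th largest next-token probability at a single position; how the paper aggregates over positions and prompts is not stated here, and the logits below are invented for the example.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def rank_k_probs(logits, k=3):
    """Return the k largest token probabilities, highest first."""
    return sorted(softmax(logits), reverse=True)[:k]


def entropy(probs):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0)


# Hypothetical next-token logits at one position.
collapsed_logits = [10.0, 2.0, 1.0, 0.0]   # one token dominates
diverse_logits = [2.0, 1.5, 1.0, 0.0]      # mass spread over several tokens

for name, logits in [("collapsed", collapsed_logits), ("diverse", diverse_logits)]:
    probs = softmax(logits)
    print(name, "rank-1..3:", rank_k_probs(logits), "entropy:", entropy(probs))
```

In the collapsed case the Rank-1 probability sits above 90% and the entropy is close to zero, which is the pattern reported for GRPO; the diverse case keeps meaningful mass on the Rank-2 and Rank-3 tokens.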

5. Significance

Scalability: DyJR addresses the scalability bottleneck of RLVR by enabling the reuse of historical data without the memory costs of traditional replay or the diversity collapse of direct updates.
Training Dynamics: The paper provides a crucial insight that the "value" of historical data lies in the early high-entropy exploration patterns, not just the high-accuracy trajectories of later stages.
Generalizability: The method does not depend on a particular model and was effective on both tested architectures (Qwen, Llama) and both task types (Math, SQL), offering a robust way to enhance LLM reasoning without prohibitive computational costs.

In summary, DyJR offers a lightweight, effective regularization framework that preserves the "exploratory spirit" of RL training, preventing models from converging prematurely to suboptimal local solutions while maintaining high training efficiency.