The Choice of Divergence: A Neglected Key to Mitigating Diversity Collapse in Reinforcement Learning with Verifiable Reward

This paper proposes Diversity-Preserving Hybrid RL (DPH-RL), a framework that mitigates diversity collapse and catastrophic forgetting in RLVR (reinforcement learning with verifiable reward). It uses mass-covering f-divergences as a rehearsal mechanism to maintain broad solution coverage, improving single-attempt accuracy and multi-attempt performance at the same time.

Long Li, Zhijian Zhou, Jiaran Hao, Jason Klein Liu, Yanting Miao, Wei Pang, Xiaoyu Tan, Wei Chu, Zhe Wang, Shirui Pan, Chao Qu, Yuan Qi

Published 2026-03-04

The Big Problem: The "One-Trick Pony" Syndrome

Imagine you have a brilliant student, let's call him Alex, who is great at solving math problems. He doesn't just know one way to solve a problem; he has a whole toolbox of different strategies. Sometimes he draws a picture, sometimes he uses algebra, and sometimes he guesses and checks. If you ask him to solve a problem 10 times, he might come up with 10 slightly different, all-correct solutions. This is Diversity.

Now, you decide to train Alex to be even better using a strict coach (Reinforcement Learning). The coach says, "I don't care how you solve it, as long as you get the right answer. But if you try a new method and fail, I'll punish you. If you stick to the one method that worked yesterday, I'll reward you."

What happens?
Alex stops experimenting. He realizes that the safest bet is to memorize the single "perfect" method that worked last time. He becomes a One-Trick Pony.

  • The Good: If you ask him the exact same type of problem, he gets it right 100% of the time (Pass@1 goes up).
  • The Bad: If you ask him to solve the problem 10 times in a row, he gives you the exact same answer 10 times. If that one answer happens to be wrong for a specific variation, he fails every single time, so multi-attempt performance (Pass@k) collapses. Worse, if you ask him a slightly different type of problem (like switching from algebra to geometry), he forgets how to do it entirely because he stopped practicing his other skills. This is called Catastrophic Forgetting.
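The Pass@1 / Pass@k trade-off Alex illustrates is usually measured with the standard unbiased pass@k estimator (a common evaluation convention, not something introduced by this paper). A minimal Python sketch:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: the probability that at least one of k
    attempts drawn (without replacement) from n sampled solutions, c of
    which are correct, is correct."""
    if n - c < k:
        return 1.0  # every size-k draw must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# A model that keeps 5 distinct correct strategies out of 10 samples:
print(pass_at_k(10, 5, 1))   # single-attempt estimate: 0.5
print(pass_at_k(10, 5, 10))  # multi-attempt estimate: 1.0
```

A collapsed "one-trick pony" that repeats one answer is effectively drawing from a single sample: if that answer is wrong, pass@k stays zero no matter how large k gets.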

This is exactly what is happening to AI models right now. They are getting better at getting the first answer right, but they are losing their ability to think creatively and handle new situations.

The Old Solution: The "Reverse-KL" Trap

For a long time, the AI community tried to control this by adding a penalty based on a mathematical rule called the reverse KL divergence, which pulls the trained model back toward the base model.

The Analogy:
Think of the AI's knowledge as a campfire.

  • The Base Model (before training) is a wide, crackling fire with sparks flying everywhere. It's warm and covers a lot of ground.
  • The Reverse-KL rule acts like a heavy glass dome placed over the fire. It forces all the heat and sparks to concentrate into one tiny, intense point in the center.
  • Result: The center is blazing hot (very accurate on known problems), but the edges are cold. The fire has lost its spread. It can't warm the whole room anymore.

The paper argues that this "glass dome" is actually the cause of the problem, not the solution. It forces the AI to narrow its focus too much, killing its creativity and memory.
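The campfire intuition can be made concrete with a toy numerical sketch (illustrative only, not the paper's code). Reverse KL, KL(q‖p), barely penalizes a policy q that collapses onto one mode of the base model p, because it ignores regions where q puts no mass; forward KL, KL(p‖q), punishes dropped modes heavily:

```python
import numpy as np

# Toy discrete distribution: the base model p spreads mass over 4 strategies.
p = np.array([0.4, 0.3, 0.2, 0.1])

def reverse_kl(q, p):
    """KL(q || p): mode-seeking. Terms where q == 0 vanish, so the
    penalty never 'sees' the strategies the policy abandoned."""
    mask = q > 0
    return float(np.sum(q[mask] * np.log(q[mask] / p[mask])))

def forward_kl(q, p):
    """KL(p || q): mass-covering. Wherever p > 0 but q ~ 0, the log
    ratio blows up, so dropping a mode is extremely expensive."""
    return float(np.sum(p * np.log(p / np.clip(q, 1e-12, None))))

collapsed = np.array([1.0, 0.0, 0.0, 0.0])  # "one-trick pony" policy
covering  = np.array([0.4, 0.3, 0.2, 0.1])  # policy keeping all strategies

print(reverse_kl(collapsed, p))  # small: collapse is cheap under reverse KL
print(forward_kl(collapsed, p))  # large: forward KL punishes dropped modes
```

Under reverse KL the collapsed policy pays only about log(1/0.4) ≈ 0.92, while under forward KL the same collapse costs over 15 nats in this toy example, which is exactly the "glass dome" asymmetry the paper targets.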

The New Solution: DPH-RL (The "Rehearsal" Method)

The authors propose a new framework called DPH-RL (Diversity-Preserving Hybrid RL). Instead of forcing the AI to narrow its focus, they use a mass-covering divergence (forward KL or Jensen-Shannon divergence) that acts like a Rehearsal Mechanism.

The Analogy:
Imagine the AI is an actor preparing for a play.

  • The Old Way (Reverse-KL): The director says, "Forget everything you've ever learned. Only memorize this one line. If you say anything else, you're fired." The actor becomes robotic and forgets their other lines.
  • The New Way (DPH-RL): The director says, "We are going to practice the new scenes, but every day, we must also rehearse the old script."
    • The AI is split into two groups:
      1. The Explorers: For hard problems, the AI is told, "Go wild! Try anything!" (No restrictions).
      2. The Rehearsers: For problems the AI already knows how to solve, the AI is forced to look back at its "old script" (the original model) and say, "Make sure you can still solve this the old way, too."

By constantly "rehearsing" the old knowledge while learning new tricks, the AI keeps its "fire" wide and warm. It doesn't just learn the new trick; it remembers the old ones.
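A heavily simplified, hypothetical sketch of the two-branch idea (the function name, scalar loss form, and `beta` weight are assumptions for illustration, not the authors' implementation): problems the base model already solves get a forward-KL-style rehearsal term, while the rest are optimized with an unconstrained policy-gradient term.

```python
def dph_loss(logp_policy: float, advantage: float,
             solved_by_base: bool, beta: float = 0.1) -> float:
    """Illustrative per-sample loss for a DPH-RL-style split.

    logp_policy: log-probability of the sampled response under the
        current policy.
    solved_by_base: True if this sample belongs to the "rehearsal" set,
        i.e. a problem the frozen base model already solves.
    """
    pg = -advantage * logp_policy  # standard policy-gradient surrogate
    if solved_by_base:
        # "Rehearser" branch: for samples drawn from the base model's own
        # correct solutions, -logp_policy is a Monte Carlo estimate of the
        # policy-dependent part of forward KL(base || policy), so this term
        # pulls the policy toward still covering the old solutions.
        return pg + (-beta * logp_policy)
    return pg  # "Explorer" branch: no divergence constraint at all
```

Note the efficiency point from the paper's framing: the rehearsal term only needs log-probabilities of data the policy already generated, so no second reference model has to run during training.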

Why This Matters (The Results)

The paper tested this on two very different tasks: SQL (writing database code) and Math.

  1. Better at Variety: When asked to solve a problem 10 times, the new AI (DPH-RL) gave 10 different, correct answers. The old AI (Reverse-KL) gave the same answer 10 times.
  2. Better at New Things: When tested on problems it had never seen before (Out-of-Domain), the new AI didn't forget how to solve them. The old AI forgot almost everything.
  3. Efficiency: The new method is actually cheaper to run. It doesn't need a second "teacher" model running in the background; it just uses the data it already generated to do the rehearsing.

The Takeaway

The paper's main message is simple: Don't force your AI to be a perfectionist on just one path.

By changing the mathematical "rule" that guides the AI (switching from Reverse-KL to Forward-KL or JS-Divergence), we stop the AI from forgetting its past and losing its creativity. We turn the AI from a rigid robot that only knows one trick into a versatile expert that can handle many different challenges, just like a human who has practiced both the basics and the advanced techniques.

In short: The key to a smarter, more diverse AI isn't just giving it more rewards; it's giving it a better way to remember who it was before it started training.
