The Choice of Divergence: A Neglected Key to Mitigating Diversity Collapse in Reinforcement Learning with Verifiable Reward

This paper proposes Diversity-Preserving Hybrid RL (DPH-RL), a framework that mitigates diversity collapse and catastrophic forgetting in RLVR (reinforcement learning with verifiable reward). It uses mass-covering f-divergences as a rehearsal mechanism to maintain broad solution coverage, improving single-attempt accuracy and multi-attempt performance at the same time.

Long Li, Zhijian Zhou, Jiaran Hao, Jason Klein Liu, Yanting Miao, Wei Pang, Xiaoyu Tan, Wei Chu, Zhe Wang, Shirui Pan, Chao Qu, Yuan Qi

Published 2026-03-04

The Big Problem: The "One-Trick Pony" Syndrome

Imagine you have a brilliant student, let's call him Alex, who is great at solving math problems. He doesn't just know one way to solve a problem; he has a whole toolbox of different strategies. Sometimes he draws a picture, sometimes he uses algebra, and sometimes he guesses and checks. If you ask him to solve a problem 10 times, he might come up with 10 slightly different, all-correct solutions. This is Diversity.

Now, you decide to train Alex to be even better using a strict coach (Reinforcement Learning). The coach says, "I don't care how you solve it, as long as you get the right answer. But if you try a new method and fail, I'll punish you. If you stick to the one method that worked yesterday, I'll reward you."

What happens?
Alex stops experimenting. He realizes that the safest bet is to memorize the single "perfect" method that worked last time. He becomes a One-Trick Pony.

  • The Good: If you ask him the exact same type of problem, he gets it right 100% of the time (Pass@1 goes up).
  • The Bad: If you ask him to solve the problem 10 times in a row, he gives you the exact same answer 10 times. If that one answer happens to be wrong for a specific variation, he fails every single time, so multi-attempt performance (Pass@k) collapses. Worse, if you ask him a slightly different type of problem (like switching from algebra to geometry), he forgets how to do it entirely because he stopped practicing his other skills. This is called Catastrophic Forgetting.
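The Pass@1 / Pass@k trade-off Alex illustrates is usually measured with the standard unbiased pass@k estimator (a common evaluation convention, not something introduced by this paper). A minimal Python sketch:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: the probability that at least one of k
    attempts drawn (without replacement) from n sampled solutions, c of
    which are correct, is correct."""
    if n - c < k:
        return 1.0  # every size-k draw must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# A model that keeps 5 distinct correct strategies out of 10 samples:
print(pass_at_k(10, 5, 1))   # single-attempt estimate: 0.5
print(pass_at_k(10, 5, 10))  # multi-attempt estimate: 1.0
```

A collapsed "one-trick pony" that repeats one answer is effectively drawing from a single sample: if that answer is wrong, pass@k stays zero no matter how large k gets.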

This is exactly what is happening to AI models right now. They are getting better at getting the first answer right, but they are losing their ability to think creatively and handle new situations.

The Old Solution: The "Reverse-KL" Trap

For a long time, the AI community tried to control this by adding a penalty based on a mathematical rule called the reverse KL divergence, which pulls the trained model back toward the base model.

The Analogy:
Think of the AI's knowledge as a campfire.

  • The Base Model (before training) is a wide, crackling fire with sparks flying everywhere. It's warm and covers a lot of ground.
  • The Reverse-KL rule acts like a heavy glass dome placed over the fire. It forces all the heat and sparks to concentrate into one tiny, intense point in the center.
  • Result: The center is blazing hot (very accurate on known problems), but the edges are cold. The fire has lost its spread. It can't warm the whole room anymore.

The paper argues that this "glass dome" is actually the cause of the problem, not the solution. It forces the AI to narrow its focus too much, killing its creativity and memory.
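The campfire intuition can be made concrete with a toy numerical sketch (illustrative only, not the paper's code). Reverse KL, KL(q‖p), barely penalizes a policy q that collapses onto one mode of the base model p, because it ignores regions where q puts no mass; forward KL, KL(p‖q), punishes dropped modes heavily:

```python
import numpy as np

# Toy discrete distribution: the base model p spreads mass over 4 strategies.
p = np.array([0.4, 0.3, 0.2, 0.1])

def reverse_kl(q, p):
    """KL(q || p): mode-seeking. Terms where q == 0 vanish, so the
    penalty never 'sees' the strategies the policy abandoned."""
    mask = q > 0
    return float(np.sum(q[mask] * np.log(q[mask] / p[mask])))

def forward_kl(q, p):
    """KL(p || q): mass-covering. Wherever p > 0 but q ~ 0, the log
    ratio blows up, so dropping a mode is extremely expensive."""
    return float(np.sum(p * np.log(p / np.clip(q, 1e-12, None))))

collapsed = np.array([1.0, 0.0, 0.0, 0.0])  # "one-trick pony" policy
covering  = np.array([0.4, 0.3, 0.2, 0.1])  # policy keeping all strategies

print(reverse_kl(collapsed, p))  # small: collapse is cheap under reverse KL
print(forward_kl(collapsed, p))  # large: forward KL punishes dropped modes
```

Under reverse KL the collapsed policy pays only about log(1/0.4) ≈ 0.92, while under forward KL the same collapse costs over 15 nats in this toy example, which is exactly the "glass dome" asymmetry the paper targets.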

The New Solution: DPH-RL (The "Rehearsal" Method)

The authors propose a new framework called DPH-RL (Diversity-Preserving Hybrid RL). Instead of forcing the AI to narrow its focus, they use a mass-covering divergence (forward KL or Jensen-Shannon divergence) that acts like a Rehearsal Mechanism.

The Analogy:
Imagine the AI is an actor preparing for a play.

  • The Old Way (Reverse-KL): The director says, "Forget everything you've ever learned. Only memorize this one line. If you say anything else, you're fired." The actor becomes robotic and forgets their other lines.
  • The New Way (DPH-RL): The director says, "We are going to practice the new scenes, but every day, we must also rehearse the old script."
    • The AI is split into two groups:
      1. The Explorers: For hard problems, the AI is told, "Go wild! Try anything!" (No restrictions).
      2. The Rehearsers: For problems the AI already knows how to solve, the AI is forced to look back at its "old script" (the original model) and say, "Make sure you can still solve this the old way, too."

By constantly "rehearsing" the old knowledge while learning new tricks, the AI keeps its "fire" wide and warm. It doesn't just learn the new trick; it remembers the old ones.
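A heavily simplified, hypothetical sketch of the two-branch idea (the function name, scalar loss form, and `beta` weight are assumptions for illustration, not the authors' implementation): problems the base model already solves get a forward-KL-style rehearsal term, while the rest are optimized with an unconstrained policy-gradient term.

```python
def dph_loss(logp_policy: float, advantage: float,
             solved_by_base: bool, beta: float = 0.1) -> float:
    """Illustrative per-sample loss for a DPH-RL-style split.

    logp_policy: log-probability of the sampled response under the
        current policy.
    solved_by_base: True if this sample belongs to the "rehearsal" set,
        i.e. a problem the frozen base model already solves.
    """
    pg = -advantage * logp_policy  # standard policy-gradient surrogate
    if solved_by_base:
        # "Rehearser" branch: for samples drawn from the base model's own
        # correct solutions, -logp_policy is a Monte Carlo estimate of the
        # policy-dependent part of forward KL(base || policy), so this term
        # pulls the policy toward still covering the old solutions.
        return pg + (-beta * logp_policy)
    return pg  # "Explorer" branch: no divergence constraint at all
```

Note the efficiency point from the paper's framing: the rehearsal term only needs log-probabilities of data the policy already generated, so no second reference model has to run during training.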

Why This Matters (The Results)

The paper tested this on two very different tasks: SQL (writing database code) and Math.

  1. Better at Variety: When asked to solve a problem 10 times, the new AI (DPH-RL) gave 10 different, correct answers. The old AI (Reverse-KL) gave the same answer 10 times.
  2. Better at New Things: When tested on problems it had never seen before (Out-of-Domain), the new AI didn't forget how to solve them. The old AI forgot almost everything.
  3. Efficiency: The new method is actually cheaper to run. It doesn't need a second "teacher" model running in the background; it just uses the data it already generated to do the rehearsing.

The Takeaway

The paper's main message is simple: Don't force your AI to be a perfectionist on just one path.

By changing the mathematical "rule" that guides the AI (switching from Reverse-KL to Forward-KL or JS-Divergence), we stop the AI from forgetting its past and losing its creativity. We turn the AI from a rigid robot that only knows one trick into a versatile expert that can handle many different challenges, just like a human who has practiced both the basics and the advanced techniques.

In short: The key to a smarter, more diverse AI isn't just giving it more rewards; it's giving it a better way to remember who it was before it started training.
