The Big Problem: The "Perfect" Student Who Forgot How to Think
Imagine you have a brilliant student (a Large Language Model, or LLM) who is very good at solving math problems. They have a natural way of thinking that is creative and diverse. Sometimes they solve a problem in a clever, unexpected way; other times, they use a standard method.
Now, imagine you hire a strict tutor (Reinforcement Learning) to help this student get perfect scores. The tutor says: "If you get the answer right, you get a gold star. If you get it wrong, you get a red X."
Over time, the student learns to only give the answers that guarantee a gold star. They stop trying the creative, risky, or unusual methods because they might fail. They become hyper-focused on one specific way to solve things.
The result? The student gets perfect scores on the tests they practice, but they lose their creativity. If a new, tricky problem appears that requires a different approach, the student freezes because they've forgotten how to think outside the box. In the paper, this is called "Mode Collapse" or losing diversity.
The Paper's Solution: The "Filter" vs. The "Tutor"
The authors argue that the problem isn't the goal (getting the right answer); the problem is how the student is being taught to get there.
They propose a new method called DMVR (Distributional Matching with Verifiable Rewards). Instead of just rewarding the "gold star" answers and punishing the rest, they change the rules of the game.
The Analogy: The Sieve and the River
Imagine the student's original thinking process is a river flowing in many directions. Some paths lead to the ocean (correct answers), and some lead to dead ends (wrong answers).
- Old Method (RL/GRPO): The tutor tries to force the river into a single, narrow canal that leads directly to the ocean. It works great for getting water to the ocean, but the river becomes a stagnant ditch. All the other paths dry up.
- New Method (DMVR): Instead of forcing the river, the authors put a sieve (a filter) over the river.
- They let the river flow naturally.
- They catch the water that hits the "dead ends" (wrong answers) and throw it away.
- They keep the water that hits the "ocean" (correct answers).
- Crucially: They don't force the water to flow only one way to the ocean. They keep the natural flow of the river, just removing the bad parts.
This way, the student still learns to give correct answers, but they keep their diverse, creative ways of getting there.
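The sieve idea can be pictured in a few lines of code. This is a toy sketch, not the paper's training algorithm: the path names and probabilities are made up, and "filtering" here is just dropping wrong answers and renormalizing, which preserves the model's natural proportions among the correct ones.

```python
# Toy "river": the model's natural distribution over four solution paths.
# Two reach the "ocean" (correct answers); two are dead ends.
base_probs = {"clever": 0.1, "standard": 0.4, "dead_end_a": 0.3, "dead_end_b": 0.2}
correct = {"clever", "standard"}

def sieve(probs, correct):
    """Keep only the correct paths and renormalize.
    The relative proportions among correct paths are unchanged --
    the 'filter over the river' idea."""
    kept = {path: w for path, w in probs.items() if path in correct}
    total = sum(kept.values())
    return {path: w / total for path, w in kept.items()}

filtered = sieve(base_probs, correct)
# The clever path keeps its 1:4 ratio to the standard path:
# {"clever": 0.2, "standard": 0.8}
```

Notice what the sieve does *not* do: it never pushes all the probability onto the single safest path, which is exactly the collapse the tutor analogy warns about.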
The Secret Sauce: The "Dial" (Alpha)
The paper introduces a special control knob called α (alpha). This dial lets you decide exactly how much "diversity" you want to keep versus how much "precision" (getting the right answer every time) you want.
- Turn the dial to "Precision" (High Alpha): The student becomes like a robot. They pick the single most likely correct answer. They are very accurate, but less creative. This is similar to the old methods.
- Turn the dial to "Diversity" (Low Alpha): The student becomes like an explorer. They try many different paths to the correct answer. They might make a few mistakes, but they are much more likely to find a solution to a hard problem that no one else could solve.
- Turn the dial to "The Middle" (Balanced Alpha): You get the best of both worlds: high accuracy on routine problems, while several viable solution paths stay alive for the hard ones.
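One plausible way to picture the dial in code (a sketch of the general idea, not the paper's exact formula): raise each correct path's base probability to the power alpha before renormalizing. A large alpha concentrates nearly all the mass on the single most likely correct path; a small alpha flattens toward a near-uniform spread over all correct paths.

```python
def tilt(base_probs, correct, alpha):
    """Illustrative 'alpha dial': keep only correct paths, raise each
    base probability to the power alpha, then renormalize.
    Large alpha -> concentrate on the likeliest correct path (precision).
    Small alpha -> spread nearly evenly over correct paths (diversity)."""
    kept = {path: w ** alpha for path, w in base_probs.items() if path in correct}
    total = sum(kept.values())
    return {path: w / total for path, w in kept.items()}

# Hypothetical numbers for illustration only.
base = {"clever": 0.1, "standard": 0.4, "dead_end": 0.5}
correct = {"clever", "standard"}

precision = tilt(base, correct, alpha=8.0)   # almost all mass on "standard"
diversity = tilt(base, correct, alpha=0.1)   # close to 50/50 over correct paths
```

At alpha = 1 this reduces to the plain sieve from the previous analogy: the natural river, minus the dead ends.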
Why Does This Matter? (The "Lean" Experiment)
The authors tested this on Lean, an interactive theorem prover: a program in which mathematical proofs are written formally so a computer can check every step.
- The Challenge: Proving a hard theorem is like finding a needle in a haystack. Sometimes, the only way to find the needle is to try a million different search strategies.
- The Result:
- The old methods (like GRPO) were great at finding the needle if it was easy to see, but they often gave up on hard problems because they stopped trying different search strategies.
- The new method (α-DPG) kept trying many different strategies. Even if the student didn't get the answer on the first try, they were much more likely to find it when allowed 256 different attempts.
The Takeaway
The paper teaches us a valuable lesson about AI and learning: You don't have to sacrifice creativity to get accuracy.
By changing how we "filter" the AI's learning process—rather than just punishing it for mistakes—we can create models that are not only smart and accurate but also diverse and robust. They can handle the easy problems with precision and the hard problems with creative exploration.
In short: Don't just teach the AI to be right; teach it to keep all its options open while filtering out the wrong ones. "Whatever remains, however improbable, must be the truth."