Depth-Breadth Synergy in RLVR: Unlocking LLM Reasoning Gains with Adaptive Exploration

This paper introduces DARS-Breadth, a novel RLVR framework that combines difficulty-adaptive rollout sampling, which deepens exploration of hard problems (depth), with large-batch training, which increases data diversity (breadth). Together, these unlock significant and simultaneous improvements in both Pass@K and Pass@1 reasoning metrics for Large Language Models.

Zhicheng Yang, Zhijiang Guo, Yinya Huang, Yongxin Wang, Dongchun Xie, Hanhui Li, Yiwei Wang, Xiaodan Liang, Jing Tang

Published 2026-04-14

Imagine you are training a brilliant but slightly stubborn student (the AI) to solve incredibly difficult math problems. You want them to get better at two things:

  1. Solving the hardest problems (even if they have to try many times to get it right).
  2. Getting the right answer on the very first try (so they are fast and reliable).

For a long time, the standard way of training these students (called RLVR) had a major flaw. It was like a teacher who only gave extra attention to "medium-difficulty" homework. If a problem was too easy, the teacher ignored it. If a problem was too hard and the student kept getting it wrong, the teacher gave up on it, thinking, "This student just can't do this."

The authors of this paper realized this was a mistake. They found that to truly unlock the student's genius, you need to balance Depth (how hard the problems are) and Breadth (how many problems you practice at once).

Here is the breakdown of their solution, DARS, using simple analogies:

1. The Problem: The "Fairness" Trap

In the old method (GRPO), the teacher treated every student's attempt equally.

  • The Issue: If a student tried 8 times on a super-hard problem and failed every time, every attempt looked equally bad. With no difference between attempts to grade against, the teacher concluded, "This problem isn't worth any attention," and it contributed no learning signal at all.
  • The Result: The student never learned the hard stuff. They got good at medium problems but hit a ceiling on the really tough ones.
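The "fairness trap" can be sketched with a tiny group-normalized advantage function, in the spirit of GRPO (this is an illustration, not the paper's implementation):

```python
def group_advantages(rewards, eps=1e-6):
    """Advantage of each rollout: reward minus the group mean, scaled by the group std."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

# A hard problem where all 8 attempts fail (reward 0): every advantage is
# exactly zero, so this problem produces no gradient and teaches nothing.
hard = group_advantages([0] * 8)

# A mixed group: the one success gets a positive advantage, failures negative.
mixed = group_advantages([1, 0, 0, 0])
```

The all-zero case is exactly the ceiling described above: the hardest problems silently drop out of training.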

2. The Solution: "Depth" (DARS)

The authors introduced DARS (Difficulty Adaptive Rollout Sampling). Think of this as a smart tutor who changes their strategy based on the student's struggle.

  • The Analogy: Imagine the student is trying to climb a mountain.
    • Old Method: The teacher says, "Try 8 times. If you fall, move to the next mountain."
    • DARS Method: The teacher watches the first few attempts.
      • If the student is breezing through the easy hill, the teacher says, "Good job, move on." (Waste less time).
      • If the student is stuck on a steep, rocky cliff (a hard problem), the teacher says, "Okay, this is tough. Let's try 32 times instead of 8. We will keep attacking this wall from different angles until we find a path that works."
  • The Magic: By spending more "tries" (rollouts) on the hard problems, the student eventually finds the solution. This is Depth. It teaches the model to think deeper and solve harder puzzles.
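The tutor's budget rule can be sketched in a few lines of Python. The probe size (8) and expanded budget (32) below come from the analogy; the "escalate only if every probe fails" threshold is a simplifying assumption, not the paper's exact schedule:

```python
def adaptive_rollouts(solve_attempt, probe=8, max_rollouts=32):
    """Run a small probe; if every attempt fails, escalate to the full budget.

    solve_attempt() returns True for a correct solution, False otherwise.
    """
    results = [solve_attempt() for _ in range(probe)]
    if any(results):
        # The easy hill: at least one success, so the small budget suffices.
        return results
    # The rocky cliff: all probes failed, so spend the larger budget here.
    results += [solve_attempt() for _ in range(max_rollouts - probe)]
    return results
```

A problem the model always fails receives all 32 rollouts, while one it cracks during the probe stays at 8, so the extra compute flows to exactly the problems that need it.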

3. The Secret Ingredient: "Breadth"

The authors also discovered something surprising about Breadth (the number of different problems you practice in one session).

  • The Analogy: Imagine the student is practicing piano.
    • Small Breadth: They practice 128 songs, but they only play 8 notes for each. They get bored and stop thinking creatively.
    • Large Breadth: They practice 3,000 songs, playing a few notes for each.
  • The Result: When the student practices more songs at once (Large Breadth), they stay "awake" and curious. They don't get stuck in a rut. In AI terms, this keeps the entropy (randomness/creativity) high. It prevents the student from memorizing one specific trick and forgetting how to think generally. This boosts their ability to get the answer right on the first try (Pass@1).
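The "staying awake" idea can be made concrete with an entropy measure over sampled solution strategies. This Counter-based version is a simplified stand-in: the paper's notion of entropy concerns the model's output distribution, not discrete strategy labels.

```python
import math
from collections import Counter

def empirical_entropy(samples):
    """Shannon entropy (in nats) of the empirical distribution of samples."""
    counts = Counter(samples)
    n = len(samples)
    return -sum((c / n) * math.log(c / n) for c in counts.values())

# A model stuck in a rut repeats one trick: entropy collapses to zero.
collapsed = empirical_entropy(["trick_A"] * 12)

# A model exploring several approaches keeps entropy high, which is what
# large-breadth batches are meant to preserve.
diverse = empirical_entropy(["A", "B", "C", "D"] * 3)
```

Large breadth keeps the model closer to the second case: varied problems in every update, so no single trick dominates.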

4. The Grand Finale: The Synergy

The paper's biggest discovery is that Depth and Breadth are best friends, not enemies.

  • Depth (DARS) makes the model smart enough to solve the hardest problems if given enough tries (Pass@K).
  • Breadth (Large Batch) makes the model reliable enough to solve problems quickly on the first try (Pass@1).

When you combine them (DARS-Breadth), you get a student who is both a deep thinker (can solve complex riddles) and a fast worker (gets it right immediately).

Summary of the "Magic Formula"

  1. Don't be fair to everyone: Give more practice time to the problems the student is struggling with (Depth).
  2. Don't be too narrow: Practice a huge variety of problems at once to keep the student's mind fresh and creative (Breadth).
  3. Combine them: Use the smart tutor (DARS) on a massive practice schedule (Breadth).

The Result: The AI doesn't just get slightly better; it unlocks a new level of reasoning capability, solving math problems that were previously impossible, and doing so more reliably than ever before. It's like turning a smart calculator into a genius mathematician.
