Depth-Breadth Synergy in RLVR: Unlocking LLM Reasoning Gains with Adaptive Exploration

This paper introduces DARS-Breadth, a novel RLVR framework that combines difficulty-adaptive rollout sampling, which deepens exploration of hard problems (depth), with large-batch training, which increases data diversity (breadth). Together, these unlock significant and simultaneous improvements in both Pass@K and Pass@1 reasoning metrics for Large Language Models.

Zhicheng Yang, Zhijiang Guo, Yinya Huang, Yongxin Wang, Dongchun Xie, Hanhui Li, Yiwei Wang, Xiaodan Liang, Jing Tang

Published 2026-04-14

Imagine you are training a brilliant but slightly stubborn student (the AI) to solve incredibly difficult math problems. You want them to get better at two things:

  1. Solving the hardest problems (even if they have to try many times to get it right).
  2. Getting the right answer on the very first try (so they are fast and reliable).

For a long time, the standard way of training these students (called RLVR) had a major flaw. It was like a teacher who only gave extra attention to "medium-difficulty" homework. If a problem was too easy, the teacher ignored it. If a problem was too hard and the student kept getting it wrong, the teacher gave up on it, thinking, "This student just can't do this."

The authors of this paper realized this was a mistake. They found that to truly unlock the student's genius, you need to balance Depth (how hard the problems are) and Breadth (how many problems you practice at once).

Here is the breakdown of their solution, DARS, using simple analogies:

1. The Problem: The "Fairness" Trap

In the old method (GRPO), the teacher treated every student's attempt equally.

  • The Issue: If a student tried 8 times on a super-hard problem and failed every time, every attempt looked equally bad. With no difference between attempts to grade against, the teacher concluded, "This problem isn't worth any attention," and it contributed no learning signal at all.
  • The Result: The student never learned the hard stuff. They got good at medium problems but hit a ceiling on the really tough ones.
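The "fairness trap" can be sketched with a tiny group-normalized advantage function, in the spirit of GRPO (this is an illustration, not the paper's implementation):

```python
def group_advantages(rewards, eps=1e-6):
    """Advantage of each rollout: reward minus the group mean, scaled by the group std."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

# A hard problem where all 8 attempts fail (reward 0): every advantage is
# exactly zero, so this problem produces no gradient and teaches nothing.
hard = group_advantages([0] * 8)

# A mixed group: the one success gets a positive advantage, failures negative.
mixed = group_advantages([1, 0, 0, 0])
```

The all-zero case is exactly the ceiling described above: the hardest problems silently drop out of training.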

2. The Solution: "Depth" (DARS)

The authors introduced DARS (Difficulty Adaptive Rollout Sampling). Think of this as a smart tutor who changes their strategy based on the student's struggle.

  • The Analogy: Imagine the student is trying to climb a mountain.
    • Old Method: The teacher says, "Try 8 times. If you fall, move to the next mountain."
    • DARS Method: The teacher watches the first few attempts.
      • If the student is breezing through the easy hill, the teacher says, "Good job, move on." (Waste less time).
      • If the student is stuck on a steep, rocky cliff (a hard problem), the teacher says, "Okay, this is tough. Let's try 32 times instead of 8. We will keep attacking this wall from different angles until we find a path that works."
  • The Magic: By spending more "tries" (rollouts) on the hard problems, the student eventually finds the solution. This is Depth. It teaches the model to think deeper and solve harder puzzles.
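The tutor's budget rule can be sketched in a few lines of Python. The probe size (8) and expanded budget (32) below come from the analogy; the "escalate only if every probe fails" threshold is a simplifying assumption, not the paper's exact schedule:

```python
def adaptive_rollouts(solve_attempt, probe=8, max_rollouts=32):
    """Run a small probe; if every attempt fails, escalate to the full budget.

    solve_attempt() returns True for a correct solution, False otherwise.
    """
    results = [solve_attempt() for _ in range(probe)]
    if any(results):
        # The easy hill: at least one success, so the small budget suffices.
        return results
    # The rocky cliff: all probes failed, so spend the larger budget here.
    results += [solve_attempt() for _ in range(max_rollouts - probe)]
    return results
```

A problem the model always fails receives all 32 rollouts, while one it cracks during the probe stays at 8, so the extra compute flows to exactly the problems that need it.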

3. The Secret Ingredient: "Breadth"

The authors also discovered something surprising about Breadth (the number of different problems you practice in one session).

  • The Analogy: Imagine the student is practicing piano.
    • Small Breadth: They practice 128 songs, but they only play 8 notes for each. They get bored and stop thinking creatively.
    • Large Breadth: They practice 3,000 songs, playing a few notes for each.
  • The Result: When the student practices more songs at once (Large Breadth), they stay "awake" and curious. They don't get stuck in a rut. In AI terms, this keeps the entropy (randomness/creativity) high. It prevents the student from memorizing one specific trick and forgetting how to think generally. This boosts their ability to get the answer right on the first try (Pass@1).
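The "staying awake" idea can be made concrete with an entropy measure over sampled solution strategies. This Counter-based version is a simplified stand-in: the paper's notion of entropy concerns the model's output distribution, not discrete strategy labels.

```python
import math
from collections import Counter

def empirical_entropy(samples):
    """Shannon entropy (in nats) of the empirical distribution of samples."""
    counts = Counter(samples)
    n = len(samples)
    return -sum((c / n) * math.log(c / n) for c in counts.values())

# A model stuck in a rut repeats one trick: entropy collapses to zero.
collapsed = empirical_entropy(["trick_A"] * 12)

# A model exploring several approaches keeps entropy high, which is what
# large-breadth batches are meant to preserve.
diverse = empirical_entropy(["A", "B", "C", "D"] * 3)
```

Large breadth keeps the model closer to the second case: varied problems in every update, so no single trick dominates.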

4. The Grand Finale: The Synergy

The paper's biggest discovery is that Depth and Breadth are best friends, not enemies.

  • Depth (DARS) makes the model smart enough to solve the hardest problems if given enough tries (Pass@K).
  • Breadth (Large Batch) makes the model reliable enough to solve problems quickly on the first try (Pass@1).

When you combine them (DARS-Breadth), you get a student who is both a deep thinker (can solve complex riddles) and a fast worker (gets it right immediately).

Summary of the "Magic Formula"

  1. Don't be fair to everyone: Give more practice time to the problems the student is struggling with (Depth).
  2. Don't be too narrow: Practice a huge variety of problems at once to keep the student's mind fresh and creative (Breadth).
  3. Combine them: Use the smart tutor (DARS) on a massive practice schedule (Breadth).

The Result: The AI doesn't just get slightly better; it unlocks a new level of reasoning capability, solving math problems that were previously impossible, and doing so more reliably than ever before. It's like turning a smart calculator into a genius mathematician.
