IsoCompute Playbook: Optimally Scaling Sampling Compute for LLM RL

This paper establishes compute-optimal scaling laws for on-policy LLM reinforcement learning. It shows that the ideal number of parallel rollouts per problem grows predictably with the compute budget before saturating, driven by solution sharpening on easy tasks and coverage expansion on hard ones, and it provides practical allocation rules for batch size and update steps to maximize training efficiency.

Zhoujun Cheng, Yutao Xie, Yuxiao Qu, Amrith Setlur, Shibo Hao, Varad Pimpalkhute, Tongtong Liang, Feng Yao, Zhengzhong Liu, Eric Xing, Virginia Smith, Ruslan Salakhutdinov, Zhiting Hu, Taylor Killian, Aviral Kumar

Published 2026-03-13

Imagine you are the coach of a team of AI students (Large Language Models) preparing for a massive, high-stakes exam. You have a limited amount of computing power (think of this as your "budget" of energy, time, and money) to help them study.

The big question this paper answers is: How should you spend that budget to get the best results?

Should you have them study 100 different questions once? Or just 5 questions, repeated 1,000 times? Or 50 questions, 20 times each?

The authors of this paper, "IsoCompute Playbook," ran thousands of experiments to find the perfect recipe. Here is the breakdown in simple terms.

The Three Ways to Spend Your Budget

The researchers identified three main ways to use your computing power:

  1. Parallel Rollouts (n): How many times does the AI try to solve the same question at once? (Like having 10 students all try to solve the same math problem simultaneously).
  2. Batch Size (B_p): How many different questions do you give the AI in one go? (Like handing out a worksheet with 50 different problems).
  3. Iterations (M): How many times do you repeat the whole study session? (Like going over the same worksheet 10 times).

The total budget is simply: Questions × Attempts per Question × Number of Sessions.
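The budget identity above can be sketched in a few lines. This is only an illustration of the accounting, using the paper's notation (n, B_p, M); the specific numbers are made up to show that very different study strategies can cost exactly the same.

```python
def total_samples(n: int, B_p: int, M: int) -> int:
    """Total rollouts generated: attempts per question (n)
    x questions per batch (B_p) x study sessions (M)."""
    return n * B_p * M

# Three ways to spend the same budget of 10,000 rollouts:
print(total_samples(n=10,  B_p=100, M=10))  # many questions, few tries each
print(total_samples(n=100, B_p=10,  M=10))  # few questions, many tries each
print(total_samples(n=20,  B_p=50,  M=10))  # somewhere in between
```

All three calls return 10,000: the playbook's question is not how much to spend, but how to split a fixed total between these three knobs.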

The Golden Rule: "More Attempts, Fewer Questions"

The most surprising finding is that as you get more budget, you shouldn't just give the AI more questions. Instead, you should make it try harder on the questions it already has.

  • Low Budget: If you are broke (low compute), you should give the AI a wide variety of questions (many different problems) but let it try each one only a few times. This is like a "shotgun approach"—trying to hit something by covering a lot of ground.
  • High Budget: As you get richer (high compute), you should stop giving it new questions and instead force it to retry the same questions over and over until it gets them perfect. This is like a "sniper approach"—focusing intensely on a few targets to master them.

Why?

  • On Easy Problems: If the AI can already solve the problem 80% of the time, trying it 100 times helps it figure out the remaining 20% and become 100% perfect. It "sharpens" the answer.
  • On Hard Problems: If the problem is incredibly difficult, the AI might only solve it 1% of the time. Trying it 1,000 times increases the odds that it will finally stumble upon the one correct solution. It "expands coverage."
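The arithmetic behind both bullets is the same. If the model solves a problem with per-attempt probability p, the chance that at least one of n independent attempts succeeds is 1 − (1 − p)^n. Treating attempts as independent is a simplification of real rollouts, but it makes the easy-vs-hard contrast concrete:

```python
def coverage(p: float, n: int) -> float:
    """Probability that at least one of n independent attempts succeeds,
    given a per-attempt solve rate p."""
    return 1 - (1 - p) ** n

# Easy problem (p = 0.8): a handful of tries is already near-certain.
print(coverage(0.8, 5))

# Hard problem (p = 0.01): one try almost never works,
# but ~100 tries make finding a correct solution more likely than not.
print(coverage(0.01, 1))
print(coverage(0.01, 100))
```

For the easy problem, extra attempts mostly "sharpen" an answer the model nearly has; for the hard one, they "expand coverage" until a correct solution is stumbled upon at all.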

The "Traffic Jam" Analogy (Why not just study more questions?)

You might think, "Why not just give the AI 1,000 different questions and let it solve each one once?"

The paper explains that when you train an AI on many different problems at once, the problems start to interfere with each other. It's like a traffic jam. If the AI tries to learn how to solve a math problem and a coding problem at the exact same time, the lessons can get mixed up, and it might forget how to do the math while learning the code.

By focusing on fewer questions but trying them many times, you avoid this traffic jam. The AI gets a clear, strong signal on what to learn without getting confused by too many different topics at once.

The "Stability Knob"

There is one more variable to tune: the batch size, B_p (how many different questions you throw at the AI at once).

The paper found that this is like a volume knob on a stereo.

  • If you turn it too low (too few questions), the learning signal gets noisy and the AI can get stuck.
  • If you turn it too high (too many questions), it gets confused (the traffic jam mentioned above).
  • The Sweet Spot: As long as you keep the volume in a "moderate" range, it doesn't matter much exactly where it is. The real magic comes from how many times you retry the questions (the Parallel Rollouts).

The "Recipe" for Success

So, if you are a practitioner trying to train an AI today, here is the cheat sheet:

  1. Start with a "Healthy" Setup: Make sure your AI isn't too stressed (too hard) or too relaxed (too easy). Adjust your training rules based on whether the problems are easy or hard.
  2. Don't Just Add More Data: If you have more computing power, don't just buy more textbooks. Instead, make your students study the current textbook more deeply.
  3. The Shift:
    • Small Budget: Give many questions, few tries.
    • Big Budget: Give fewer questions, many tries.
  4. Watch Out for Overfitting: If your textbook is too small (not enough unique questions), studying it too much will make the AI memorize the answers rather than learning the concepts. If you have a small dataset, don't try to study it too deeply.
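The shift in step 3 can be sketched as a toy allocation rule. To be clear, the growth curve and the saturation cap below are hypothetical placeholders, not the paper's fitted scaling law; the sketch only illustrates the qualitative recipe that attempts per question (n) should grow with the budget until it saturates, with the remainder spent on distinct questions (B_p).

```python
def allocate(budget: int, n_cap: int = 256) -> tuple[int, int]:
    """Split a per-step rollout budget into (n attempts, B_p questions).

    n grows with the budget until saturating at n_cap (both the
    square-root growth and the cap of 256 are illustrative guesses);
    the remaining budget buys distinct questions.
    """
    n = min(n_cap, max(1, int(budget ** 0.5)))  # hypothetical growth rule
    B_p = max(1, budget // n)
    return n, B_p

for budget in (100, 10_000, 1_000_000):
    n, B_p = allocate(budget)
    print(f"budget={budget:>9}: n={n:>3} attempts x B_p={B_p:>5} questions")
```

At small budgets the rule spreads compute across questions; at large budgets n hits its cap and every additional rollout goes into distinct questions again, matching the "grows, then saturates" shape described above.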

Summary

Think of training an AI like training for a marathon.

  • Old Way: Run 100 different short sprints.
  • New Way (The Paper's Advice): Run 10 different sprints, but run each one 50 times until your form is perfect.

As you get more energy (compute), you stop running new sprints and start perfecting the ones you know. This "IsoCompute Playbook" tells you exactly how to balance that effort to get the fastest time possible.