Reject, Resample, Repeat: Understanding Parallel Reasoning in Language Model Inference

This paper introduces a particle filtering framework to rigorously analyze the accuracy-cost tradeoffs of parallel inference methods in large language models, establishing theoretical guarantees and identifying fundamental limits while demonstrating that sampling error alone does not fully predict final model accuracy.

Noah Golowich, Fan Chen, Dhruv Rohatgi, Raghav Singhal, Carles Domingo-Enrich, Dylan J. Foster, Akshay Krishnamurthy

Published Tue, 10 Ma

Imagine you are trying to solve a very difficult riddle, like a complex math problem or a tricky logic puzzle. You have a smart friend (the Large Language Model, or LLM) who is good at talking, but sometimes they ramble, get stuck, or give you a wrong answer.

In the past, if you wanted a better answer, you might ask your friend to write the solution 32 times and then pick the best one. This is called "Best-of-N." It works, but it's wasteful. You might generate 31 bad answers just to find one good one.
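The Best-of-N recipe fits in a few lines. In this sketch, `generate_answer` and `score_answer` are hypothetical stand-ins for the LLM and the verifier (not real APIs); each "answer" is reduced to a toy quality number so the control flow is the only thing on display:

```python
import random

def generate_answer(rng):
    # Hypothetical stand-in for sampling one complete solution from an LLM.
    # In this toy model, an "answer" is just a quality score in [0, 1].
    return rng.random()

def score_answer(answer):
    # Hypothetical verifier / reward model: higher is better.
    return answer

def best_of_n(n, seed=0):
    """Sample n complete answers independently, then keep the best-scoring one."""
    rng = random.Random(seed)
    answers = [generate_answer(rng) for _ in range(n)]
    return max(answers, key=score_answer)

print(best_of_n(32))
```

Note that all `n` answers are generated to completion before any scoring happens, which is exactly the wastefulness the paper targets.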

This paper analyzes a smarter approach called Sequential Monte Carlo (SMC), which is basically a fancy way of saying "Try, Check, and Prune."

Here is the simple breakdown of what the researchers did, using some everyday analogies.

1. The Problem: The "Bad GPS"

Imagine you are driving a car (the LLM) to a destination (the correct answer). You have a GPS (the Process Reward Model or PRM) that tells you how close you are to the goal at every turn.

  • The Catch: Your GPS is imperfect. Sometimes it says you are close when you are actually far away, or vice versa.
  • The Old Way (Best-of-N): You drive 32 different routes from start to finish. At the end, you look at all 32 destinations and pick the one that looks most like the goal. You wasted a lot of gas driving the 31 wrong routes all the way to the end.
  • The New Way (SMC): You start driving 32 routes. Every few miles, you check the GPS. If a route looks like it's going off a cliff (based on the GPS score), you stop that car immediately and send a new car down a different path. You keep the "good" cars and kill the "bad" ones early.
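The "drive, check the GPS, prune" loop above is exactly a particle filter. Here is a minimal sketch in which `extend` (one generation step) and `prm_score` (the GPS) are invented toy functions, not the paper's actual models; a partial answer is just a running number, and the "goal" is the value 10.0:

```python
import math
import random

def extend(state, rng):
    # Hypothetical one-step continuation of a partial answer by the LLM.
    return state + rng.gauss(0.0, 1.0)

def prm_score(state):
    # Hypothetical process reward model: scores closeness to the goal (10.0).
    return math.exp(-abs(10.0 - state) / 5.0)

def smc(n=32, steps=8, seed=0):
    """Run n partial answers in parallel; prune and clone at every checkpoint."""
    rng = random.Random(seed)
    states = [0.0] * n
    for _ in range(steps):
        states = [extend(s, rng) for s in states]           # drive a few miles
        weights = [prm_score(s) for s in states]            # check the GPS
        states = rng.choices(states, weights=weights, k=n)  # keep good cars, clone them
    return states

print(sorted(smc())[-1])  # the best surviving "car"
```

The resampling line is the pruning: a car with a low GPS score is unlikely to be chosen again, while a promising car may be cloned several times.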

2. The Theory: Why Does It Work?

The researchers asked: "How do we know this 'Try, Check, and Prune' method actually works, and when does it fail?"

They found two main rules that determine success:

  • Rule #1: The "No Dead Ends" Rule (Action-Level Coverage).
    Imagine your GPS is so bad that it tells you to turn left when the only way forward is right. If the GPS is completely disconnected from reality, no amount of pruning will help. The paper proves that as long as the GPS isn't totally lying (it still gives some hint of the right direction), the method works.
  • Rule #2: The "Average GPS" Rule (Divergence).
    Even if the GPS is a bit noisy, if it's "mostly right" on average, the method can still find the destination. The researchers created a mathematical formula to predict exactly how many "cars" (samples) you need to keep to get a good answer based on how noisy your GPS is.

3. The Surprise: When the GPS is Perfect

You might think, "If my GPS is perfect, I don't need 32 cars; I just need one!"

  • The Reality: Surprisingly, even with a perfect GPS, the standard "Try, Check, and Prune" method (SMC) still needs many cars to work well. The resampling step itself injects extra randomness, so the estimates stay noisy even when the GPS is exact.
  • The Fix: The authors invented a new version called SMC-RS. Think of this as a "Super-Pruner." It doesn't just check the cars; it uses a special rejection technique that ensures even with a perfect GPS, you don't need to waste resources. It fixes the "confusion" of the standard method.
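To make the "special rejection technique" concrete, here is a generic textbook rejection resampler: propose a particle uniformly at random and accept it with probability proportional to its weight. This sketch is an assumption for illustration only; the paper's actual SMC-RS procedure differs in its details:

```python
import random

def rejection_resample(states, weights, k, rng):
    """Draw k states with probability proportional to weights, via accept/reject.

    Generic rejection-sampling sketch; NOT the paper's exact SMC-RS algorithm.
    """
    w_max = max(weights)
    out = []
    while len(out) < k:
        i = rng.randrange(len(states))            # propose a particle uniformly
        if rng.random() < weights[i] / w_max:     # accept with prob w_i / w_max
            out.append(states[i])
    return out

rng = random.Random(0)
states = ["path-A", "path-B", "path-C"]
weights = [0.1, 0.1, 0.8]  # the GPS strongly prefers path-C
sample = rejection_resample(states, weights, 1000, rng)
print(sample.count("path-C") / 1000)  # roughly 0.8
```

The accept/reject step draws each survivor exactly in proportion to its weight, which is the flavor of fix the authors use to remove the standard method's extra noise.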

4. The Limit: The "Myopic" Problem

The paper also discovered a fundamental limit.
Imagine you are playing a game where you have to guess a secret code. You can only look at one letter at a time.

  • The Limit: If you are "myopic" (short-sighted), meaning you only look at the current letter and don't plan ahead, you will eventually get stuck no matter how many guesses you make.
  • The Lesson: To solve very long, complex problems efficiently, you can't just react to the present moment; you need a strategy that looks slightly further ahead. The paper proves that without this "foresight," you will always need an exponentially growing number of attempts as the problem gets longer.

5. The Real-World Test

The researchers tested this on Math Problems (like the AIME and Math500 benchmarks).

  • The Result: The "Try, Check, and Prune" method (SMC) consistently beat the "Best-of-N" method. It solved more problems correctly.
  • The Twist: Interestingly, they found that having a "perfect" GPS didn't always mean the best results. Sometimes, a slightly "noisy" GPS that was very strict about killing bad ideas early worked better than a "perfect" one that was too lenient. This suggests that for math problems, being decisive (killing bad paths fast) might be more important than being perfectly accurate.

Summary

This paper is like a manual for managing a team of explorers trying to find a hidden treasure.

  • Old way: Send everyone to the end of the world and pick the one who found the treasure.
  • New way (SMC): Send everyone out, but if they hit a wall, send them back and try a different path immediately.
  • The Science: The paper gives you the math to know exactly how many explorers you need and how good your map (GPS) needs to be to ensure you find the treasure without running out of food.

They proved that this method is mathematically sound, fixed some of its flaws, and showed that in the real world (solving math problems), it is a superior way to get smart AI models to think harder and better.