Reject, Resample, Repeat: Understanding Parallel Reasoning in Language Model Inference

This paper introduces a particle filtering framework to rigorously analyze the accuracy-cost tradeoffs of parallel inference methods in large language models, establishing theoretical guarantees and identifying fundamental limits while demonstrating that sampling error alone does not fully predict final model accuracy.

Noah Golowich, Fan Chen, Dhruv Rohatgi, Raghav Singhal, Carles Domingo-Enrich, Dylan J. Foster, Akshay Krishnamurthy

Published Tue, 10 Ma

Imagine you are trying to solve a very difficult riddle, like a complex math problem or a tricky logic puzzle. You have a smart friend (the Large Language Model, or LLM) who is good at talking, but sometimes they ramble, get stuck, or give you a wrong answer.

In the past, if you wanted a better answer, you might ask your friend to write the solution 32 times and then pick the best one. This is called "Best-of-N." It works, but it's wasteful. You might generate 31 bad answers just to find one good one.
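The Best-of-N recipe fits in a few lines. In this sketch, `generate_answer` and `score_answer` are hypothetical stand-ins for the LLM and the verifier (not real APIs); each "answer" is reduced to a toy quality number so the control flow is the only thing on display:

```python
import random

def generate_answer(rng):
    # Hypothetical stand-in for sampling one complete solution from an LLM.
    # In this toy model, an "answer" is just a quality score in [0, 1].
    return rng.random()

def score_answer(answer):
    # Hypothetical verifier / reward model: higher is better.
    return answer

def best_of_n(n, seed=0):
    """Sample n complete answers independently, then keep the best-scoring one."""
    rng = random.Random(seed)
    answers = [generate_answer(rng) for _ in range(n)]
    return max(answers, key=score_answer)

print(best_of_n(32))
```

Note that all `n` answers are generated to completion before any scoring happens, which is exactly the wastefulness the paper targets.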

This paper analyzes a smarter approach called Sequential Monte Carlo (SMC), which is basically a fancy way of saying "Try, Check, and Prune."

Here is the simple breakdown of what the researchers did, using some everyday analogies.

1. The Problem: The "Bad GPS"

Imagine you are driving a car (the LLM) to a destination (the correct answer). You have a GPS (the Process Reward Model or PRM) that tells you how close you are to the goal at every turn.

  • The Catch: Your GPS is imperfect. Sometimes it says you are close when you are actually far away, or vice versa.
  • The Old Way (Best-of-N): You drive 32 different routes from start to finish. At the end, you look at all 32 destinations and pick the one that looks most like the goal. You wasted a lot of gas driving the 31 wrong routes all the way to the end.
  • The New Way (SMC): You start driving 32 routes. Every few miles, you check the GPS. If a route looks like it's going off a cliff (based on the GPS score), you stop that car immediately and send a new car down a different path. You keep the "good" cars and kill the "bad" ones early.
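The "drive, check the GPS, prune" loop above is exactly a particle filter. Here is a minimal sketch in which `extend` (one generation step) and `prm_score` (the GPS) are invented toy functions, not the paper's actual models; a partial answer is just a running number, and the "goal" is the value 10.0:

```python
import math
import random

def extend(state, rng):
    # Hypothetical one-step continuation of a partial answer by the LLM.
    return state + rng.gauss(0.0, 1.0)

def prm_score(state):
    # Hypothetical process reward model: scores closeness to the goal (10.0).
    return math.exp(-abs(10.0 - state) / 5.0)

def smc(n=32, steps=8, seed=0):
    """Run n partial answers in parallel; prune and clone at every checkpoint."""
    rng = random.Random(seed)
    states = [0.0] * n
    for _ in range(steps):
        states = [extend(s, rng) for s in states]           # drive a few miles
        weights = [prm_score(s) for s in states]            # check the GPS
        states = rng.choices(states, weights=weights, k=n)  # keep good cars, clone them
    return states

print(sorted(smc())[-1])  # the best surviving "car"
```

The resampling line is the pruning: a car with a low GPS score is unlikely to be chosen again, while a promising car may be cloned several times.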

2. The Theory: Why Does It Work?

The researchers asked: "How do we know this 'Try, Check, and Prune' method actually works, and when does it fail?"

They found two main rules that determine success:

  • Rule #1: The "No Dead Ends" Rule (Action-Level Coverage).
    Imagine your GPS is so bad that it tells you to turn left when the only way forward is right. If the GPS is completely disconnected from reality, no amount of pruning will help. The paper proves that as long as the GPS isn't totally lying (it still gives some hint of the right direction), the method works.
  • Rule #2: The "Average GPS" Rule (Divergence).
    Even if the GPS is a bit noisy, if it's "mostly right" on average, the method can still find the destination. The researchers created a mathematical formula to predict exactly how many "cars" (samples) you need to keep to get a good answer based on how noisy your GPS is.

3. The Surprise: When the GPS is Perfect

You might think, "If my GPS is perfect, I don't need 32 cars; I just need one!"

  • The Reality: Surprisingly, even with a perfect GPS, the standard "Try, Check, and Prune" method (SMC) still needs many cars to work well. The resampling step itself injects extra randomness, so the estimates stay noisy even when the GPS is exact.
  • The Fix: The authors invented a new version called SMC-RS. Think of this as a "Super-Pruner." It doesn't just check the cars; it uses a special rejection technique that ensures even with a perfect GPS, you don't need to waste resources. It fixes the "confusion" of the standard method.
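To make the "special rejection technique" concrete, here is a generic textbook rejection resampler: propose a particle uniformly at random and accept it with probability proportional to its weight. This sketch is an assumption for illustration only; the paper's actual SMC-RS procedure differs in its details:

```python
import random

def rejection_resample(states, weights, k, rng):
    """Draw k states with probability proportional to weights, via accept/reject.

    Generic rejection-sampling sketch; NOT the paper's exact SMC-RS algorithm.
    """
    w_max = max(weights)
    out = []
    while len(out) < k:
        i = rng.randrange(len(states))            # propose a particle uniformly
        if rng.random() < weights[i] / w_max:     # accept with prob w_i / w_max
            out.append(states[i])
    return out

rng = random.Random(0)
states = ["path-A", "path-B", "path-C"]
weights = [0.1, 0.1, 0.8]  # the GPS strongly prefers path-C
sample = rejection_resample(states, weights, 1000, rng)
print(sample.count("path-C") / 1000)  # roughly 0.8
```

The accept/reject step draws each survivor exactly in proportion to its weight, which is the flavor of fix the authors use to remove the standard method's extra noise.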

4. The Limit: The "Myopic" Problem

The paper also discovered a fundamental limit.
Imagine you are playing a game where you have to guess a secret code. You can only look at one letter at a time.

  • The Limit: If you are "myopic" (short-sighted), meaning you only look at the current letter and don't plan ahead, you will eventually get stuck no matter how many guesses you make.
  • The Lesson: To solve very long, complex problems efficiently, you can't just react to the present moment; you need a strategy that looks slightly further ahead. The paper proves that without this "foresight," you will always need an exponentially growing number of attempts as the problem gets longer.

5. The Real-World Test

The researchers tested this on Math Problems (like the AIME and Math500 benchmarks).

  • The Result: The "Try, Check, and Prune" method (SMC) consistently beat the "Best-of-N" method. It solved more problems correctly.
  • The Twist: Interestingly, they found that having a "perfect" GPS didn't always mean the best results. Sometimes, a slightly "noisy" GPS that was very strict about killing bad ideas early worked better than a "perfect" one that was too lenient. This suggests that for math problems, being decisive (killing bad paths fast) might be more important than being perfectly accurate.

Summary

This paper is like a manual for managing a team of explorers trying to find a hidden treasure.

  • Old way: Send everyone to the end of the world and pick the one who found the treasure.
  • New way (SMC): Send everyone out, but if they hit a wall, send them back and try a different path immediately.
  • The Science: The paper gives you the math to know exactly how many explorers you need and how good your map (GPS) needs to be to ensure you find the treasure without running out of food.

They proved that this method is mathematically sound, fixed some of its flaws, and showed that in the real world (solving math problems), it is a superior way to get smart AI models to think harder and better.