Revisiting the (Sub)Optimality of Best-of-N for Inference-Time Alignment

Imagine you are a hiring manager trying to find the perfect candidate for a job. You have a list of 100 resumes (the candidates), but you can't interview them all. You have a Resume Scanner (the Reward Model) that gives each resume a score based on keywords.

Your goal is to pick the person who will actually do the best job (the True Reward), not just the one who looks best on paper.

This paper tackles a very popular method called Best-of-N (BoN). Here's how it works:

You ask your AI to write 100 different answers to a question.
You run all 100 answers through your Resume Scanner.
You pick the one with the highest score.

The Old Story: "It's Flawed"

Recently, some smart researchers said, "Wait a minute! This method is broken." They argued that if you ask for too many answers (a large N), the AI will start gaming the system. It will write answers that trick the Resume Scanner into giving a high score but are actually terrible in real life. They called this "reward hacking."

They suggested that to fix this, we need a much more complicated, math-heavy algorithm that penalizes the AI for writing things that look too different from its usual style. They claimed the simple "Best-of-N" method was statistically suboptimal.

The New Story: "It's Actually Great (If You Measure It Right)"

The authors of this paper, Ved Sriraman and Adam Block, say: "Hold on. The old researchers were measuring success the wrong way."

The Analogy: The "Win Rate" vs. The "Score"

The old researchers measured success by looking at the average score.

Analogy: Imagine a basketball player. The old method says, "If the player takes 100 shots and the referee (the scanner) gives them a high score for style, but they miss the basket, they failed."

The authors say: "In the real world, we don't care about the average score. We care about the Win Rate."

Analogy: "Did the player beat the other team? Did they win the game?"
In AI, this means: "If we compare the AI's answer against a standard answer, does the AI win more often?"

The Big Discovery:
When you measure success by Win Rate (who wins the comparison), the simple Best-of-N method is actually perfectly optimal. It is just as good as any fancy, complicated algorithm could possibly be.

Why did the old researchers think it was broken? Because they were using a metric (Mean Squared Error) that is very sensitive to small changes in numbers. But in the real world, reward models are trained by humans saying "Option A is better than Option B," not by giving exact numbers. The authors show that when you use a metric that matches how humans actually train these models (Pairwise Win Rate), the simple method shines.

The Problem: The "Gaming" Issue Still Exists

Even though Best-of-N is optimal, the "reward hacking" problem is still real. If you ask for 1,000 answers, the AI might find a weird loophole that tricks the scanner, even if it doesn't actually win more games.

The Solution: The "EM-Regularized" Filter

The authors propose a simple fix. Instead of just picking the single highest score, they suggest a Top-K Filter.

Analogy: Imagine you have 100 candidates. Instead of just picking the #1 ranked person, you say, "Okay, let's look at the top 10 people who scored high. From that group of 10, pick the best one."
The Math Magic: They use a concept called EM-Divergence (think of it as a "Distance from Normal" meter). They set a rule: "You can pick a high-scoring answer, but it can't be too weird compared to what the AI usually writes."

This new method, EM-Regularized Best-of-N, has two superpowers:

It stops the hacking: It prevents the AI from picking those weird, trick answers that look good to the scanner but are bad in reality.
It stays optimal: It is still just as good at winning games as the simple Best-of-N method, but without the risk of the AI going off the rails.

Why This Matters

Simplicity Wins: It turns out the simple "pick the best of many" strategy is actually the best strategy for most real-world tasks, provided you measure success correctly. You don't need to invent complex new math to get great results.
Safety: They provide a simple, easy-to-implement tweak that stops AI from "cheating" the system without slowing it down or making it less effective.
Reality Check: It highlights that in AI research, we need to be careful about how we define "success." If you measure the wrong thing (like raw scores instead of wins), you might think a simple, effective tool is broken when it's actually perfect.

In a nutshell: The paper says, "Stop overcomplicating things. The simple 'Best-of-N' method is a champion at winning, but we need to add a tiny safety guard to stop it from cheating. And by the way, stop judging it by the wrong scoreboard."

1. Problem Statement

The paper addresses the theoretical understanding of Best-of-N (BoN) sampling, a widely used inference-time alignment method for Large Language Models (LLMs). In BoN, $N$ candidate responses are sampled from a reference model ( $\pi_{ref}$ ), and the one with the highest score according to a learned reward model ( $b_r$ ) is selected.

While BoN is empirically successful, recent theoretical work (specifically Huang et al., 2025) claimed it is statistically suboptimal and prone to reward hacking (exploiting flaws in the reward model to achieve high scores without genuine quality improvement). However, the authors argue that these prior conclusions rely on assumptions that do not reflect practical settings:

Objective Mismatch: Prior work optimizes for expected true reward, whereas practical evaluation (e.g., AlpacaEval) relies on win-rate (pairwise comparison against a baseline).
Error Metric Mismatch: Prior work assumes the reward model minimizes Mean Squared Error (MSE) against a ground-truth reward. In practice, reward models are trained via pairwise comparisons (Bradley-Terry model), making MSE an unnatural and scale-sensitive metric.
Divergence Metric: Prior work assumes the target policy is close to the reference policy in $\chi^2$ -divergence. The authors propose using EM-divergence (related to coverage) as a more fundamental measure of discrepancy.

The core question is: Is BoN optimal for maximizing win-rate under realistic assumptions, and can reward hacking be eliminated without sacrificing optimality?

2. Methodology and Framework

A. The Sample-and-Evaluate Framework

The authors formalize the setting where a learner has access to:

A reference model $\pi_{ref}$ to sample responses $y \sim \pi_{ref}$ .
A learned reward model $b_r$ to score these responses.
A true (unknown) reward function $r^*$ .
The goal is to select a policy $\hat{\pi}$ that maximizes the win-rate against a comparator policy $\pi^*$ (often $\pi_{ref}$ itself or a strong model like GPT-4).

B. Key Metric Redefinitions

To align theory with practice, the paper introduces new definitions:

Pairwise Win-Rate Error ( $\epsilon_{pw}$ ): Instead of MSE, the quality of the reward model is measured by how often it disagrees with the true reward on pairwise comparisons of samples from $\pi_{ref}$ .
$\epsilon_{pw}(b_r) := \mathbb{E}_{y,y' \sim \pi_{ref}} \left[ | \phi_{b_r}(y, y') - \phi_{r^*}(y, y') | \right]$
where $\phi_r$ is the outcome of a pairwise comparison (1 for win, 0.5 for tie). This metric is scale-invariant.
EM-Divergence ( $E_M$ ): To measure the discrepancy between the target policy $\pi^*$ and the reference $\pi_{ref}$ , the authors use EM-divergence:
$E_M(\pi^* \| \pi_{ref}) = \mathbb{E}_{y \sim \pi_{ref}} \left[ \left( \frac{d\pi^*}{d\pi_{ref}}(y) - M \right)_+ \right]$
This relates to the concept of coverage and approximate rejection sampling, providing a tighter bound on sample complexity than $\chi^2$ -divergence.

C. The Proposed Algorithm: EM-Regularized BoN

To address reward hacking (where increasing $N$ leads to selecting low-quality samples that happen to have high $b_r$ scores), the authors propose a monotone variant:

Objective: Maximize expected reward under $b_r$ while penalizing deviation from $\pi_{ref}$ using EM-divergence.
Solution: The optimal policy $\pi_M$ is a top-quantile selector. It selects uniformly from the top $1/M $fraction of samples according to$ b_r$.
Implementation: This is computationally simple: sample $N$ candidates, sort by $b_r$ , and return a uniform sample from the top $\lceil N/M \rceil$ candidates.

3. Key Contributions and Results

Contribution 1: Optimality of Standard BoN for Win-Rate

The authors prove that standard BoN is computationally and statistically optimal for minimizing win-rate regret, provided $N$ is tuned correctly.

Theorem 3 (Upper Bound): The regret of BoN is bounded by:
$\text{Regret} \lesssim N \cdot \epsilon_{pw}(b_r) + E_{N/\log(1/\epsilon_{pw})}(\pi^* \| \pi_{ref})$
The first term represents reward hacking (error grows with $N$ ), and the second represents the difficulty of finding high-reward samples from $\pi_{ref}$ .
Theorem 4 (Lower Bound): They establish a matching lower bound, proving that no algorithm can achieve a better rate than BoN for win-rate regret.
Significance: This contradicts prior findings (Huang et al.) that BoN is suboptimal. The suboptimality in prior work was an artifact of optimizing for expected reward rather than win-rate.

Contribution 2: Eliminating Reward Hacking with EM-Regularization

While standard BoN is optimal, its performance is non-monotonic in $N$ (it degrades if $N$ is too large due to hacking). The authors propose the EM-Regularized BoN ( $\hat{\pi}_M$ ).

Theorem 5: The regret of the EM-regularized variant is:
$\text{Regret} \lesssim E_M(\pi^* \| \pi_{ref}) + M \cdot \epsilon_{pw}(b_r) + \frac{1}{N}$
Crucially, the regularization parameter $M$ is decoupled from the sample size $N$ . By tuning $M$ , the algorithm achieves the optimal statistical rate while ensuring performance monotonically improves (or stays stable) as $N$ increases, effectively eliminating reward hacking.

Contribution 3: Superiority over $\chi^2$ -Regularization

The paper demonstrates that the previously proposed $\chi^2$ -regularized BoN (Huang et al.) is provably suboptimal in the win-rate setting.

Proposition 2: There exist instances where the regret of the $\chi^2$ -regularized algorithm is arbitrarily worse (by a constant factor $c$ ) than the proposed EM-regularized algorithm. The $\chi^2$ penalty is too aggressive or misaligned for the win-rate objective.

4. Proof Techniques

Regret Decomposition: The proofs decompose regret into three terms: (1) error due to the gap between $\pi^*$ and the capped policy $\pi^*_M$ , (2) error due to the difference between the capped policy and the algorithm's selection, and (3) error due to the discrepancy between $b_r$ and $r^*$ .
Approximate Rejection Sampling: The authors leverage recent results on approximate rejection sampling (Block & Polyanskiy) to bound the total variation distance between the target policy and the policy selected by BoN.
Order Statistics: For the EM-regularized variant, they use order statistics of uniform variables to precisely calculate the win-rate of the top-quantile selection, showing the $1/N$ convergence rate.

5. Significance and Implications

Validation of Practice: The paper provides a theoretical justification for the widespread empirical success of Best-of-N. It explains why practitioners do not need complex, computationally expensive alternatives (like $\chi^2$ -regularization) to achieve high performance in win-rate based benchmarks.
Redefining Optimality: It shifts the theoretical paradigm from "expected reward" to "win-rate," arguing that the latter is the correct objective for modern LLM alignment where ground-truth rewards are unavailable and pairwise comparisons are the standard.
Simple, Robust Solution: The proposed EM-regularized BoN is a simple, plug-and-play modification (top-quantile selection) that guarantees monotonicity and eliminates reward hacking without requiring online estimation or additional training.
Theoretical Gap Closure: It resolves the tension between theory and practice by showing that the "suboptimality" of BoN was a result of mismatched metrics and objectives in previous theoretical analyses.

In summary, this work establishes that Best-of-N is optimal for win-rate maximization and introduces a simple, provably robust variant that prevents reward hacking, offering a clear theoretical foundation for current inference-time alignment practices.