Best-of-Tails: Bridging Optimism and Pessimism in Inference-Time Alignment

Imagine you are a hiring manager trying to find the perfect candidate for a job. You have a resume screening tool (the Reward Model) that gives every applicant a score. You also have a pool of candidates generated by an AI (the Reference Model).

The goal is to pick the absolute best candidate. But here's the catch: your screening tool isn't perfect. Sometimes it gives a high score to a candidate who looks great on paper but is actually terrible at the job (this is called Reward Hacking). Other times, it misses a genius candidate because their resume is weirdly formatted.

This paper tackles a specific problem: How do you pick the best candidate when you have to generate many options and your scoring tool is flawed?

The Two Extreme Approaches (The Old Ways)

Currently, people use two main strategies, both of which have big flaws:

The "Optimist" (Best-of-N):
- The Strategy: "Generate 100 candidates, look at their scores, and pick the one with the highest score!"
- The Analogy: Imagine a casino. The Optimist thinks, "If I play the slot machine 100 times, I'll eventually hit the jackpot."
- The Problem: If the slot machine is rigged (the reward model is flawed), the Optimist will keep hitting the "fake jackpot" (reward hacking). They pick the candidate who looks the best to the machine, but is actually a fraud.
The "Pessimist" (Regularized/Conservative):
- The Strategy: "Don't trust the high scores too much. Stick closer to the average candidates to be safe."
- The Analogy: This is like a cautious investor who refuses to buy any stock that has gone up too fast, fearing a crash.
- The Problem: They are so afraid of the "fake jackpot" that they miss out on the real geniuses. They play it so safe they never find the truly amazing candidate.

The Big Discovery: It Depends on the "Shape" of the Scores

The authors realized that the right strategy depends on the shape of the distribution of the scores. They call this the "Tail Behavior."

Light Tail (The "Needle in a Haystack"):
- Scenario: High scores are very rare. Most candidates are average, and only a few are truly great.
- Analogy: Finding a diamond in a pile of dirt.
- Best Strategy: You need to be an Optimist. You must dig deep and pick the highest score, because the "fake" high scores aren't that common. If you are too conservative, you'll never find the diamond.
Heavy Tail (The "Wild West"):
- Scenario: There are many candidates with extremely high scores, but many of them are fakes or flukes.
- Analogy: A carnival game where the machine is broken and gives out huge prizes to almost everyone, but most prizes are worthless.
- Best Strategy: You need to be a Pessimist. If you just pick the highest score, you'll get ripped off. You need to be skeptical and stick to the safer, more reliable options.

The Dilemma: In the real world, we don't know in advance whether a specific prompt (job description) is a "Needle in a Haystack" or a "Broken Carnival Game." If we use a fixed strategy (always Optimist or always Pessimist), we will fail on whichever kind of prompt that strategy does not suit.

The Solution: "Best-of-Tails" (BoT)

The authors propose a new framework called Best-of-Tails (BoT). Think of this as a Smart Hiring Manager with a "Lie Detector."

Here is how BoT works, step-by-step:

Sample the Crowd: First, it generates a bunch of candidates (say, 100) and gets their scores.
Check the "Tail" (The Lie Detector): Before picking a winner, it looks at the top scores. It asks: "Are these top scores a rare, genuine spike (Light Tail), or are there too many high scores that look suspicious (Heavy Tail)?"
- It uses a statistical tool called the Hill Estimator (a fancy math way of measuring how "heavy" the tail is) to figure this out instantly.
Switch Modes:
- If the tail is Light: It switches to Optimist Mode. It aggressively picks the highest-scoring candidate because the risk of a fake high score is low.
- If the tail is Heavy: It switches to Pessimist Mode. It ignores the extreme outliers and picks a candidate that is high-scoring but more "average" and safe, avoiding the trap of the broken machine.
The Magic Glue (Tsallis Divergence): To make this switch smooth, they use a mathematical tool called Tsallis Divergence. Imagine this as a dimmer switch.
- At one end, it's pure Optimism (KL Divergence).
- At the other end, it's pure Pessimism (Chi-Squared Divergence).
- BoT turns the dimmer switch to the exact right setting based on what it saw in step 2.

Why This Matters

In the real world, some questions (like simple math problems) have "Light Tails" (the answer is either right or wrong, and high scores are rare and good). Other questions (like creative writing or open-ended advice) have "Heavy Tails" (LLMs can hallucinate and get high scores for nonsense).

BoT is the first system that automatically knows which game it's playing.

When it's safe to be greedy, it is greedy.
When it's dangerous to be greedy, it is cautious.

The Result

In their experiments, BoT consistently beat the "always optimistic" and "always pessimistic" strategies. It found better answers in math tests, reasoning tasks, and human preference evaluations. It managed to find the "diamonds" without falling for the "fake prizes."

In short: Instead of forcing a hammer (Optimism) or a screwdriver (Pessimism) on every problem, BoT is a Swiss Army Knife that picks the right tool for the specific job at hand.

1. Problem Statement

Inference-time alignment aims to steer Large Language Models (LLMs) toward human preferences (e.g., correctness, safety) by generating multiple candidate responses and selecting the best one using a reward model (RM), without updating the model's weights. The dominant strategy, Best-of-N (BoN), selects the candidate with the highest proxy reward score.
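As a point of reference, the Best-of-N rule described above can be sketched in a few lines. This is a minimal illustration, not the authors' code: `generate` and `proxy_reward` are hypothetical callables standing in for the reference model and the reward model.

```python
from typing import Callable


def best_of_n(
    prompt: str,
    generate: Callable[[str], str],             # hypothetical: samples one response from the reference model
    proxy_reward: Callable[[str, str], float],  # hypothetical: reward model score for (prompt, response)
    n: int,
) -> str:
    """Draw n candidates and return the one with the highest proxy reward."""
    if n < 1:
        raise ValueError("n must be at least 1")
    candidates = [generate(prompt) for _ in range(n)]
    # The model weights are never updated; only the selection step uses the reward model.
    return max(candidates, key=lambda y: proxy_reward(prompt, y))
```

Because the rule is a plain argmax over proxy scores, any candidate that the reward model over-scores is selected as soon as it appears in the pool, which is the failure mode discussed next.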

However, current strategies face a fundamental trade-off:

Optimistic Strategies (e.g., BoN, Soft-BoN): These aggressively select high-reward candidates. While effective at finding high-quality outliers, they are prone to reward hacking (Goodhart's Law). As the number of samples ( $N$ ) increases, these methods over-optimize the imperfect proxy reward, leading to a degradation in true quality because the reward model is mis-calibrated in the extreme tails of the distribution.
Pessimistic Strategies (e.g., Inference-Time Pessimism - ITP): These use conservative regularization (e.g., $\chi^2$ divergence) to prevent over-optimization. While robust against reward hacking, they often stifle exploration, failing to discover genuinely high-quality responses when the reward signal is reliable and the distribution is "light-tailed."

The core problem is that existing methods use a fixed strategy (either always optimistic or always pessimistic) regardless of the specific statistical properties of the reward distribution for a given prompt.

2. Methodology: Best-of-Tails (BoT)

The authors propose Best-of-Tails (BoT), an adaptive framework that dynamically interpolates between optimistic and pessimistic selection rules based on the tail behavior of the reward distribution for each specific prompt.

Theoretical Foundation

Regret Minimization: The authors formalize the alignment problem through the lens of regret minimization. They derive an upper bound on inference-time regret that isolates the trade-off between Alignment Gain (finding better responses) and Distortion (deviating from the reference policy).
Tail Behavior Analysis:
- Light-Tailed Regimes: High-reward responses are rare ("needles in a haystack"). Here, aggressive exploration (optimism) is necessary to find them, and the risk of reward hacking is low.
- Heavy-Tailed Regimes: The reward distribution has a long tail of high scores, often due to reward model mis-calibration. Here, aggressive selection leads to severe distortion and reward hacking; conservative selection (pessimism) is required.
Tsallis Divergence: To bridge the gap between the exponential re-weighting of KL-divergence (optimistic/Soft-BoN) and the linear re-weighting of $\chi^2$-divergence (pessimistic/ITP), BoT utilizes Tsallis divergence of order $\alpha$.
- The selection policy is defined as: $\pi_{BoT}(y|x) \propto \pi_{ref}(y|x) \exp_\alpha(\hat{r}(x, y)/\lambda)$ .
- As $\alpha \to 1$ , it recovers Soft-BoN (Optimistic).
- As $\alpha \to 2$ , it recovers ITP (Pessimistic).
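The interpolation above can be illustrated with a small sketch. The summary does not define $\exp_\alpha$, so the form used here, $\exp_\alpha(z) = [1 + (\alpha - 1) z]_+^{1/(\alpha - 1)}$, is an assumption: it is one deformed exponential that is exponential as $\alpha \to 1$ and linear (clipped at zero) at $\alpha = 2$, matching the two limits stated. The helper `bot_weights` is hypothetical and simply normalizes these values over candidates already sampled from the reference model.

```python
import math
from typing import List


def exp_alpha(z: float, alpha: float) -> float:
    """Assumed deformed exponential: exp(z) at alpha = 1, max(0, 1 + z) at alpha = 2."""
    if abs(alpha - 1.0) < 1e-12:
        return math.exp(z)
    base = max(0.0, 1.0 + (alpha - 1.0) * z)
    return base ** (1.0 / (alpha - 1.0))


def bot_weights(rewards: List[float], alpha: float, lam: float) -> List[float]:
    """Normalized selection weights over sampled candidates (hypothetical helper).

    The candidates are assumed to be draws from the reference policy, so
    re-weighting them by exp_alpha(r / lam) approximates the tilted policy.
    """
    raw = [exp_alpha(r / lam, alpha) for r in rewards]
    total = sum(raw)
    if total == 0.0:
        # Every weight was clipped to zero: fall back to uniform selection.
        return [1.0 / len(rewards)] * len(rewards)
    return [w / total for w in raw]


# Same rewards, two settings of alpha: the exponential end puts more mass on the top score.
optimistic = bot_weights([0.0, 1.0], alpha=1.0, lam=1.0)
pessimistic = bot_weights([0.0, 1.0], alpha=2.0, lam=1.0)
```

In this toy case the top candidate receives weight $e/(1+e)$ at $\alpha = 1$ and $2/3$ at $\alpha = 2$, so moving $\alpha$ toward 2 flattens the preference for extreme scores.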

Adaptive Mechanism

BoT does not use a fixed hyperparameter $\alpha$ . Instead, it estimates the tail heaviness of the reward distribution for each prompt and adjusts $\alpha$ accordingly:

Tail Estimation: It uses the Hill Estimator (from extreme value theory) on the top $K$ proxy reward scores from $N$ sampled candidates to estimate the tail index $\hat{\kappa}(x)$.
- Small $\hat{\kappa}$ indicates a light tail.
- Large $\hat{\kappa}$ indicates a heavy tail.
Adaptive Interpolation: The parameter $\alpha$ is dynamically set using a mapping function:
$\alpha(x) = 1 + \frac{\hat{\kappa}(x)}{\hat{\kappa}(x) + \kappa_0}$
where $\kappa_0$ is a pivot hyperparameter.
- If the tail is light ( $\hat{\kappa} \ll \kappa_0$ ), $\alpha \to 1$ (Optimistic).
- If the tail is heavy ( $\hat{\kappa} \gg \kappa_0$ ), $\alpha \to 2$ (Pessimistic).
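The two steps above can be sketched as follows. The estimator shown is the textbook Hill estimator (the mean log-excess of the top $K$ scores over the $(K+1)$-th largest), which grows as the tail gets heavier and so agrees with the convention used here. It assumes the scores in the tail are strictly positive; how the authors handle negative or shifted rewards is not stated in this summary, so that part is an assumption, as are the function names.

```python
import math
from typing import Sequence


def hill_estimator(scores: Sequence[float], k: int) -> float:
    """Textbook Hill estimate from the top-k order statistics (assumes positive tail scores)."""
    if not 1 <= k < len(scores):
        raise ValueError("k must satisfy 1 <= k < len(scores)")
    top = sorted(scores, reverse=True)[: k + 1]
    threshold = top[k]  # the (k+1)-th largest score
    if threshold <= 0:
        raise ValueError("the Hill estimator needs positive scores in the tail")
    return sum(math.log(s / threshold) for s in top[:k]) / k


def adaptive_alpha(kappa_hat: float, kappa_0: float) -> float:
    """Mapping from the text: alpha = 1 + kappa_hat / (kappa_hat + kappa_0)."""
    if kappa_hat < 0 or kappa_0 <= 0:
        raise ValueError("expected kappa_hat >= 0 and kappa_0 > 0")
    return 1.0 + kappa_hat / (kappa_hat + kappa_0)


# Toy scores, not data from the paper: the top two scores sit well above the third.
toy_scores = [math.exp(2.0), math.exp(1.0), 1.0]
kappa_hat = hill_estimator(toy_scores, k=2)     # mean of log-excesses 2 and 1
alpha = adaptive_alpha(kappa_hat, kappa_0=1.5)  # pivot chosen only for illustration
```

With the pivot equal to the estimate, the mapping returns the midpoint $\alpha = 1.5$; estimates far below the pivot push $\alpha$ toward 1, and estimates far above it push $\alpha$ toward 2.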

3. Key Contributions

Theoretical Insight: The paper provides the first rigorous theoretical analysis linking the tail behavior of reward distributions to the optimality of inference-time alignment strategies. It proves that the optimal strategy is not universal but depends on whether the reward distribution is light-tailed or heavy-tailed.
Novel Framework (BoT): Introduction of a unified framework using Tsallis divergence that adaptively interpolates between optimistic and pessimistic strategies based on per-prompt tail statistics.
Efficient Estimation: Demonstration that estimating the tail index (a scalar) via the Hill estimator is significantly more sample-efficient than modeling the full reward distribution, making the approach practical for inference-time scaling.
Empirical Validation: Comprehensive experiments showing that BoT consistently outperforms fixed-strategy baselines across diverse tasks and model configurations.

4. Experimental Results

The authors evaluated BoT on four benchmarks: GSM8K (math), MMLU (multiple-choice), MATH (competition math), and AlpacaFarm (human preference). They tested various reference models (Gemma, Llama, Mistral) and reward models.

Performance vs. Sample Size ( $N$ ):
- Optimistic Baselines (BoN/sBoN): Performance improves initially but degrades as $N$ increases due to reward hacking (true reward drops while proxy reward rises).
- Pessimistic Baselines (ITP): Remain robust but saturate early, failing to leverage larger $N$ to find better solutions.
- BoT: Successfully navigates the trade-off. It achieves higher peak true rewards than ITP and avoids the degradation seen in BoN, maintaining high true reward even at large $N$ .
Adaptivity: Visualizations of the Hill estimator show that BoT correctly identifies heavy-tailed prompts (switching to $\alpha \approx 2$ ) and light-tailed prompts (switching to $\alpha \approx 1$ ), whereas fixed strategies fail to adapt to this heterogeneity.
Distortion Analysis: BoT achieves a better Pareto frontier between proxy reward and policy distortion (KL/ $\chi^2$ ) compared to static baselines.

5. Significance

Solving the Alignment Dilemma: BoT addresses the tension between optimism (aggressively searching for the best answer) and pessimism (guarding against reward hacking) by adapting the selection strategy to each prompt's reward distribution.
Inference-Time Scaling: As the community moves toward "scaling laws at inference time" (using more compute to generate more candidates), BoT provides a theoretically grounded method so that increasing $N$ yields, at worst, diminishing returns in quality rather than the degradation seen with optimistic baselines.
Generalizability: The approach is model-agnostic and can be applied to any LLM with a reference policy and a reward model, requiring no fine-tuning of the base model.
Future Directions: The paper suggests that tail-adaptive robustness could be distilled into dense models or used to dynamically stop sampling for light-tailed prompts to save compute, offering a path toward more efficient and reliable LLM deployment.
