Best-of-Tails: Bridging Optimism and Pessimism in Inference-Time Alignment

This paper introduces Best-of-Tails (BoT), an adaptive inference-time alignment framework that dynamically balances optimistic and pessimistic selection strategies. It measures how heavy the tail of the reward distribution is with the Hill estimator, and uses Tsallis divergence as a tunable regularizer to optimize performance across diverse reasoning and preference tasks.

Hsiang Hsu, Eric Lei, Chun-Fu Chen

Published 2026-03-10

Imagine you are a hiring manager trying to find the perfect candidate for a job. You have a resume screening tool (the Reward Model) that gives every applicant a score. You also have a pool of candidates generated by an AI (the Reference Model).

The goal is to pick the absolute best candidate. But here's the catch: your screening tool isn't perfect. Sometimes it gives a high score to a candidate who looks great on paper but is actually terrible at the job (this is called Reward Hacking). Other times, it misses a genius candidate because their resume is weirdly formatted.

This paper tackles a specific problem: How do you pick the best candidate when you have to generate many options and your scoring tool is flawed?

The Two Extreme Approaches (The Old Ways)

Currently, people use two main strategies, both of which have big flaws:

  1. The "Optimist" (Best-of-N):

    • The Strategy: "Generate 100 candidates, look at their scores, and pick the one with the highest score!"
    • The Analogy: Imagine a casino. The Optimist thinks, "If I play the slot machine 100 times, I'll eventually hit the jackpot."
    • The Problem: If the slot machine is rigged (the reward model is flawed), the Optimist will keep hitting the "fake jackpot" (reward hacking). They pick the candidate who looks the best to the machine, but is actually a fraud.
  2. The "Pessimist" (Regularized/Conservative):

    • The Strategy: "Don't trust the high scores too much. Stick closer to the average candidates to be safe."
    • The Analogy: This is like a cautious investor who refuses to buy any stock that has gone up too fast, fearing a crash.
    • The Problem: They are so afraid of the "fake jackpot" that they miss out on the real geniuses. They play it so safe they never find the truly amazing candidate.
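
To make the contrast concrete, here is a minimal Python sketch of the two selection rules. The scores are made up, and conservative_pick (with its max_z cutoff) is a hypothetical stand-in for a pessimistic rule, not the regularizer the paper actually uses.

```python
import statistics

def best_of_n(scores):
    """Optimist: return the index of the highest-scoring candidate."""
    return max(range(len(scores)), key=scores.__getitem__)

def conservative_pick(scores, max_z=2.0):
    """Pessimist (illustrative only): ignore candidates whose score is more
    than max_z standard deviations above the mean, then take the best of
    the remaining ones."""
    mean = statistics.fmean(scores)
    std = statistics.pstdev(scores)
    eligible = [
        i for i, s in enumerate(scores)
        if std == 0 or (s - mean) / std <= max_z
    ]
    return max(eligible, key=scores.__getitem__)

# Made-up reward-model scores; index 3 is a suspiciously extreme outlier.
scores = [0.2, 0.5, 0.4, 9.0, 0.6, 0.3, 0.5, 0.4, 0.6, 0.5]

optimist_choice = best_of_n(scores)           # picks the outlier
pessimist_choice = conservative_pick(scores)  # skips the outlier
print(optimist_choice, pessimist_choice)
```

If the outlier is a genuine genius, the Optimist wins; if it is a reward-hacking fluke, the Pessimist wins. Neither rule can tell which case it is in.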

The Big Discovery: It Depends on the "Shape" of the Scores

The authors realized that the right strategy depends on the shape of the distribution of the scores. They call this the "Tail Behavior."

  • Light Tail (The "Needle in a Haystack"):

    • Scenario: High scores are very rare. Most candidates are average, and only a few are truly great.
    • Analogy: Finding a diamond in a pile of dirt.
    • Best Strategy: You need to be an Optimist. You must dig deep and pick the highest score, because the "fake" high scores aren't that common. If you are too conservative, you'll never find the diamond.
  • Heavy Tail (The "Wild West"):

    • Scenario: There are many candidates with extremely high scores, but many of them are fakes or flukes.
    • Analogy: A carnival game where the machine is broken and gives out huge prizes to almost everyone, but most prizes are worthless.
    • Best Strategy: You need to be a Pessimist. If you just pick the highest score, you'll get ripped off. You need to be skeptical and stick to the safer, more reliable options.

The Dilemma: In the real world, we don't know in advance whether a specific prompt (job description) is a "Needle in a Haystack" or a "Broken Carnival Game." If we use a fixed strategy (always Optimist or always Pessimist), we will choose badly on every prompt that falls into the other regime.
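
One standard way to put a number on tail heaviness is the Hill estimator, which the paper uses for this purpose. The sketch below is a minimal version that assumes positive scores. The sample size, the choice of k, and the synthetic Gaussian and Pareto "scores" are illustrative assumptions, not the paper's setup.

```python
import math
import random

def hill_estimator(scores, k):
    """Hill estimate of the tail index gamma from the k largest scores.

    A larger gamma means a heavier right tail. Assumes the k + 1 largest
    scores are strictly positive.
    """
    if not 1 <= k < len(scores):
        raise ValueError("k must satisfy 1 <= k < len(scores)")
    top = sorted(scores, reverse=True)[: k + 1]
    threshold = top[k]  # the (k + 1)-th largest score
    if threshold <= 0:
        raise ValueError("the top k + 1 scores must be positive")
    # Average log-excess of the k largest scores over the threshold.
    return sum(math.log(x / threshold) for x in top[:k]) / k

rng = random.Random(0)
n, k = 5000, 250  # illustrative sizes: use the top 5% of samples

# Light tail: Gaussian scores shifted to be positive.
light = [10.0 + rng.gauss(0.0, 1.0) for _ in range(n)]
# Heavy tail: Pareto scores with shape 1.5 (true tail index 1 / 1.5).
heavy = [rng.paretovariate(1.5) for _ in range(n)]

gamma_light = hill_estimator(light, k)
gamma_heavy = hill_estimator(heavy, k)
print(f"light tail: {gamma_light:.3f}, heavy tail: {gamma_heavy:.3f}")
```

The heavy-tailed sample should produce a clearly larger estimate than the light-tailed one. Keep in mind that the Hill estimate depends on the choice of k and changes if the scores are shifted, so in practice both need some care.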

The Solution: "Best-of-Tails" (BoT)

The authors propose a new framework called Best-of-Tails (BoT). Think of this as a Smart Hiring Manager with a "Lie Detector."

Here is how BoT works, step-by-step:

  1. Sample the Crowd: First, it generates a bunch of candidates (say, 100) and gets their scores.
  2. Check the "Tail" (The Lie Detector): Before picking a winner, it looks at the top scores. It asks: "Are these top scores a rare, genuine spike (Light Tail), or are there too many high scores that look suspicious (Heavy Tail)?"
    • It uses a statistical tool called the Hill Estimator (a fancy math way of measuring how "heavy" the tail is) to estimate this directly from the scores it has already sampled.
  3. Switch Modes:
    • If the tail is Light: It switches to Optimist Mode. It aggressively picks the highest-scoring candidate because the risk of a fake high score is low.
    • If the tail is Heavy: It switches to Pessimist Mode. It ignores the extreme outliers and picks a candidate that is high-scoring but more "average" and safe, avoiding the trap of the broken machine.
  4. The Magic Glue (Tsallis Divergence): To make this switch smooth, they use a mathematical tool called Tsallis Divergence. Imagine this as a dimmer switch.
    • At one end, it's pure Optimism (KL Divergence).
    • At the other end, it's pure Pessimism (Chi-Squared Divergence).
    • BoT turns the dimmer switch to the exact right setting based on what it saw in step 2.
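
The dimmer switch can be sketched in a few lines of Python. The Tsallis divergence below follows the standard definition for discrete distributions, which approaches KL as alpha goes to 1 and equals chi-squared at alpha = 2. The alpha_from_tail mapping and its gamma_light and gamma_heavy thresholds are hypothetical: this summary does not say how the paper converts the Hill estimate into a setting, so treat that part purely as an illustration.

```python
import math

def kl_divergence(p, q):
    """KL divergence between discrete distributions p and q (q > 0)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def chi_squared_divergence(p, q):
    """Chi-squared divergence between discrete distributions p and q (q > 0)."""
    return sum((pi - qi) ** 2 / qi for pi, qi in zip(p, q))

def tsallis_divergence(p, q, alpha):
    """Tsallis divergence of order alpha between p and q (q > 0)."""
    if abs(alpha - 1.0) < 1e-12:
        return kl_divergence(p, q)  # limiting case alpha -> 1
    total = sum(pi ** alpha * qi ** (1.0 - alpha) for pi, qi in zip(p, q))
    return (total - 1.0) / (alpha - 1.0)

def alpha_from_tail(gamma, gamma_light=0.1, gamma_heavy=0.5):
    """Hypothetical dimmer: map a Hill estimate gamma to alpha in [1, 2].

    gamma <= gamma_light -> alpha = 1 (KL end, optimistic)
    gamma >= gamma_heavy -> alpha = 2 (chi-squared end, pessimistic)
    The thresholds are made-up values, not taken from the paper.
    """
    t = (gamma - gamma_light) / (gamma_heavy - gamma_light)
    return 1.0 + min(1.0, max(0.0, t))

p = [0.7, 0.2, 0.1]  # made-up selection distribution
q = [0.5, 0.3, 0.2]  # made-up reference distribution

near_kl = tsallis_divergence(p, q, 1.0001)
at_chi2 = tsallis_divergence(p, q, 2.0)
print(near_kl, kl_divergence(p, q))
print(at_chi2, chi_squared_divergence(p, q))
print(alpha_from_tail(0.05), alpha_from_tail(0.3), alpha_from_tail(0.9))
```

The two ends of the dial match the two familiar divergences, and everything in between is a blend whose strength is set by a single number.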

Why This Matters

In the real world, some questions (like simple math problems) have "Light Tails" (the answer is either right or wrong, and high scores are rare and good). Other questions (like creative writing or open-ended advice) have "Heavy Tails" (LLMs can hallucinate and get high scores for nonsense).

BoT works out automatically, prompt by prompt, which game it's playing.

  • When it's safe to be greedy, it is greedy.
  • When it's dangerous to be greedy, it is cautious.

The Result

In their experiments, BoT consistently beat the "always optimistic" and "always pessimistic" strategies. It found better answers in math tests, reasoning tasks, and human preference evaluations. It managed to find the "diamonds" without falling for the "fake prizes."

In short: Instead of forcing a hammer (Optimism) or a screwdriver (Pessimism) on every problem, BoT is a Swiss Army Knife that picks the right tool for the specific job at hand.