Best-of-∞: Asymptotic Performance of Test-Time LLM Ensembling

This paper analyzes the asymptotic performance of best-of-N LLM ensembling via majority voting as N → ∞, proposing an adaptive generation scheme to efficiently allocate inference budgets and an optimal weighted ensemble method formulated as a mixed-integer linear program to outperform individual models.

Junpei Komiyama, Daisuke Oba, Masafumi Oyamada

Published 2026-03-05

Imagine you are trying to solve a very difficult riddle. You have a team of smart friends (Large Language Models, or LLMs) who can give you answers. Sometimes, one friend is right, but often they might guess wrong or get confused.

This paper is about a new, super-smart way to use your team of friends to get the right answer without wasting too much time or energy.

Here is the breakdown of their three big ideas, explained simply:

1. The "Best-of-∞" Dream vs. Reality

The Concept:
Imagine asking your friends for the answer to a riddle over and over again. If you ask them 1,000 times, the answer that comes up most often is almost certainly the team's true favorite. If you could ask infinitely many times, you would know that most-common answer with complete certainty. This is called Best-of-∞.
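Why does repetition help? A toy calculation (not from the paper; the 60% figure is invented for illustration): if a model returns the correct answer with probability 0.6 and one particular wrong answer otherwise, the chance that the majority over n samples is correct is an exact binomial sum, and it climbs toward 1 as n grows.

```python
import math

def majority_accuracy(p, n):
    """Probability that the most frequent of n independent answers is correct,
    for a model that answers correctly with probability p and gives a single
    fixed wrong answer otherwise (a tie at n/2 is counted as wrong)."""
    return sum(math.comb(n, k) * p ** k * (1 - p) ** (n - k)
               for k in range(n // 2 + 1, n + 1))

for n in (1, 11, 101, 1001):
    print(n, round(majority_accuracy(0.6, n), 4))
```

Even a model that is right only 60% of the time becomes nearly infallible under a large enough majority vote; the catch, as the next paragraph explains, is the cost of all those samples.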

The Problem:
Asking infinity times takes forever and costs a fortune in computer power. You can't actually do that in the real world.

The Solution (Adaptive Sampling):
Instead of asking a fixed number of times (like "ask 10 times no matter what"), the authors built a smart referee.

  • The Analogy: Think of a referee watching a coin flip. If the coin lands on "Heads" three times in a row, the referee stops and declares "Heads!" immediately. But if the coin is flipping back and forth (Heads, Tails, Heads, Tails), the referee keeps flipping until they are absolutely sure which side is winning.
  • How it works: The computer generates answers one by one. It uses a mathematical "confidence meter" (called a Bayes Factor). As soon as the answers agree enough to be confident, it stops.
  • The Result: For easy questions, it stops very quickly (saving time). For hard questions, it keeps going until it's sure. This saves a massive amount of computing power while still getting the same high accuracy as asking 100 times.
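The referee's stopping rule can be sketched in a few lines. This is a simplified stand-in, not the paper's implementation: in place of their Bayes factor it uses a Beta-posterior check on the two current front-runner answers, and the simulated "models", the 0.95 threshold, and the budget of 100 are all made up for illustration.

```python
import math
import random

def prob_leader_wins(a, b):
    """P(p > 1/2) under a Beta(a+1, b+1) posterior (uniform prior), where
    a and b are the vote counts of the two leading answers. Uses the exact
    Beta/binomial identity, so only integer arithmetic is needed."""
    n = a + b + 1
    return sum(math.comb(n, k) for k in range(a + 1)) / 2 ** n

def adaptive_majority(sample_answer, threshold=0.95, budget=100):
    """Draw answers one at a time; stop as soon as we are confident the
    current leader would also win the infinite-sample majority vote."""
    counts = {}
    for n in range(1, budget + 1):
        ans = sample_answer()
        counts[ans] = counts.get(ans, 0) + 1
        ranked = sorted(counts.values(), reverse=True)
        second = ranked[1] if len(ranked) > 1 else 0
        if prob_leader_wins(ranked[0], second) >= threshold:
            break
    return max(counts, key=counts.get), n

random.seed(0)
easy = lambda: random.choices(["42", "13"], weights=[0.9, 0.1])[0]  # confident model
hard = lambda: random.choices(["A", "B"], weights=[0.55, 0.45])[0]  # torn model
ans_e, n_e = adaptive_majority(easy)
ans_h, n_h = adaptive_majority(hard)
print(ans_e, n_e, ans_h, n_h)
```

With a unanimous model, four agreeing answers in a row already push the posterior past 0.95, so easy questions terminate almost immediately, while a 55/45 split keeps the referee sampling far longer: exactly the coin-flip behavior in the analogy.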

2. The "All-Star Team" (Ensembling)

The Concept:
Usually, people pick the single "smartest" friend and just use them. But what if your "smartest" friend is great at math but bad at science, and your "second-best" friend is the opposite?

The Solution:
The authors suggest mixing your team. Instead of picking just one, you let several different models answer the question, and you take the majority vote.

  • The Analogy: Imagine a sports team. If you only have one superstar striker, you might lose if they get injured or have an off day. But if you have a balanced team with a great striker, a great defender, and a great goalie, they cover each other's weaknesses.
  • The Magic: The paper shows that a mix of a "good" model and a "great" model can actually beat the "great" model alone. They complement each other.
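That "magic" can be made concrete with a toy calculation (the models, topics, and probabilities below are invented, not the paper's data). In the Best-of-∞ limit, weighted majority voting simply returns the most probable answer of the weighted mixture of the models' answer distributions, so a question is solved exactly when the correct answer is that mode:

```python
def best_of_inf_accuracy(weights, models, questions):
    """With infinitely many samples, majority voting returns the mode of the
    weighted mixture of the models' answer distributions. A question counts
    as solved iff the correct answer ("right") is that mode."""
    solved = 0
    for q in questions:
        mix = {}
        for w, model in zip(weights, models):
            for answer, p in model[q].items():
                mix[answer] = mix.get(answer, 0.0) + w * p
        if max(mix, key=mix.get) == "right":
            solved += 1
    return solved / len(questions)

# Invented answer distributions: each model is strong on one topic only.
model_a = {"algebra":  {"right": 0.8, "wrong": 0.2},
           "geometry": {"right": 0.3, "wrong": 0.7}}
model_b = {"algebra":  {"right": 0.3, "wrong": 0.7},
           "geometry": {"right": 0.8, "wrong": 0.2}}
questions = ["algebra", "geometry"]

print(best_of_inf_accuracy([1.0, 0.0], [model_a, model_b], questions))  # 0.5
print(best_of_inf_accuracy([0.0, 1.0], [model_a, model_b], questions))  # 0.5
print(best_of_inf_accuracy([0.5, 0.5], [model_a, model_b], questions))  # 1.0
```

Alone, each model solves only its strong topic (accuracy 0.5), while the 50/50 mix keeps the correct answer on top in both topics and scores 1.0: the balanced-team effect from the analogy.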

3. The "Perfect Mix" Formula (MILP)

The Concept:
If you have a team of five friends, how much should you listen to each one? Should you listen to Friend A 50% of the time and Friend B 10%? Or maybe 30% each?
Finding the perfect balance is a math nightmare because the relationship isn't a straight line (it's "non-concave," which is a fancy way of saying "it's tricky and bumpy").

The Solution:
The authors turned this tricky problem into a Lego puzzle (specifically, a Mixed-Integer Linear Program).

  • The Analogy: Imagine you have a map of a city with different zones. Some zones are "Safe Zones" where a specific mix of friends will get the answer right. The goal is to find the exact spot on the map where the most "Safe Zones" overlap.
  • How it works: They used a powerful computer solver to find the perfect percentage for each model. This ensures that for every type of problem, the team is weighted exactly right to get the best possible score.
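The paper's actual optimizer is a mixed-integer linear program; as a solver-free stand-in, the sketch below brute-forces a single two-model weight w over a 0.01 grid (all distributions are invented). It also shows why the problem is "tricky and bumpy": the objective is a staircase in w, flat almost everywhere with sudden jumps, so gradient-style methods get no signal and a combinatorial search such as a MILP is needed.

```python
def accuracy(w, dists_a, dists_b):
    """Best-of-inf accuracy when model A gets weight w and model B weight
    1 - w: the fraction of questions whose correct answer ("right") is the
    mode of the weighted mixture of the two answer distributions."""
    solved = 0
    for da, db in zip(dists_a, dists_b):
        mix = {ans: w * da.get(ans, 0) + (1 - w) * db.get(ans, 0)
               for ans in set(da) | set(db)}
        solved += max(mix, key=mix.get) == "right"
    return solved / len(dists_a)

# Invented per-question answer distributions for two models.
dists_a = [{"right": 0.8, "wrong": 0.2}, {"right": 0.35, "wrong": 0.65}]
dists_b = [{"right": 0.35, "wrong": 0.65}, {"right": 0.8, "wrong": 0.2}]

# The objective is piecewise constant in w (a staircase), so we scan a grid
# instead of following gradients; the paper does this exactly via a MILP.
grid = [i / 100 for i in range(101)]
best_w = max(grid, key=lambda w: accuracy(w, dists_a, dists_b))
print(best_w, accuracy(best_w, dists_a, dists_b))  # an interior mix wins
print(accuracy(0.0, dists_a, dists_b), accuracy(1.0, dists_a, dists_b))
```

Here neither pure model scores above 0.5, but a band of interior weights solves every question; the MILP finds such weights exactly rather than by grid enumeration, and scales to many models at once.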

The Big Picture Results

The authors tested this on some of the hardest math and science puzzles (like the AIME math competition).

  • Efficiency: Their "smart referee" method got the same high scores as asking 100 times, but only needed about 10 to 20 samples per question on average. That's a 2x to 5x saving in computer time.
  • Performance: By mixing different models with their "perfect formula," they created a team that was smarter than any single model in the group.

In short: They figured out how to ask a team of AI models the right number of times to be sure of the answer, and how to mix different models together so they act like a super-brain that is better than the sum of its parts.