Best-of-∞: Asymptotic Performance of Test-Time LLM Ensembling

This paper analyzes the asymptotic performance of best-of-N LLM ensembling via majority voting as N → ∞, proposing an adaptive generation scheme to efficiently allocate inference budgets and an optimal weighted ensemble method formulated as a mixed-integer linear program to outperform individual models.

Junpei Komiyama, Daisuke Oba, Masafumi Oyamada

Published 2026-03-05

Imagine you are trying to solve a very difficult riddle. You have a team of smart friends (Large Language Models, or LLMs) who can give you answers. Sometimes, one friend is right, but often they might guess wrong or get confused.

This paper is about a new, super-smart way to use your team of friends to get the right answer without wasting too much time or energy.

Here is the breakdown of their three big ideas, explained simply:

1. The "Best-of-∞" Dream vs. Reality

The Concept:
Imagine asking your friends for the answer to a riddle over and over again. If you ask them 1,000 times, the answer that comes up most often is almost certainly the team's true favorite. If you could ask infinitely many times, you would know that most-common answer with complete certainty. This is called Best-of-∞.
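Why does repetition help? A toy calculation (not from the paper; the 60% figure is invented for illustration): if a model returns the correct answer with probability 0.6 and one particular wrong answer otherwise, the chance that the majority over n samples is correct is an exact binomial sum, and it climbs toward 1 as n grows.

```python
import math

def majority_accuracy(p, n):
    """Probability that the most frequent of n independent answers is correct,
    for a model that answers correctly with probability p and gives a single
    fixed wrong answer otherwise (a tie at n/2 is counted as wrong)."""
    return sum(math.comb(n, k) * p ** k * (1 - p) ** (n - k)
               for k in range(n // 2 + 1, n + 1))

for n in (1, 11, 101, 1001):
    print(n, round(majority_accuracy(0.6, n), 4))
```

Even a model that is right only 60% of the time becomes nearly infallible under a large enough majority vote; the catch, as the next paragraph explains, is the cost of all those samples.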

The Problem:
Asking infinity times takes forever and costs a fortune in computer power. You can't actually do that in the real world.

The Solution (Adaptive Sampling):
Instead of asking a fixed number of times (like "ask 10 times no matter what"), the authors built a smart referee.

  • The Analogy: Think of a referee watching a coin flip. If the coin lands on "Heads" three times in a row, the referee stops and declares "Heads!" immediately. But if the coin is flipping back and forth (Heads, Tails, Heads, Tails), the referee keeps flipping until they are absolutely sure which side is winning.
  • How it works: The computer generates answers one by one. It uses a mathematical "confidence meter" (called a Bayes Factor). As soon as the answers agree enough to be confident, it stops.
  • The Result: For easy questions, it stops very quickly (saving time). For hard questions, it keeps going until it's sure. This saves a massive amount of computing power while still getting the same high accuracy as asking 100 times.
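The referee's stopping rule can be sketched in a few lines. This is a simplified stand-in, not the paper's implementation: in place of their Bayes factor it uses a Beta-posterior check on the two current front-runner answers, and the simulated "models", the 0.95 threshold, and the budget of 100 are all made up for illustration.

```python
import math
import random

def prob_leader_wins(a, b):
    """P(p > 1/2) under a Beta(a+1, b+1) posterior (uniform prior), where
    a and b are the vote counts of the two leading answers. Uses the exact
    Beta/binomial identity, so only integer arithmetic is needed."""
    n = a + b + 1
    return sum(math.comb(n, k) for k in range(a + 1)) / 2 ** n

def adaptive_majority(sample_answer, threshold=0.95, budget=100):
    """Draw answers one at a time; stop as soon as we are confident the
    current leader would also win the infinite-sample majority vote."""
    counts = {}
    for n in range(1, budget + 1):
        ans = sample_answer()
        counts[ans] = counts.get(ans, 0) + 1
        ranked = sorted(counts.values(), reverse=True)
        second = ranked[1] if len(ranked) > 1 else 0
        if prob_leader_wins(ranked[0], second) >= threshold:
            break
    return max(counts, key=counts.get), n

random.seed(0)
easy = lambda: random.choices(["42", "13"], weights=[0.9, 0.1])[0]  # confident model
hard = lambda: random.choices(["A", "B"], weights=[0.55, 0.45])[0]  # torn model
ans_e, n_e = adaptive_majority(easy)
ans_h, n_h = adaptive_majority(hard)
print(ans_e, n_e, ans_h, n_h)
```

With a unanimous model, four agreeing answers in a row already push the posterior past 0.95, so easy questions terminate almost immediately, while a 55/45 split keeps the referee sampling far longer: exactly the coin-flip behavior in the analogy.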

2. The "All-Star Team" (Ensembling)

The Concept:
Usually, people pick the single "smartest" friend and just use them. But what if your "smartest" friend is great at math but bad at science, and your "second-best" friend is the opposite?

The Solution:
The authors suggest mixing your team. Instead of picking just one, you let several different models answer the question, and you take the majority vote.

  • The Analogy: Imagine a sports team. If you only have one superstar striker, you might lose if they get injured or have an off day. But if you have a balanced team with a great striker, a great defender, and a great goalie, they cover each other's weaknesses.
  • The Magic: The paper shows that a mix of a "good" model and a "great" model can actually beat the "great" model alone. They complement each other.
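That "magic" can be made concrete with a toy calculation (the models, topics, and probabilities below are invented, not the paper's data). In the Best-of-∞ limit, weighted majority voting simply returns the most probable answer of the weighted mixture of the models' answer distributions, so a question is solved exactly when the correct answer is that mode:

```python
def best_of_inf_accuracy(weights, models, questions):
    """With infinitely many samples, majority voting returns the mode of the
    weighted mixture of the models' answer distributions. A question counts
    as solved iff the correct answer ("right") is that mode."""
    solved = 0
    for q in questions:
        mix = {}
        for w, model in zip(weights, models):
            for answer, p in model[q].items():
                mix[answer] = mix.get(answer, 0.0) + w * p
        if max(mix, key=mix.get) == "right":
            solved += 1
    return solved / len(questions)

# Invented answer distributions: each model is strong on one topic only.
model_a = {"algebra":  {"right": 0.8, "wrong": 0.2},
           "geometry": {"right": 0.3, "wrong": 0.7}}
model_b = {"algebra":  {"right": 0.3, "wrong": 0.7},
           "geometry": {"right": 0.8, "wrong": 0.2}}
questions = ["algebra", "geometry"]

print(best_of_inf_accuracy([1.0, 0.0], [model_a, model_b], questions))  # 0.5
print(best_of_inf_accuracy([0.0, 1.0], [model_a, model_b], questions))  # 0.5
print(best_of_inf_accuracy([0.5, 0.5], [model_a, model_b], questions))  # 1.0
```

Alone, each model solves only its strong topic (accuracy 0.5), while the 50/50 mix keeps the correct answer on top in both topics and scores 1.0: the balanced-team effect from the analogy.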

3. The "Perfect Mix" Formula (MILP)

The Concept:
If you have a team of five friends, how much should you listen to each one? Should you listen to Friend A 50% of the time and Friend B 10%? Or maybe 30% each?
Finding the perfect balance is a math nightmare because the relationship isn't a straight line (it's "non-concave," which is a fancy way of saying "it's tricky and bumpy").

The Solution:
The authors turned this tricky problem into a Lego puzzle (specifically, a Mixed-Integer Linear Program).

  • The Analogy: Imagine you have a map of a city with different zones. Some zones are "Safe Zones" where a specific mix of friends will get the answer right. The goal is to find the exact spot on the map where the most "Safe Zones" overlap.
  • How it works: They used a powerful computer solver to find the perfect percentage for each model. This ensures that for every type of problem, the team is weighted exactly right to get the best possible score.
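The paper's actual optimizer is a mixed-integer linear program; as a solver-free stand-in, the sketch below brute-forces a single two-model weight w over a 0.01 grid (all distributions are invented). It also shows why the problem is "tricky and bumpy": the objective is a staircase in w, flat almost everywhere with sudden jumps, so gradient-style methods get no signal and a combinatorial search such as a MILP is needed.

```python
def accuracy(w, dists_a, dists_b):
    """Best-of-inf accuracy when model A gets weight w and model B weight
    1 - w: the fraction of questions whose correct answer ("right") is the
    mode of the weighted mixture of the two answer distributions."""
    solved = 0
    for da, db in zip(dists_a, dists_b):
        mix = {ans: w * da.get(ans, 0) + (1 - w) * db.get(ans, 0)
               for ans in set(da) | set(db)}
        solved += max(mix, key=mix.get) == "right"
    return solved / len(dists_a)

# Invented per-question answer distributions for two models.
dists_a = [{"right": 0.8, "wrong": 0.2}, {"right": 0.35, "wrong": 0.65}]
dists_b = [{"right": 0.35, "wrong": 0.65}, {"right": 0.8, "wrong": 0.2}]

# The objective is piecewise constant in w (a staircase), so we scan a grid
# instead of following gradients; the paper does this exactly via a MILP.
grid = [i / 100 for i in range(101)]
best_w = max(grid, key=lambda w: accuracy(w, dists_a, dists_b))
print(best_w, accuracy(best_w, dists_a, dists_b))  # an interior mix wins
print(accuracy(0.0, dists_a, dists_b), accuracy(1.0, dists_a, dists_b))
```

Here neither pure model scores above 0.5, but a band of interior weights solves every question; the MILP finds such weights exactly rather than by grid enumeration, and scales to many models at once.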

The Big Picture Results

The authors tested this on some of the hardest math and science puzzles (like the AIME math competition).

  • Efficiency: Their "smart referee" method got the same high scores as asking 100 times, but only needed about 10 to 20 samples per question on average. That's a 2x to 5x saving in computer time.
  • Performance: By mixing different models with their "perfect formula," they created a team that was smarter than any single model in the group.

In short: They figured out how to ask a team of AI models the right number of times to be sure of the answer, and how to mix different models together so they act like a super-brain that is better than the sum of its parts.