Imagine you are a talent scout trying to find the best chess player in the world. In the old days, you might just watch them play one game. If they win, they are great; if they lose, they are not.
But today's Large Language Models (LLMs) rarely play the same game twice: because they sample their answers randomly, the same model can solve a problem on one run and flub it on the next. Sometimes they get lucky, sometimes they get stuck. If you only watch one game, you might get a bad read on their true skill.
This paper is about a new way to judge these AI models. Instead of watching them play once, we ask them to solve the same math problem 80 times (this is called "Test-Time Scaling"). Now, we have a massive pile of results. The big question becomes: How do we rank them fairly when we have so much data?
Here is the breakdown of the paper using simple analogies:
1. The Problem: The "One-and-Done" Trap
Imagine you are judging a cooking contest. If you only taste one dish from each chef, you might think a chef is amazing because they got lucky with a perfect sear, or terrible because they dropped a spoon.
- Old Way: Taste one dish per chef.
- New Way (Test-Time Scaling): Ask each chef to cook the same dish 80 times. Now you have a huge buffet of data.
- The Issue: With 80 dishes per chef, how do you decide who is truly the best? Do you just count how many dishes were edible? Do you look at who beat whom in head-to-head comparisons? Different counting methods can give you different winners, even with the same data!
2. The Solution: "Scorio" (The Universal Scorekeeper)
The authors built a software tool called Scorio (think of it as a super-sophisticated referee whistle). It implements dozens of different ways to count the votes and rank the chefs.
- They tested 20 different AI "chefs" on 4 very hard math competitions (like the Olympics for math).
- They ran each AI 80 times on every problem.
- They used Scorio to see which ranking method gave the most consistent and fair results.
3. The Findings: What Works Best?
A. The "Gold Standard" (The Full Buffet)
When you have all 80 attempts for every model, most ranking methods agree on who is the best. It's like if you tasted all 80 dishes, everyone would agree on the top chef.
- The Winner: The simplest method often works best: just count the total number of correct answers and divide by the total attempts. The authors call this BayesU@80. It's the "average accuracy" of the whole buffet.
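The "average accuracy" idea is simple enough to sketch in a few lines of Python. The pass/fail records below are made-up illustrative numbers, not the paper's data (shown with 10 attempts per model instead of 80 for readability):

```python
# Hypothetical pass/fail records (1 = correct, 0 = incorrect) for three
# models on repeated attempts at the same benchmark; illustrative only.
results = {
    "model_a": [1, 1, 0, 1, 1, 0, 1, 1, 1, 0],
    "model_b": [1, 0, 0, 1, 0, 0, 1, 0, 1, 0],
    "model_c": [1, 1, 1, 0, 1, 0, 0, 1, 0, 1],
}

def average_accuracy(attempts):
    """Total correct answers divided by total attempts: the 'whole buffet' score."""
    return sum(attempts) / len(attempts)

scores = {name: average_accuracy(a) for name, a in results.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)  # → ['model_a', 'model_c', 'model_b'] (0.7 beats 0.6 beats 0.4)
```

With the full buffet of attempts, this one number is hard to argue with: every attempt counts equally, and no pairwise comparisons or vote-counting rules are needed.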
B. The "Low Budget" Problem (The Taste Test)
In the real world, you can't always afford to run a model 80 times. Maybe you only have time for one try (N=1). This is like tasting just one dish per chef.
- The Challenge: With only one taste, rankings are shaky. One chef might get a lucky win, and another might have a bad day.
- The Discovery: Some ranking methods are much more stable than others when data is scarce.
- The "Shrinkage" Trick: The paper found a clever trick. If you have a "greedy" version of the AI (one that always picks its single most likely answer instead of sampling, so it plays it safe and takes no risks), you can use its single attempt as a "hint" — in statistics terms, a prior.
- The Metaphor: Imagine you are judging a gymnast. You only see one routine (low budget). But you know from watching them practice (the "greedy" prior) that they are usually steady. You use that practice knowledge to smooth out the weirdness of the single routine.
- The Catch: This trick works great if the practice version and the real version agree. If the "safe" practice version is actually bad at the specific task, using it as a hint will bias your ranking and make you pick the wrong winner.
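The gymnast metaphor maps onto a simple estimator. One common way to implement this kind of shrinkage is a Beta-Bernoulli posterior mean, with the greedy result standing in for the prior mean; the paper's exact estimator may differ, so treat this as an illustrative sketch:

```python
def shrunk_accuracy(sampled_correct, n_samples, greedy_correct,
                    prior_strength=2.0):
    """Shrink a small-sample accuracy estimate toward a greedy-decoding 'hint'.

    A Beta-Bernoulli posterior mean (one common shrinkage form; hypothetical,
    not necessarily the paper's estimator). The greedy result (0 or 1) acts
    as the prior mean, and prior_strength says how many 'pseudo-attempts'
    the hint is worth.
    """
    prior_mean = float(greedy_correct)
    return (sampled_correct + prior_strength * prior_mean) / (n_samples + prior_strength)

# With only N=1 sampled attempt, the hint dominates the estimate:
print(shrunk_accuracy(sampled_correct=1, n_samples=1, greedy_correct=1))   # → 1.0
print(shrunk_accuracy(sampled_correct=0, n_samples=1, greedy_correct=1))   # ≈ 0.67: hint pulls up
# With N=80, the data dominates and the hint barely matters:
print(shrunk_accuracy(sampled_correct=56, n_samples=80, greedy_correct=0)) # ≈ 0.68
# The catch in action: a correct sample plus a wrong (failing) greedy hint
# yields (1 + 0) / 3 ≈ 0.33, which can flip a ranking at low budgets.
print(shrunk_accuracy(sampled_correct=1, n_samples=1, greedy_correct=0))
```

The last line is exactly the failure mode described above: when the "safe" greedy version disagrees with the sampled version, the hint biases the low-budget estimate instead of stabilizing it.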
C. The "Voting" vs. "Head-to-Head" Debate
The paper tested many different ways to count:
- Voting Rules: "Who won the most questions?" (Like a popularity contest).
- Head-to-Head: "Who beat whom more often?" (Like a sports league).
- Graph Methods: "Who is the most central node in the web of wins?" (Like Google's PageRank algorithm.)
- Result: When you have a lot of data (80 trials), all these methods usually agree. But when you have very little data, they start to disagree wildly. The paper tells us which ones to trust when we are in a hurry.
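The three families of counting rules can be sketched on toy data. Everything below is hypothetical (made-up per-question accuracies, not Scorio's API or the paper's results), but it shows how the same data can produce a tie under one rule and a clear winner under another:

```python
import numpy as np

# Hypothetical per-question accuracies (rows = models, cols = questions).
acc = np.array([
    [0.9, 0.4, 0.8, 0.7, 0.6],   # model_a
    [0.5, 0.9, 0.7, 0.6, 0.8],   # model_b
    [0.6, 0.5, 0.6, 0.8, 0.5],   # model_c
])
names = ["model_a", "model_b", "model_c"]
n = len(names)

# Voting rule: how many questions did each model win outright?
wins_per_question = (acc == acc.max(axis=0)).sum(axis=1)

# Head-to-head: beats[i, j] = number of questions where model i beat model j.
beats = np.array([[(acc[i] > acc[j]).sum() for j in range(n)] for i in range(n)])
h2h_wins = beats.sum(axis=1)

# Graph method: PageRank-style power iteration on the "who beats whom" graph.
# Each loser j passes its rank to the models that beat it, in proportion to beats[:, j].
links = beats.astype(float)
links /= links.sum(axis=0, keepdims=True)
rank = np.full(n, 1.0 / n)
for _ in range(100):
    rank = 0.85 * links @ rank + 0.15 / n

print(dict(zip(names, wins_per_question)))  # voting ties model_a and model_b (2 wins each)...
print(dict(zip(names, h2h_wins)))           # ...but head-to-head favors model_a (6 vs 5 vs 4)
print(names[int(np.argmax(rank))])          # the PageRank-style score agrees: model_a
```

On this toy example the voting rule declares a tie while the pairwise and graph methods pick a winner; with plenty of trials such disagreements wash out, which is exactly the convergence the paper reports at 80 samples.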
4. The Big Takeaway
If you have the budget to run an AI many times, just use the average score. It's simple, fair, and everyone agrees on it.
But if you are on a tight budget (only one try), you need to be careful.
- Use methods that are known to be stable with little data.
- If you use a "hint" (like a greedy prior) to help you rank, make sure that hint actually matches the real behavior of the AI. If the hint is wrong, it will lead you to the wrong conclusion.
Summary in One Sentence
This paper provides a rulebook and a toolkit (Scorio) to help us fairly rank AI models when we can't afford to test them thousands of times, teaching us when to trust simple averages and when to use clever "hints" without getting tricked.