Imagine you are the captain of a ship, and you have a fleet of five different pilots (Large Language Models, or LLMs) to choose from for your next voyage. You want to pick the absolute best one.
Traditionally, the "leaderboards" we see online act like a fixed scoreboard. They say: "Pilot A is #1, Pilot B is #2, and Pilot C is #3." They treat these rankings as absolute facts, like the final score of a basketball game.
The Problem:
The authors of this paper argue that this scoreboard is misleading. It's not a final score; it's more like a weather forecast based on a single, shaky thermometer.
- Context Matters: Pilot A might be amazing at navigating storms (coding tasks) but terrible at navigating calm seas (creative writing). Pilot B might be the opposite. A single global ranking ignores the specific conditions of your trip.
- The Noise Factor: The data used to create these rankings comes from human opinions, which are noisy and imperfect. Sometimes, the difference between "Pilot A is #1" and "Pilot B is #1" is just a fluke of the sample, not a real difference in skill.
- The Danger: If you blindly follow a fixed leaderboard, you might send your ship into a storm with a pilot who is only "ranked #1" because of a statistical fluke, leading to a crash.
The Solution: The "Foggy Map" Approach
Instead of giving you a single, rigid line showing who is #1, this paper proposes a dynamic, uncertainty-aware map.
Think of it like this:
- Old Way: A GPS that says, "You are here, and the best route is definitely Path A," even when the map is blurry.
- New Way: A GPS that says, "For a short trip, Pilot A is definitely the best. But for a long, complex trip, the data is too fuzzy to tell who is better. So, here is a cloud of possibilities where Pilot A, B, and C are all tied for first place."
How It Works (The Metaphor):
Contextual Utility (The "Specialist" Lens):
Imagine the pilots have different tools. The paper builds a model that asks: "How good is Pilot A specifically for a 500-word creative story?" vs. "How good is Pilot A for a 2,000-word legal contract?"
The model realizes that as the "prompt" (the task) changes, the ranking changes. A pilot who is #1 for short tasks might drop to #5 for long, complex tasks.
Confidence Sets (The "Fog of War"):
This is the most important part. Instead of just saying "Pilot A is #1," the model draws a foggy circle around the answer.
- Clear Fog: If the data is strong, the circle is small. "We are 95% sure Pilot A is better than Pilot B."
- Thick Fog: If the data is weak (e.g., the task is very long and hard to judge), the circle expands. "We honestly don't know who is better. They could be anywhere from #1 to #5."
- The Result: The system admits when it doesn't know. It refuses to force a fake ranking when the evidence isn't there.
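The "foggy circle" idea can be sketched with a simple bootstrap: resample the human votes many times, see who comes out on top in each resample, and keep every model that wins often enough to be a plausible #1. This is a minimal illustration, not the paper's actual method; the vote counts, model names ("A", "B", "C"), and the `rank_one_set` helper are all made up for the example.

```python
import random
from collections import Counter

# Hypothetical pairwise human-preference votes (winner, loser), already
# filtered to a single context, e.g. "long creative-writing prompts".
votes = ([("A", "B")] * 60 + [("B", "A")] * 40
         + [("A", "C")] * 52 + [("C", "A")] * 48
         + [("B", "C")] * 55 + [("C", "B")] * 45)

def rank_one_set(votes, n_boot=2000, alpha=0.05, seed=0):
    """Bootstrap confidence set for the #1 spot.

    Resample the votes with replacement, rank models by total wins in
    each resample, and keep every model that lands on top in more than
    an alpha fraction of resamples. A set with several members means
    the data cannot separate them at this confidence level.
    """
    rng = random.Random(seed)
    tops = Counter()
    for _ in range(n_boot):
        sample = [rng.choice(votes) for _ in range(len(votes))]
        wins = Counter(winner for winner, _ in sample)
        tops[max(wins, key=wins.get)] += 1
    return {m for m, c in tops.items() if c / n_boot > alpha}

print(rank_one_set(votes))
```

With a clear margin the set shrinks to a single model; with noisy, closely matched data it expands, which is exactly the "thick fog" case where the system refuses to force a fake ranking.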
Why This Matters for You:
- Stop Over-Reacting: If you see a model jump from #4 to #3 on a leaderboard, this new method says, "Wait, that's probably just noise. Don't switch your entire system based on that."
- Smart Routing: If you are a company sending thousands of requests, you can route "creative writing" tasks to the model that is statistically proven to be best for that specific type, and route "math" tasks to a different one.
- Safety First: When the "fog" is too thick (meaning the models are indistinguishable for a specific task), the system tells you: "Don't pick based on quality; pick based on cost or speed." It prevents you from making expensive mistakes based on fake precision.
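The routing logic above can be sketched in a few lines: if the per-category confidence set names a single winner, route on quality; if several models are statistically tied, fall back to cost. The categories, prices, and confidence sets below are invented for illustration, not results from the paper.

```python
# Assumed per-category 95% confidence sets for the "best" model, plus
# made-up prices (dollars per million tokens).
COST = {"A": 15.0, "B": 10.0, "C": 2.0}
BEST_SET = {
    "creative_writing": {"A"},            # clear winner: quality decides
    "math":             {"B"},
    "long_legal":       {"A", "B", "C"},  # thick fog: statistically tied
}

def route(category):
    """Pick a model for a request in the given category.

    If one model is provably best, use it. If several are
    indistinguishable, choose the cheapest of the tied set
    (a singleton set just returns its only member).
    """
    tied = BEST_SET[category]
    return min(tied, key=COST.get)

print(route("creative_writing"))  # A: quality winner
print(route("long_legal"))        # C: tie, so cheapest wins
```

The key design choice is that the tie-breaker (cost here, but latency would work too) only kicks in when the evidence genuinely cannot separate the models.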
In a Nutshell:
This paper teaches us to stop treating AI rankings like a final exam score and start treating them like a weather report. Sometimes the sun is out (clear dominance), but often it's foggy (uncertainty). The smartest decision-makers don't ignore the fog; they plan their journey knowing exactly how thick it is.