The Big Idea: The "Good Average" Trap
Imagine you are a hiring manager. You have a new AI assistant that helps you screen job applicants. You ask the AI to rate 100 different candidates on a scale of 1 to 100.
You check the AI's performance by comparing its ratings to your own final hiring decisions. You calculate a "correlation score" (a measure of how well the AI agrees with you). The score is 0.47. In the world of data, that looks "okay." It's not perfect, but it's not terrible. You think, "Great, this AI is good enough to help me pick the best candidate!"
The paper says: Stop. You are being fooled.
While the AI agrees with you on average, it is actually terrible at picking the single best candidate for any specific job opening. If you use this AI to pick the winner from a group of 4 applicants, it succeeds only about 21% of the time. Picking one of the 4 at random succeeds 25% of the time, so the AI is no better than blind chance at the one task you hired it for.
The Analogy: The "Classroom Test" vs. The "Race"
To understand why this happens, let's use a classroom analogy.
The Scenario:
You have 5,000 different math tests (prompts). On each test, there are 4 students (candidates) trying to solve the same problem.
- The Oracle (Truth): You know exactly who got the best score on every single test.
- The Judge (AI): The AI gives a score to every student on every test.
The Deception (Global Correlation):
The AI is very good at noticing how hard the test is.
- On an easy test, all students get high scores (90s). The AI gives them all 90s.
- On a hard test, all students get low scores (40s). The AI gives them all 40s.
Because the AI correctly predicts that "Easy tests = High scores" and "Hard tests = Low scores," its overall agreement with you looks great. It's like a weather forecaster who is right 90% of the time because they just say "It's summer" every day. They are right about the season, but they can't tell you if it's going to rain today.
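You can reproduce this deception in a few lines. The following is a toy simulation (invented numbers, not the paper's data): the "judge" only senses prompt difficulty, never candidate quality, yet its global correlation with the truth is nearly perfect while its within-prompt picks are pure chance.

```python
# Toy simulation of the "good average" trap: a judge that only tracks
# prompt difficulty shows a high global correlation but picks the true
# best candidate at chance level (~25% for 4 candidates).
import random

random.seed(0)
n_prompts, n_cands = 5000, 4

truth, judge, hits = [], [], 0
for _ in range(n_prompts):
    difficulty = random.uniform(20, 90)  # easy test vs hard test
    true_scores = [difficulty + random.gauss(0, 2) for _ in range(n_cands)]
    judge_scores = [difficulty + random.gauss(0, 2) for _ in range(n_cands)]
    truth += true_scores
    judge += judge_scores
    # Does the judge's top pick match the true best candidate?
    hits += true_scores.index(max(true_scores)) == judge_scores.index(max(judge_scores))

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

print(f"global correlation: {pearson(truth, judge):.2f}")  # very high
print(f"picks true best:    {hits / n_prompts:.0%}")       # near chance (25%)
```

The difficulty signal dominates both sets of scores, so the global number looks great even though the judge has zero information about which candidate on a given prompt is best.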
The Failure (Within-Prompt Ranking):
The real job isn't to know if the test is hard or easy. The job is to look at the 4 students on a specific test and say, "Student A is slightly better than Student B."
In the paper's data, the AI is terrible at this.
- The Tie Problem: The AI only uses about 20 different score numbers (like 1, 2, 3... up to 20). When two students are very close in quality, the AI often gives them the exact same score (e.g., both get a "15").
- The Result: When the AI gives a tie, it can't pick a winner. It has to flip a coin. Since it ties 67% of the time, it ends up picking the winner by random chance most of the time.
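The tie mechanism is easy to see in a toy sketch (illustrative numbers, not the paper's): a judge whose raw signal genuinely tracks quality, but which only reports ~20 discrete score levels, keeps tying close candidates and is forced to coin-flip.

```python
# Toy sketch of the tie problem: quantizing an informative score to a
# coarse ~20-level scale produces frequent ties at the top, and the
# tie-break is a coin flip.
import random

random.seed(1)
n_prompts, n_cands = 5000, 4
ties, hits = 0, 0
for _ in range(n_prompts):
    quality = [random.gauss(50, 3) for _ in range(n_cands)]  # close candidates
    coarse = [round(q / 5) for q in quality]                 # coarse score bins
    best = quality.index(max(quality))
    top = max(coarse)
    winners = [i for i, s in enumerate(coarse) if s == top]
    ties += len(winners) > 1
    hits += random.choice(winners) == best                   # coin flip on ties

print(f"tie rate:        {ties / n_prompts:.0%}")
print(f"picks true best: {hits / n_prompts:.0%}")
```

The coarser the scale relative to the real quality gaps, the higher the tie rate and the closer the pick rate drifts toward random chance.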
The "Best-of-N" Problem
In the real world, we often use AI to do Best-of-N selection. This means:
- Ask an AI to generate 4 different answers to a question.
- Ask a "Judge AI" to score them.
- Pick the highest-scoring one.
The paper found that even if the Judge AI has a "decent" global score, it fails at this specific task because it can't distinguish between the "good" answers and the "great" answers when they are all on the same prompt. It's like a judge who can tell the difference between a Ferrari and a bicycle, but can't tell the difference between a Ferrari and a slightly faster Ferrari.
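The Best-of-N procedure itself is tiny. Here is a minimal sketch, where `generate` and `judge_score` are hypothetical stand-ins for your generator and judge model calls:

```python
# Minimal Best-of-N selection: generate N candidates, score each with a
# judge, return the highest-scoring one. `generate` and `judge_score`
# are hypothetical callables, not a real API.
def best_of_n(prompt, generate, judge_score, n=4):
    candidates = [generate(prompt) for _ in range(n)]
    scores = [judge_score(prompt, c) for c in candidates]
    return candidates[scores.index(max(scores))]

# Toy usage with stand-in functions (a "judge" that prefers longer answers):
answers = iter(["ok", "great", "fine", "meh"])
pick = best_of_n("q", lambda p: next(answers), lambda p, c: len(c))
print(pick)  # "great"
```

Note that the whole pipeline hinges on that one `max` call: if the judge ties the good and great candidates, `max` is deciding by accident, not by merit.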
The Solution: Change the Game
The authors tested a few ways to fix this:
1. Stop Scoring, Start Comparing (Pairwise Judging)
Instead of asking the AI, "Rate this answer from 1 to 100," ask it, "Which is better: Answer A or Answer B?"
- Result: This worked much better! By forcing the AI to make a direct choice, it stopped giving ties. Its ability to pick the winner jumped from 21% to 61%.
- Why: It's easier for humans (and AIs) to say "A is better than B" than to assign a perfect number to A and a perfect number to B.
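A simple way to wire this up is a running "king of the hill" over pairwise comparisons. This is a hedged sketch, with `judge_prefers` as a hypothetical callable wrapping your judge model:

```python
# Pairwise selection sketch: instead of absolute 1-100 scores, ask the
# judge "is A better than B?" and keep a running winner. `judge_prefers`
# is a hypothetical stand-in for a real judge-model call.
def pick_by_pairwise(prompt, candidates, judge_prefers):
    winner = candidates[0]
    for challenger in candidates[1:]:
        if judge_prefers(prompt, challenger, winner):
            winner = challenger
    return winner

# Toy usage: a "judge" that prefers the shorter answer
best = pick_by_pairwise("q", ["aaaa", "aa", "aaa"],
                        lambda p, a, b: len(a) < len(b))
print(best)  # "aa"
```

Because the judge must output a binary choice at every step, it can never hide behind a tie; in practice you may also want to swap the A/B order to control for position bias.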
2. Check the Right Metrics
Don't just look at the "Global Correlation" (the overall agreement). You need to look at:
- Recovery Rate: How much better is the AI's choice compared to just picking randomly?
- Tie Rate: How often does the AI say "I don't know, they are equal"? If this number is high, the AI is useless for picking winners.
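The two diagnostics above can be computed per prompt. This sketch assumes the plain-English definitions in the text (a scaled gap between the judge's pick and a random pick, and the fraction of prompts with a shared top score), not the paper's exact formulas:

```python
# Diagnostics sketch: recovery rate (how much of the random-to-best gap
# the judge's pick recovers, averaged over prompts) and tie rate (share
# of prompts where the judge's top score is shared). Definitions are
# assumed from the text, not taken verbatim from the paper.
def diagnostics(prompts):
    # prompts: list of (true_scores, judge_scores) pairs, one per prompt
    recov, ties = 0.0, 0
    for true_scores, judge_scores in prompts:
        top = max(judge_scores)
        winners = [i for i, s in enumerate(judge_scores) if s == top]
        ties += len(winners) > 1
        picked = sum(true_scores[i] for i in winners) / len(winners)  # coin flip on ties
        best, rand = max(true_scores), sum(true_scores) / len(true_scores)
        recov += (picked - rand) / (best - rand) if best > rand else 1.0
    n = len(prompts)
    return recov / n, ties / n

rec, tie = diagnostics([([10, 30, 20, 40], [2, 3, 3, 3]),    # three-way tie
                        ([50, 70, 60, 80], [1, 2, 3, 4])])   # clean win
print(f"recovery {rec:.2f}, tie rate {tie:.0%}")  # recovery 0.67, tie rate 50%
```

A recovery rate of 0 means the judge is no better than random picking; 1 means it always finds the true best. Read it alongside the tie rate, since a high tie rate is usually what drags recovery down.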
The Takeaway for Everyone
If you are building AI systems or using them to make decisions:
- Don't trust the headline number. A "good" correlation score doesn't mean the AI is good at picking the best option.
- Watch out for ties. If your AI keeps giving the same score to different options, it's not actually making a decision; it's just guessing.
- Ask the right questions. If you need to pick a winner, ask the AI to compare options directly (A vs. B) rather than asking it to grade them individually.
In short: A judge that is good at grading the difficulty of the exam is not necessarily good at picking the top student. To pick the top student, you need a judge that can see the tiny differences between the best candidates, not just the big differences between easy and hard tests.