Rescaling Confidence: What Scale Design Reveals About LLM Metacognition

This paper demonstrates that the design of confidence scales significantly impacts LLM metacognition, revealing that standard 0–100 formats induce heavy discretization while 0–20 scales consistently improve metacognitive efficiency.

Yuyang Dai

Published Wed, 11 Ma

Here is an explanation of the paper "Rescaling Confidence" using simple language and creative analogies.

The Big Idea: The "Ruler" Problem

Imagine you ask a very smart, but slightly quirky, robot to tell you how sure it is about an answer. You ask, "On a scale of 0 to 100, how confident are you?"

The robot says, "I'm 95% sure!"

You ask another question. It says, "I'm 95% sure!"

You ask a third. "95% sure!"

It turns out that for most Large Language Models (LLMs), the "0 to 100" scale isn't a smooth, continuous ruler. It's more like a broken ruler with only three or four marks on it. The robot ignores almost all the numbers between 0 and 100 and just picks the same "round" numbers (like 90, 95, or 100) over and over again.

This paper asks: Is the way we ask the question (the scale) actually messing up the robot's ability to tell us how sure it really is?

The answer is a loud YES.


The Analogy: The "Pizza Slice" vs. The "Whole Pie"

Think of the model's confidence as a pizza.

  • The Standard Scale (0–100): You tell the robot, "Cut this pizza into 100 tiny slices and tell me how many slices you think you have."
    • What happens: The robot gets confused. Instead of counting carefully, it just grabs the biggest, most obvious slices it knows (like the "95" slice or the "100" slice). It ignores the tiny, specific slices in between. It's like trying to measure a room with a tape measure that only has marks for "1 foot," "5 feet," and "10 feet." You can't get a precise measurement.
  • The Discovery: The researchers found that if you give the robot a smaller, simpler pizza (a scale of 0 to 20), it actually does a better job of telling you the truth. It stops guessing and starts thinking more clearly.

The Three Experiments (The "What If" Tests)

The researchers played "Mad Libs" with the confidence scale to see what happened. They changed three things:

1. The Granularity (How many numbers?)

  • The Test: They tried scales of 0–5, 0–10, 0–20, 0–50, and the standard 0–100.
  • The Result: The standard 0–100 scale was actually the worst at helping the robot's confidence track whether it was really right.
  • The Winner: The 0–20 scale was the "Goldilocks" zone: neither too simple (like 0–5) nor too complicated (like 0–100). When the robot had to pick a number between 0 and 20, it stopped defaulting to "95" and started giving a more honest, nuanced answer.
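As a rough illustration of how an experiment like this can be run, here is a minimal sketch that builds confidence prompts at several granularities and rescales each reported score back to a common 0–1 probability so the scales can be compared. The prompt wording and function names here are illustrative assumptions, not the paper's exact setup.

```python
# Sketch: elicit confidence on scales of different granularity,
# then rescale every answer onto a comparable 0-1 probability.
# The prompt wording below is an assumption, not the paper's template.

SCALE_TOPS = [5, 10, 20, 50, 100]  # the granularities tested

def confidence_prompt(question: str, top: int) -> str:
    """Build a question plus an instruction to rate confidence on 0..top."""
    return (
        f"{question}\n"
        f"After your answer, rate your confidence as an integer "
        f"from 0 (a pure guess) to {top} (certain)."
    )

def rescale(score: int, top: int) -> float:
    """Map a raw score on a 0..top scale to a probability in [0, 1]."""
    return score / top

# Example: a score of 19 on the 0-20 scale lands at 0.95,
# directly comparable to "95" on the standard 0-100 scale.
print(rescale(19, 20))  # 0.95
```

Because every scale is rescaled to the same 0–1 range, the only thing that differs between conditions is the granularity the model is offered, which is exactly the variable the experiment isolates.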

2. The Boundary Shifting (Moving the start line)

  • The Test: They told the robot, "Don't use 0 to 100. Use 60 to 100." They tried to force the robot to use the lower numbers by taking them away.
  • The Result: The robot refused to play along. Even when told "60 is the lowest you can go," the robot kept clustering its answers right at the top (near 100).
  • The Metaphor: It's like telling a shy person, "You can only speak if you shout." They don't suddenly become confident; they just get louder and louder, ignoring the middle ground. The robot treats numbers like lexical tokens (words it has seen a million times) rather than mathematical values. It loves the number "100" because it sees it a lot in its training data, not because it's actually 100% sure.

3. The Weird Scales (Breaking the rules)

  • The Test: They gave the robot weird ranges, like 0 to 73 or 3 to 38. They wanted to see if the robot would stop using "round numbers" (multiples of 5 or 10) if those numbers weren't at the ends of the scale.
  • The Result: The robot still loved round numbers! Even in a range of 3 to 38, it kept picking 35 or 30.
  • The Takeaway: The robot has a "habit" of picking round numbers, just like a human might always order the "medium" size even if they are asked to pick a size between 1 and 7. It's a bias built into its brain (or rather, its code).
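One simple way to quantify this round-number habit is to measure what fraction of the reported scores land on multiples of 5, regardless of where the scale's endpoints sit. A minimal sketch, where the sample responses are hypothetical:

```python
def round_number_share(scores: list[int]) -> float:
    """Fraction of reported scores that are multiples of 5."""
    if not scores:
        return 0.0
    return sum(1 for s in scores if s % 5 == 0) / len(scores)

# Hypothetical responses on the unusual 3-38 scale: even here,
# the model keeps gravitating toward 30 and 35.
reported = [35, 30, 35, 35, 30, 28, 35, 30, 35, 35]
print(round_number_share(reported))  # 0.9
```

On an arbitrary range like 3–38, an unbiased responder has no special reason to favor multiples of 5, so a share far above roughly 20% (their natural density among integers) is a direct signature of the lexical bias described above.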

Why Does This Matter?

Currently, when we build AI systems, we trust the robot's "confidence score" to decide if we should listen to it.

  • If the robot says "95% sure," we assume it's very confident.
  • But this paper shows that "95%" might just be the robot's favorite default setting, not a real calculation.

If we keep using the 0–100 scale, we are getting bad data. We think the robot is calibrated (that its confidence matches how often it's actually right), but it's really just "rounding off" its thoughts.

The Solution: Change the Ruler

The paper suggests three simple rules for anyone using AI confidence:

  1. Use a 0–20 Scale: Instead of asking for 0–100, ask for 0–20. It forces the robot to be more specific and less likely to just default to "100."
  2. Don't Trust the "Perfect" Score: If a robot says it's 100% sure, it might just be because "100" is the top of the ruler, not because it's actually perfect.
  3. Check the Distribution: Before you trust the numbers, look at the histogram. If 80% of the answers are "95" or "100," the scale is broken, and the data is useless.
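Rule 3 is easy to automate: collect the raw confidence scores and flag the scale as broken when a couple of values dominate the histogram. A minimal sketch, with the 80% cutoff taken from the rule of thumb above (the function name and sample data are illustrative):

```python
from collections import Counter

def scale_looks_broken(scores: list[int], threshold: float = 0.8) -> bool:
    """True if the two most common scores cover >= threshold of all answers."""
    counts = Counter(scores)
    top_two = sum(n for _, n in counts.most_common(2))
    return top_two / len(scores) >= threshold

# Hypothetical 0-100 responses: almost everything is "95" or "100",
# so this distribution fails the check.
clustered = [95, 100, 95, 95, 100, 95, 70, 95, 100, 95]
print(scale_looks_broken(clustered))  # True
```

Running this check before trusting a batch of confidence scores costs almost nothing, and it catches exactly the "broken ruler" failure mode the paper describes: a scale where most of the marks are never used.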

The Bottom Line

The way we ask a question changes the answer. By treating the "confidence scale" as a neutral tool, we've been accidentally tricking the AI into lying (or at least, being lazy). By switching to a simpler scale (0–20), we can get a much clearer, more honest picture of what the AI actually knows.

In short: If you want a robot to tell you how sure it is, stop giving it a ruler with 100 marks. Give it a ruler with 20 marks, and it will actually do a better job of measuring.