Rescaling Confidence: What Scale Design Reveals About LLM Metacognition

This paper demonstrates that the design of confidence scales significantly impacts LLM metacognition, revealing that standard 0–100 formats induce heavy discretization while 0–20 scales consistently improve metacognitive efficiency.

Yuyang Dai

Published Wed, 11 Ma

Here is an explanation of the paper "Rescaling Confidence" using simple language and creative analogies.

The Big Idea: The "Ruler" Problem

Imagine you ask a very smart, but slightly quirky, robot to tell you how sure it is about an answer. You ask, "On a scale of 0 to 100, how confident are you?"

The robot says, "I'm 95% sure!"

You ask another question. It says, "I'm 95% sure!"

You ask a third. "95% sure!"

It turns out that for most Large Language Models (LLMs), the "0 to 100" scale isn't a smooth, continuous ruler. It's more like a broken ruler with only three or four marks on it. The robot ignores almost all the numbers between 0 and 100 and just picks the same "round" numbers (like 90, 95, or 100) over and over again.

This paper asks: Is the way we ask the question (the scale) actually messing up the robot's ability to tell us how sure it really is?

The answer is a loud YES.


The Analogy: The "Pizza Slice" vs. The "Whole Pie"

Think of the model's confidence as a pizza.

  • The Standard Scale (0–100): You tell the robot, "Cut this pizza into 100 tiny slices and tell me how many slices you think you have."
    • What happens: The robot gets confused. Instead of counting carefully, it just grabs the biggest, most obvious slices it knows (like the "95" slice or the "100" slice). It ignores the tiny, specific slices in between. It's like trying to measure a room with a tape measure that only has marks for "1 foot," "5 feet," and "10 feet." You can't get a precise measurement.
  • The Discovery: The researchers found that if you give the robot a smaller, simpler pizza (a scale of 0 to 20), it actually does a better job of telling you the truth. It stops guessing and starts thinking more clearly.

The Three Experiments (The "What If" Tests)

The researchers played "Mad Libs" with the confidence scale to see what happened. They changed three things:

1. The Granularity (How many numbers?)

  • The Test: They tried scales of 0–5, 0–10, 0–20, 0–50, and the standard 0–100.
  • The Result: The standard 0–100 scale was actually the worst at helping the robot's confidence track whether it was really right.
  • The Winner: The 0–20 scale was the "Goldilocks" zone: neither too simple (like 0–5) nor too complicated (like 0–100). When the robot had to pick a number between 0 and 20, it stopped defaulting to "95" and started giving a more honest, nuanced answer.
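As a rough illustration of how an experiment like this can be run, here is a minimal sketch that builds confidence prompts at several granularities and rescales each reported score back to a common 0–1 probability so the scales can be compared. The prompt wording and function names here are illustrative assumptions, not the paper's exact setup.

```python
# Sketch: elicit confidence on scales of different granularity,
# then rescale every answer onto a comparable 0-1 probability.
# The prompt wording below is an assumption, not the paper's template.

SCALE_TOPS = [5, 10, 20, 50, 100]  # the granularities tested

def confidence_prompt(question: str, top: int) -> str:
    """Build a question plus an instruction to rate confidence on 0..top."""
    return (
        f"{question}\n"
        f"After your answer, rate your confidence as an integer "
        f"from 0 (a pure guess) to {top} (certain)."
    )

def rescale(score: int, top: int) -> float:
    """Map a raw score on a 0..top scale to a probability in [0, 1]."""
    return score / top

# Example: a score of 19 on the 0-20 scale lands at 0.95,
# directly comparable to "95" on the standard 0-100 scale.
print(rescale(19, 20))  # 0.95
```

Because every scale is rescaled to the same 0–1 range, the only thing that differs between conditions is the granularity the model is offered, which is exactly the variable the experiment isolates.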

2. The Boundary Shifting (Moving the start line)

  • The Test: They told the robot, "Don't use 0 to 100. Use 60 to 100." They tried to force the robot to use the lower numbers by taking them away.
  • The Result: The robot refused to play along. Even when told "60 is the lowest you can go," the robot kept clustering its answers right at the top (near 100).
  • The Metaphor: It's like telling a shy person, "You can only speak if you shout." They don't suddenly become confident; they just get louder and louder, ignoring the middle ground. The robot treats numbers like lexical tokens (words it has seen a million times) rather than mathematical values. It loves the number "100" because it sees it a lot in its training data, not because it's actually 100% sure.

3. The Weird Scales (Breaking the rules)

  • The Test: They gave the robot weird ranges, like 0 to 73 or 3 to 38. They wanted to see if the robot would stop using "round numbers" (multiples of 5 or 10) if those numbers weren't at the ends of the scale.
  • The Result: The robot still loved round numbers! Even in a range of 3 to 38, it kept picking 35 or 30.
  • The Takeaway: The robot has a "habit" of picking round numbers, just like a human might always order the "medium" size even if they are asked to pick a size between 1 and 7. It's a bias built into its brain (or rather, its code).
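One simple way to quantify this round-number habit is to measure what fraction of the reported scores land on multiples of 5, regardless of where the scale's endpoints sit. A minimal sketch, where the sample responses are hypothetical:

```python
def round_number_share(scores: list[int]) -> float:
    """Fraction of reported scores that are multiples of 5."""
    if not scores:
        return 0.0
    return sum(1 for s in scores if s % 5 == 0) / len(scores)

# Hypothetical responses on the unusual 3-38 scale: even here,
# the model keeps gravitating toward 30 and 35.
reported = [35, 30, 35, 35, 30, 28, 35, 30, 35, 35]
print(round_number_share(reported))  # 0.9
```

On an arbitrary range like 3–38, an unbiased responder has no special reason to favor multiples of 5, so a share far above roughly 20% (their natural density among integers) is a direct signature of the lexical bias described above.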

Why Does This Matter?

Currently, when we build AI systems, we trust the robot's "confidence score" to decide if we should listen to it.

  • If the robot says "95% sure," we assume it's very confident.
  • But this paper shows that "95%" might just be the robot's favorite default setting, not a real calculation.

If we keep using the 0–100 scale, we are getting bad data. We think the robot is calibrated (that its confidence matches how often it's actually right), but it's really just "rounding off" its thoughts.

The Solution: Change the Ruler

The paper suggests three simple rules for anyone using AI confidence:

  1. Use a 0–20 Scale: Instead of asking for 0–100, ask for 0–20. It forces the robot to be more specific and less likely to just default to "100."
  2. Don't Trust the "Perfect" Score: If a robot says it's 100% sure, it might just be because "100" is the top of the ruler, not because it's actually perfect.
  3. Check the Distribution: Before you trust the numbers, look at the histogram. If 80% of the answers are "95" or "100," the scale is broken, and the data is useless.
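Rule 3 is easy to automate: collect the raw confidence scores and flag the scale as broken when a couple of values dominate the histogram. A minimal sketch, with the 80% cutoff taken from the rule of thumb above (the function name and sample data are illustrative):

```python
from collections import Counter

def scale_looks_broken(scores: list[int], threshold: float = 0.8) -> bool:
    """True if the two most common scores cover >= threshold of all answers."""
    counts = Counter(scores)
    top_two = sum(n for _, n in counts.most_common(2))
    return top_two / len(scores) >= threshold

# Hypothetical 0-100 responses: almost everything is "95" or "100",
# so this distribution fails the check.
clustered = [95, 100, 95, 95, 100, 95, 70, 95, 100, 95]
print(scale_looks_broken(clustered))  # True
```

Running this check before trusting a batch of confidence scores costs almost nothing, and it catches exactly the "broken ruler" failure mode the paper describes: a scale where most of the marks are never used.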

The Bottom Line

The way we ask a question changes the answer. By treating the "confidence scale" as a neutral tool, we've been accidentally tricking the AI into lying (or at least, being lazy). By switching to a simpler scale (0–20), we can get a much clearer, more honest picture of what the AI actually knows.

In short: If you want a robot to tell you how sure it is, stop giving it a ruler with 100 marks. Give it a ruler with 20 marks, and it will actually do a better job of measuring.