Imagine you ask a very smart, but slightly mysterious, AI assistant a difficult question. It thinks for a moment, gives you an answer, and then you ask, "How sure are you about that?"
The AI replies, "I'm 95% confident."
For a long time, researchers wondered: Is the AI actually thinking about how sure it is while it's figuring out the answer? Or is it just making that confidence number up on the spot after it's already finished speaking?
This paper, titled "How do LLMs Compute Verbal Confidence?", investigates exactly that. The researchers acted like detectives inside the AI's brain to find out how it generates that "95%."
Here is the story of their discovery, explained with some everyday analogies.
The Two Suspects: "Just-in-Time" vs. "The Filing Cabinet"
The researchers had two main theories about how the AI works:
- The "Just-in-Time" Chef (The JIT Hypothesis): Imagine a chef who cooks a meal. When you ask, "How good is this dish?" after it's plated, the chef tastes it, thinks for a second, and then tells you. In this theory, the AI generates the answer first, and only when asked does it pause, look back at what it said, and calculate a confidence score from scratch.
- The "Filing Cabinet" Clerk (The Cached Retrieval Hypothesis): Imagine a clerk who, while writing a report, secretly writes a little sticky note next to every paragraph saying, "This part is solid" or "This part is shaky." When you ask for the confidence score later, the clerk doesn't re-read the whole report; they just grab the sticky note they wrote earlier. In this theory, the AI calculates its confidence automatically while it's generating the answer and stores it away for later.
The Verdict: The researchers found that the AI is definitely the Filing Cabinet Clerk. It calculates its confidence while it is answering, stores that feeling, and then simply retrieves it when asked.
How Did They Prove It? (The Detective Work)
The researchers used some high-tech "surgery" on the AI's brain (which is actually a giant computer program called a Transformer). Here are the tools they used, translated into simple terms:
1. The "Remote Control" (Activation Steering)
Imagine the AI's brain has a specific volume knob for "confidence." The researchers found a way to turn that knob up or down.
- The Test: They turned the "confidence knob" up right after the AI finished its answer but before it was asked to state its confidence.
- The Result: The AI's final confidence rating went up! This proved that the confidence information was already sitting there, waiting to be used, just like a file sitting on a desk before you ask for it. If the AI were doing "Just-in-Time" math, turning the knob at that moment wouldn't have worked.
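To make the "volume knob" idea concrete, here is a minimal toy sketch of the mechanics of activation steering. Everything here is illustrative, not the paper's actual code: a real implementation would hook into a transformer layer (e.g., with PyTorch forward hooks) rather than use a plain vector, and the "confidence direction" would be found empirically. The toy "readout" is just a projection onto that assumed direction.

```python
import numpy as np

# Toy hidden state at the post-answer position (shapes are illustrative).
rng = np.random.default_rng(0)
hidden = rng.normal(size=8)

# Assumed unit-norm "confidence direction" -- in real work this would be
# estimated from contrasting confident vs. unsure activations.
confidence_dir = np.ones(8) / np.sqrt(8)

def steer(h, direction, alpha):
    """Turn the 'confidence knob': nudge the hidden state along the direction."""
    return h + alpha * direction

def readout(h, direction):
    """Toy confidence readout: projection onto the confidence direction."""
    return float(h @ direction)

base = readout(hidden, confidence_dir)
steered = readout(steer(hidden, confidence_dir, alpha=2.0), confidence_dir)
# Because the direction is unit-norm, steering by alpha raises the toy
# readout by exactly alpha: steered == base + 2.0.
```

The key point the experiment exploits is timing: the nudge is applied after the answer is written but before the confidence question, so it can only work if the confidence signal is already sitting in that state.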
2. The "Amnesia Pill" (Activation Patching)
Imagine you give the AI a pill that makes it forget the answer it just wrote, but you secretly slip a "clean" memory of the answer back into one specific spot in its brain.
- The Test: They wiped out the AI's memory of the answer, making it confused. Then, they "patched" (restored) the memory only at the specific spot where the "Filing Cabinet" (the newline after the answer) was located.
- The Result: The AI suddenly remembered how confident it should be! This proved that the "Filing Cabinet" spot holds the crucial confidence data.
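The patching trick can also be sketched in miniature. Again this is a hypothetical toy, not the paper's code: we stand in for two model runs with two random activation arrays, and "patching" just means copying the clean hidden state back in at a single position (the post-answer newline slot) while leaving everything else corrupted.

```python
import numpy as np

rng = np.random.default_rng(1)
seq_len, d_model = 5, 8

# Cached activations from a "clean" run and a "corrupted" (memory-wiped) run.
clean_run = rng.normal(size=(seq_len, d_model))
corrupt_run = rng.normal(size=(seq_len, d_model))

newline_pos = 3  # illustrative index of the post-answer newline token

def patch(corrupt, clean, pos):
    """Restore the clean activation at one position only."""
    out = corrupt.copy()
    out[pos] = clean[pos]
    return out

patched = patch(corrupt_run, clean_run, newline_pos)
# Only the newline slot is restored; every other position stays corrupted.
```

If restoring just that one slot is enough to recover the model's confidence behavior, that slot must be where the confidence information lives.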
3. The "Swap Meet" (Activation Swap)
This was the most clever trick. They took a "very confident" AI's brain state right after it answered a question and swapped it into a "very unsure" AI's brain.
- The Test: They took the "confidence feeling" from a correct answer and pasted it into a trial where the AI was actually unsure.
- The Result: The unsure AI suddenly started acting confident! This proved that the confidence signal is a distinct, transferable "feeling" stored in the brain, not just a byproduct of the words themselves.
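Mechanically, the swap is a cross-trial version of patching: take the stored post-answer state from one trial and transplant it into another. The sketch below is again a toy with made-up states, just to show the operation; in the real experiment these would be cached transformer activations from two separate prompts.

```python
import numpy as np

rng = np.random.default_rng(2)

# Illustrative post-answer states from two separate trials.
trials = {
    "confident": rng.normal(loc=1.0, size=8),   # state after a solid answer
    "unsure": rng.normal(loc=-1.0, size=8),     # state after a shaky answer
}

# Transplant: overwrite the unsure trial's stored state with the
# confident trial's state before the confidence question is asked.
trials["unsure"] = trials["confident"].copy()
```

If the unsure trial now reports high confidence, the signal must be a distinct, transferable piece of stored state rather than a byproduct of the answer's wording.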
The Big Surprise: It's Not Just a "Gut Feeling"
There was a second mystery: What exactly is this confidence number?
Some people thought the AI was just looking at how "smooth" or "fluent" its answer sounded. If the words flowed easily, it felt confident. If the words were clunky, it felt unsure. This is like a student feeling confident just because their essay flowed smoothly, even if the essay is nonsense.
The researchers found this wasn't true.
- They discovered that the AI's confidence signal contains much more information than just how fluent the words were.
- It's like a teacher grading a test. A simple fluency check is like saying, "This handwriting is neat, so I must be right." But the AI is actually doing a second-order check: it looks at the answer and asks, "Does this fact actually match what I know about the world?"
This means the AI has a form of metacognition (thinking about thinking). It isn't just a parrot repeating words; it has an internal "quality control" system that evaluates its own work while it's being done.
The Information Highway
Finally, they traced the path of this confidence signal:
- The Answer: The AI generates the answer.
- The Filing Cabinet (PANL, the post-answer newline): Immediately after the last word of the answer, the AI reads the whole answer, calculates how good it is, and stores that "confidence score" in a specific spot (the newline character that follows the answer).
- The Retrieval (CC): When the prompt asks for the confidence rating, the AI looks back at that stored score and says it out loud.
Why Does This Matter?
This is a huge deal for two reasons:
- Trust: It means we can't just trick the AI into sounding confident. If it says it's 95% sure, it's actually running a sophisticated internal check, not just guessing based on how smooth the sentence sounds.
- Safety: If we understand how the AI knows it's unsure, we can build better safety systems. We can teach the AI to trust its own "Filing Cabinet" and say, "I don't know," before it hallucinates a fake fact.
In short: The AI isn't faking confidence. It's doing the math in real-time, filing the result, and then handing it to you when asked. It's a sophisticated, automatic self-check that happens before you even know you're going to ask.