Between the Layers Lies the Truth: Uncertainty Estimation in LLMs Using Intra-Layer Local Information Scores

This paper proposes a lightweight, single-pass uncertainty estimation method for large language models that leverages intra-layer local information scores to achieve robust, transferable performance across different datasets and quantization levels, outperforming existing probing techniques while revealing insights into how models encode uncertainty.

Zvi N. Badash, Yonatan Belinkov, Moti Freiman

Published 2026-03-25

The Big Problem: The Confident Liar

Imagine you ask a very smart, well-read friend (an AI) a question. They answer instantly, with perfect grammar and total confidence. But they are completely wrong. This is called a hallucination.

Current AI models are great at sounding confident, even when they are guessing. We need a way to tell when the AI is "sure" and when it's just "bluffing." This is called Uncertainty Estimation.

The Old Ways: Why They Fail

The paper looks at how we usually try to catch these lies:

  1. The "Output" Method: We look at the AI's final answer. If the AI says, "I'm 99% sure this is the capital of France," we trust it.
    • The Flaw: Sometimes the AI is 99% sure the answer is Lyon, when it's actually Paris. The AI can be confidently wrong. It's like a con artist who speaks so smoothly you believe them.
  2. The "Internal Probe" Method: We try to peek inside the AI's brain (its hidden layers) to see if it's nervous.
    • The Flaw: The AI's brain is huge and messy. It's like trying to find a specific thought in a library with millions of books, all written in a secret code. It's hard to do, and if you move to a new topic (like switching from history to math), the method often breaks.
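To make the "Output" method concrete, here's a tiny sketch of how that kind of confidence score is typically read off: the model's raw scores (logits) are turned into probabilities with a softmax, and the top probability is taken as confidence. The numbers below are made up for illustration, not from any real model.

```python
import math

# Hypothetical final-layer scores for 4 candidate answer tokens.
logits = [2.0, 0.1, -1.3, 0.5]

# Softmax: subtract the max for numerical stability, exponentiate, normalize.
m = max(logits)
exps = [math.exp(x - m) for x in logits]
total = sum(exps)
probs = [e / total for e in exps]

# The "output" method reads the top probability as confidence —
# which is exactly what a confidently-wrong model can fake.
confidence = max(probs)
print(round(confidence, 2))  # → 0.71
```

The flaw the post describes lives entirely inside this number: nothing in it checks whether the top token is actually correct.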

The New Solution: The "Team Meeting" Analogy

The authors propose a clever new way to check the AI's confidence. Instead of looking at the final answer or the messy whole brain, they look at how the different parts of the AI's brain agree with each other.

Imagine the AI is a company with 100 managers (layers) sitting in a row.

  • Manager 1 reads the question.
  • Manager 2 passes a note to Manager 3, and so on, until Manager 100 writes the final answer.

Usually, if the answer is correct, all the managers are on the same page. They pass notes that flow smoothly.
But if the AI is hallucinating (making things up), the managers start arguing.

  • Manager 10 thinks the answer is "Apple."
  • Manager 50 thinks it's "Orange."
  • Manager 90 is confused.

The Paper's Method:
The authors measure the "disagreement" between every pair of managers. They create a Scorecard (a grid) that shows how much Manager A disagrees with Manager B.

  • Low Disagreement (Smooth Flow): The team is united. The AI is likely correct.
  • High Disagreement (Chaos): The team is fighting. The AI is likely lying or guessing.
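The paper's actual "local information score" has its own definition, but the scorecard idea can be sketched with a stand-in metric: take each layer's hidden state for a token and fill a grid with pairwise cosine distances. The function name and the choice of cosine distance here are illustrative assumptions, not the authors' formula.

```python
import numpy as np

def disagreement_scorecard(hidden_states: np.ndarray) -> np.ndarray:
    """hidden_states: (num_layers, hidden_dim) for one token position.
    Returns a (num_layers, num_layers) grid: entry [a, b] is how much
    "Manager A" disagrees with "Manager B" (cosine distance)."""
    # Normalize each layer's vector to unit length.
    norms = np.linalg.norm(hidden_states, axis=1, keepdims=True)
    unit = hidden_states / np.clip(norms, 1e-8, None)
    # Cosine similarity between every pair of layers, flipped to a distance.
    return 1.0 - unit @ unit.T

# Toy example: 4 "managers" (layers) with 3-dimensional notes.
rng = np.random.default_rng(0)
grid = disagreement_scorecard(rng.normal(size=(4, 3)))
print(grid.shape)  # (4, 4); the diagonal is ~0, since each layer agrees with itself
```

A smooth, low-valued grid is the "united team"; large off-diagonal entries are the "arguing managers."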

Why This is Special

  1. It's Compact: Instead of reading the whole library of books (the massive internal data), they just look at the Scorecard. It's a tiny, simple summary of the team's mood.
  2. It's Fast: They can check this scorecard in a single pass. No need to run the AI twice or ask it the same question ten times.
  3. It Travels Well: This is the biggest win. If you train the system to spot "team arguments" on a History test, it works great on a Math test too. The old methods (probing) usually fail when you switch subjects, but this "Team Agreement" method works everywhere.
  4. It Survives Compression: Even if you shrink the AI to make it run on a phone (quantization), this method still works. It's robust.
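A minimal sketch of why compactness matters: because the scorecard is just a small grid, a tiny probe is enough to learn from it. Here, plain logistic regression (trained by gradient descent) separates "calm" grids from "chaotic" ones. The synthetic data and the probe choice are illustrative assumptions, not the paper's experimental setup.

```python
import numpy as np

rng = np.random.default_rng(42)
num_layers, n = 8, 200

# Synthetic flattened scorecards: correct answers get low disagreement,
# hallucinations get high disagreement (a deliberately easy toy dataset).
low = rng.uniform(0.0, 0.3, size=(n // 2, num_layers * num_layers))
high = rng.uniform(0.5, 1.0, size=(n // 2, num_layers * num_layers))
X = np.vstack([low, high])
y = np.array([0] * (n // 2) + [1] * (n // 2))  # 1 = likely hallucination

# Train a logistic-regression probe with plain gradient descent.
w, b = np.zeros(X.shape[1]), 0.0
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # predicted hallucination probability
    w -= 0.5 * (X.T @ (p - y)) / n
    b -= 0.5 * np.mean(p - y)

preds = (1.0 / (1.0 + np.exp(-(X @ w + b)))) > 0.5
acc = np.mean(preds == y)
print(f"train accuracy: {acc:.2f}")
```

The probe has only `num_layers * num_layers` weights — a far smaller thing to fit (and to transfer across topics) than a probe over full hidden states with thousands of dimensions per layer.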

The Result

The researchers tested this on three different giant AI models. They found:

  • When the AI is in its "home turf" (familiar data), this new method is just as good as the old, complicated methods.
  • When the AI is in a new situation (different data or tasks), this new method is much better at spotting the lies.
  • It gives a better "confidence score," meaning we can trust the AI's "I'm not sure" warnings more.

The Takeaway

The paper suggests that truth is found in the agreement between the layers. By listening to how the different parts of the AI's brain talk to each other, we can tell if the AI is telling the truth or just making noise. It's a lightweight, fast, and reliable way to stop the AI from confidently lying to us.