Imagine you are trying to listen to a very faint whisper in a room that is already filled with loud, static noise. That is essentially what this paper is about, but instead of sound, it's about how AI models "think" when they make mistakes.
Here is the story of the paper, broken down into simple concepts and analogies.
The Big Problem: Three Ways AI Gets Lost
The researchers are studying "hallucinations"—times when an AI makes things up. They've already figured out that there are three distinct ways an AI can get lost, like a hiker in a forest:
- Type 1 (The Drifter): The AI is confused and wanders aimlessly toward the center of the forest, not really knowing where to go. It's weak and directionless.
- Type 2 (The Wrong Turn): The AI is actually very confident! It picks a specific path and walks straight down it. The problem is, it's the wrong path. It's committed to a lie.
- Type 3 (The Dead End): The AI is asked a question that has no answer in its memory (like "What is the color of the number 5?"). It hits a wall and produces weak, nonsensical output because it has nowhere to go.
The Mystery: In previous experiments, the researchers could easily spot the "Dead End" (Type 3). But they couldn't tell the difference between the "Drifter" (Type 1) and the "Wrong Turn" (Type 2). To their measuring tools, both looked like the same kind of confusion.
The Solution: The "Whitening" Glasses
The researchers realized that using their old measuring tools was like trying to spot a faint star with the naked eye in broad daylight. The "noise" of the AI's normal thinking was drowning out the subtle differences between the mistakes.
They invented a new way to look at the data called Whitening.
- The Analogy: Imagine a photo that is so bright and washed out that you can't see the details. "Whitening" is like putting on special sunglasses that rebalance the contrast. It doesn't add any new information to the picture; it just turns down the loud, dominant parts and turns up the quiet ones, so every subtle detail gets equal emphasis and can actually be seen.
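For the curious, the sunglasses analogy has a concrete form. Whitening rescales a cloud of data points so that every direction carries equal variance. The paper's exact recipe isn't given here, so this is a generic ZCA-style sketch with made-up data:

```python
import numpy as np

def whiten(X, eps=1e-8):
    """ZCA-style whitening: center the data, then rescale every
    principal direction to unit variance so no single 'loud'
    direction drowns out the quiet ones."""
    Xc = X - X.mean(axis=0)                       # center the cloud
    cov = Xc.T @ Xc / (len(Xc) - 1)               # covariance matrix
    vals, vecs = np.linalg.eigh(cov)              # its principal directions
    W = vecs @ np.diag(1.0 / np.sqrt(vals + eps)) @ vecs.T
    return Xc @ W

# A "washed out" cloud: one direction is 100x louder than the other.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 2)) * np.array([100.0, 1.0])

Xw = whiten(X)
# After whitening, both directions have roughly equal (unit) variance,
# so subtle structure in the quiet direction is no longer drowned out.
print(np.var(Xw, axis=0).round(2))
```

The key design point: whitening is invertible, so nothing is lost; the faint structure was always there, just buried under the loudest directions.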
The Big Discovery: "Commitment" is the Key
Once they put on these "Whitening Glasses," they found a new way to measure the AI's mistakes. They stopped looking at how "scattered" the thoughts were (which didn't work) and started looking at how committed the AI was to a specific idea.
- Type 2 (The Wrong Turn) was the most committed. It was like a person shouting, "I am definitely going to the beach!" even though they are in a desert. They are very focused on one spot.
- Type 1 (The Drifter) was in the middle. They were wandering, not fully committed to anything.
- Type 3 (The Dead End) had zero commitment. They were looking at a blank wall.
The Result: The "Whitening Glasses" successfully separated the "Wrong Turn" from the "Dead End." The AI's "commitment" level was the secret code that told them which mistake was happening.
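The ordering of the three mistake types can be illustrated with a toy score. The paper's actual "commitment" metric is more involved, but the intuition is how much of the model's probability mass lands on its single favorite choice (all distributions below are invented for illustration):

```python
import numpy as np

def commitment(probs):
    """Toy 'commitment' score: the probability mass the model puts
    on its single favorite option. (Illustrative only; not the
    paper's actual metric.)"""
    return float(np.max(probs))

# Hypothetical next-word distributions over a 5-word vocabulary.
wrong_turn = np.array([0.90, 0.04, 0.03, 0.02, 0.01])  # Type 2: confident
drifter    = np.array([0.40, 0.25, 0.15, 0.12, 0.08])  # Type 1: wandering
dead_end   = np.array([0.20, 0.20, 0.20, 0.20, 0.20])  # Type 3: flat wall

for name, p in [("wrong turn", wrong_turn),
                ("drifter", drifter),
                ("dead end", dead_end)]:
    print(f"{name}: commitment = {commitment(p):.2f}")
```

The scores come out 0.90 > 0.40 > 0.20, matching the story: the confident liar commits hardest, the wanderer is in the middle, and the dead end commits to nothing.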
The Twist: The "Fake" Signal
Here is where it gets interesting. When the researchers first tried this with a small group of test questions (15 questions), they thought they had found a different solution involving "entropy" (a measure of chaos). It looked like a huge breakthrough!
But when they added more variety to the test questions (expanding to 30 questions), that "huge breakthrough" vanished.
- The Analogy: It's like testing a new diet on 15 people who all happen to love pizza. They lose weight, so you conclude the diet works. But when you test it on 30 people with more varied tastes, the effect disappears. The first result was a fluke caused by the specific group of people (or prompts) you chose.
- The Lesson: In the world of AI, tiny differences are so fragile that the specific questions you ask can trick you. You need a very diverse set of questions to be sure you aren't seeing a ghost.
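The lesson is easy to demonstrate with a quick simulation. Below, two groups of "prompts" are drawn from the exact same distribution, so any gap between them is pure luck. With only 15 prompts per group, the typical coincidental gap is noticeably larger than with 30 (all numbers here are invented, not the paper's data):

```python
import numpy as np

rng = np.random.default_rng(42)

def apparent_gap(n_prompts, n_trials=5000):
    """Draw two prompt groups from the SAME distribution many times
    and report the typical (purely coincidental) gap between their
    average scores."""
    a = rng.normal(0.0, 1.0, size=(n_trials, n_prompts))
    b = rng.normal(0.0, 1.0, size=(n_trials, n_prompts))
    gaps = np.abs(a.mean(axis=1) - b.mean(axis=1))
    return gaps.mean()

# Smaller prompt sets produce bigger ghost "effects".
print(f"typical fluke gap with 15 prompts: {apparent_gap(15):.3f}")
print(f"typical fluke gap with 30 prompts: {apparent_gap(30):.3f}")
```

There is no real effect anywhere in this simulation, yet the 15-prompt runs regularly show a larger gap than the 30-prompt runs. That is exactly the kind of ghost the researchers caught themselves chasing.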
The Final Verdict: It's a Capacity Issue
The researchers tried to see if the difference between the "Drifter" and the "Wrong Turn" was hidden in a specific part of the AI's brain (a specific frequency band). They looked everywhere, but they couldn't find it.
The Conclusion: The difference isn't hidden; it's just that the AI they used (GPT-2-small) is too small to make that distinction clearly.
- The Analogy: Imagine trying to tell the difference between two shades of blue using a black-and-white TV. No matter how you adjust the contrast, you can't see the difference because the TV isn't powerful enough.
- The Prediction: The researchers predict that if you use a much bigger, smarter AI (with more "brain power"), it will be able to tell the difference between drifting and taking a wrong turn. The "Whitening Glasses" revealed the potential for the difference, but the current AI just isn't strong enough to show it clearly yet.
Summary for the Everyday Reader
- AI makes three types of mistakes: Wandering, confidently lying, or hitting a dead end.
- Old tools couldn't tell the first two apart.
- New tools ("Whitening") revealed that "confidence" (commitment) is the key: Liars are confident; wanderers are not.
- Beware of small test groups: Sometimes results look real just because of the specific questions you asked.
- The AI is just too small: The current model is too weak to perfectly distinguish between a confused wanderer and a confident liar, but bigger models will likely be able to do it.
This paper teaches us that to catch AI hallucinations, we need to look at how focused the AI is, not just how chaotic it seems, and we need bigger brains to catch the subtlest lies.