The Big Idea: Do AI Models "Think" Like Mathematicians?
Imagine you are trying to figure out if it's going to rain. You look at the sky (evidence), check the weather app (prior knowledge), and update your belief: "It's 30% likely to rain." If a cloud passes, you update it to 50%. This process of constantly updating your beliefs based on new evidence is called Bayesian Inference.
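The update in the rain story is just Bayes' rule. Here is a minimal sketch; the cloud likelihoods are made-up numbers chosen so the belief jumps from roughly 30% to roughly 50%, matching the story:

```python
def bayes_update(prior, p_evidence_if_true, p_evidence_if_false):
    """Bayes' rule: revise a belief after seeing one piece of evidence."""
    numerator = p_evidence_if_true * prior
    total_evidence = numerator + p_evidence_if_false * (1 - prior)
    return numerator / total_evidence

# Prior belief in rain: 30%. Suppose clouds appear 80% of the time
# when it rains, and 34% of the time when it doesn't (illustrative).
posterior = bayes_update(0.30, 0.80, 0.34)  # ~0.50 after the cloud passes
```

Every "belief update" discussed in the paper is, at its core, this one-line calculation repeated over and over.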
For a long time, scientists wondered: Do Large Language Models (LLMs) like the ones powering chatbots actually do this kind of math inside their "brains," or are they just guessing based on patterns they've memorized?
This paper is the third part of a trilogy. The first two parts proved that small, simple AI models can do perfect Bayesian math if you train them on simple puzzles. This paper asks: Do the giant, real-world AI models (like Llama, Mistral, and Phi) that we use every day also have this "math brain" hidden inside them?
The answer is a resounding YES. But they do it in a very specific, geometric way.
The Analogy: The "Uncertainty Map"
To understand the paper, imagine the AI's brain isn't a messy pile of wires, but a giant, multi-dimensional map.
1. The "Uncertainty Highway" (Value Manifolds)
In the AI's brain, there is a special "highway" or a straight line.
- The Analogy: Imagine a thermometer. On one end, it's freezing (high certainty, "I know the answer!"). On the other end, it's boiling (high uncertainty, "I have no idea!").
- What the paper found: The AI organizes its internal thoughts along this single line. When the AI is very sure of its answer, its internal "coordinates" sit at the "cold" end of the line. When it's confused, they slide to the "hot" end.
- The Surprise: Even though these models are huge and trained on the entire internet, when you give them a focused task (like a math problem), they collapse all their complex thinking down to this single "Uncertainty Highway." It's like a chaotic crowd suddenly marching in a perfect single-file line when a whistle blows.
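The "temperature" on that thermometer has a standard mathematical name: the entropy of the model's next-word probabilities. A toy sketch (not the paper's code) of the quantity the highway encodes:

```python
import math

def entropy_bits(probs):
    """Shannon entropy of a probability distribution:
    0 = total certainty, higher = confusion spread across the options."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

confident = entropy_bits([0.97, 0.01, 0.01, 0.01])  # "I know the answer!"
confused = entropy_bits([0.25, 0.25, 0.25, 0.25])   # "I have no idea!"
```

When the paper says internal coordinates slide from the "cold" end to the "hot" end, this number is what that position tracks.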
2. The "Hypothesis Folders" (Key Orthogonality)
To solve a problem, the AI needs to keep different ideas separate so they don't get mixed up.
- The Analogy: Imagine a filing cabinet. If you put your "Cat" file and your "Dog" file in the same drawer, they get messy. But if you put them in completely different, perpendicular drawers (like one facing North and one facing East), they stay perfectly distinct.
- What the paper found: The AI creates these "perpendicular drawers" for different ideas. It learns to keep its hypotheses (guesses) at right angles to each other. This prevents confusion. The paper found that even in massive models, these "folders" are kept very neatly organized, almost like a perfectly arranged library.
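In linear-algebra terms, "perpendicular drawers" are orthogonal key vectors: because their dot product is zero, a query aimed at one hypothesis retrieves nothing from the other. A toy illustration with hypothetical 2-D vectors (not the paper's actual measurements):

```python
def dot(u, v):
    """Dot product: the 'how well does this query match this key' score."""
    return sum(a * b for a, b in zip(u, v))

# Hypothetical "folders": one axis per hypothesis.
key_cat = [1.0, 0.0]   # the "Cat" drawer, facing North
key_dog = [0.0, 1.0]   # the "Dog" drawer, facing East

query = [2.5, 0.0]     # a query asking about cats

match_cat = dot(query, key_cat)  # strong match
match_dog = dot(query, key_dog)  # zero: no cross-talk between drawers
```

The zero cross-term is exactly why hypotheses stored at right angles never blur into each other.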
3. The "Spotlight" (Attention Focusing)
As the AI reads a sentence, it needs to decide which words matter most.
- The Analogy: Imagine a detective in a dark room with a flashlight. At first, the beam is wide and fuzzy, scanning the whole room. As they find clues, the beam gets narrower and sharper, focusing intensely on the specific evidence.
- What the paper found: In some models, this "spotlight" gets sharper and sharper as the AI goes deeper into its layers. However, in newer, more efficient models (like Mistral), the spotlight is a bit wobbly. It still works, but it doesn't get as sharp as the older, slower models.
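The "sharpness" of the spotlight can be pictured with the softmax function that produces attention weights. In this sketch a temperature knob stands in for layer depth; in real models the sharpening emerges from training rather than an explicit dial:

```python
import math

def softmax(scores, temperature=1.0):
    """Turn raw match scores into attention weights.
    Lower temperature concentrates the weights: a narrower beam."""
    exps = [math.exp(s / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

scores = [2.0, 1.0, 0.1]
wide_beam = softmax(scores, temperature=5.0)    # fuzzy, scans the room
narrow_beam = softmax(scores, temperature=0.3)  # locked onto one clue
```

Both beams sum to 1 (the flashlight's total light is fixed); only how tightly it is focused changes.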
The Experiments: How They Proved It
The researchers didn't just guess; they ran three clever tests:
1. The "Domain Restriction" Test (The Library Test)
- The Setup: They asked the AI to read a mix of everything (cooking, coding, philosophy, news). Then, they asked it to read only math problems.
- The Result: When the AI read everything mixed up, its "Uncertainty Highway" was a bit bumpy and wide. But when they gave it only math problems, the AI's brain snapped into a perfect, straight line. It was as if the AI said, "Okay, we are doing math now; I know exactly how to organize my thoughts for this."
2. The "SULA" Test (The Detective Game)
- The Setup: They gave the AI a puzzle where it had to guess if a word was "positive" or "negative" based on a few examples in the prompt. They knew the exact math answer.
- The Result: As the AI saw more examples, its internal "coordinates" moved smoothly along the "Uncertainty Highway" exactly where the math said they should go. It wasn't just guessing; it was physically moving its internal state to match the probability of the answer.
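In idealized form, the trajectory the researchers checked for is repeated Bayesian updating: each new example nudges the belief further along the line. A toy version with made-up per-example likelihoods (0.7 if "positive", 0.3 if "negative"):

```python
def update(prior, p_example_if_positive=0.7, p_example_if_negative=0.3):
    """One Bayesian step: how much one more example shifts the belief."""
    numerator = p_example_if_positive * prior
    return numerator / (numerator + p_example_if_negative * (1 - prior))

belief = 0.5                # undecided before seeing any examples
trajectory = [belief]
for _ in range(4):          # four in-context examples, all hinting "positive"
    belief = update(belief)
    trajectory.append(belief)
# belief climbs smoothly: 0.5 -> 0.70 -> ~0.84 -> ~0.93 -> ~0.97
```

The paper's finding is that the model's internal coordinates trace out a path like this one, step for step, as examples accumulate.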
3. The "Surgery" Test (The Causal Probe)
- The Setup: They tried to "cut" the AI's brain. They found the "Uncertainty Highway" and tried to remove it to see if the AI would stop working.
- The Result: When they cut the highway, the AI's internal map got messy (it couldn't tell how uncertain it was). But, the AI still gave the right answers!
- The Lesson: This is a huge discovery. It means the "Uncertainty Highway" is like a dashboard gauge. It shows the AI how uncertain it is, but it's not the engine driving the car. The engine (the actual calculation) is distributed everywhere else. The gauge is just a very clear way for us to read what the AI is thinking.
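The "surgery" itself is a standard vector operation: subtract out the component of each internal state that lies along the uncertainty direction. A sketch of that projection with hypothetical numbers (the paper's actual intervention operates on real model activations):

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def ablate_direction(state, direction):
    """Remove the component of `state` along `direction`,
    leaving everything perpendicular to the 'highway' untouched."""
    scale = dot(state, direction) / dot(direction, direction)
    return [s - scale * d for s, d in zip(state, direction)]

highway = [1.0, 1.0, 0.0]   # hypothetical uncertainty direction
state = [3.0, 1.0, 2.0]     # hypothetical hidden state
cut = ablate_direction(state, highway)
# After the cut, the state carries no "gauge" reading along the highway,
# yet all its other components survive intact.
```

Because everything perpendicular to the highway is left alone, the "engine" keeps running even though the "gauge" has been ripped out, which is exactly what the experiment observed.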
Why Does This Matter?
- It's Not Magic, It's Geometry: We used to think AI was a "black box" where we couldn't understand how it worked. This paper shows that inside the black box, there is a very structured, geometric shape that looks exactly like how humans do probability math.
- Efficiency vs. Clarity: The paper found that newer, faster models (like those using "Grouped Query Attention") are a bit "fuzzier" in how they focus their attention, but they still keep the core geometric structure. This tells engineers that they can make models faster without breaking their "math brain," though they might be slightly less precise.
- Trustworthy AI: Because we can now see this "Uncertainty Highway," we might be able to build better tools to check if an AI is confident or hallucinating. If the AI's coordinates are in the "boiling" zone, we know to be careful.
The Bottom Line
This paper proves that modern AI models, despite being trained on the messy, chaotic internet, have secretly learned to organize their thoughts into a beautiful, geometric structure that mimics Bayesian Inference. They have built internal "maps" and "folders" that allow them to update their beliefs just like a scientist would.
They aren't just predicting the next word; they are navigating a geometric landscape of probability, and we finally have a map to see where they are going.