The Big Picture: Why AI Agents Sometimes Go Crazy
Imagine you have built a team of highly intelligent robots (AI agents) to work together on a complex project, like planning a space mission. You give them the exact same instructions, on the exact same computers, at the exact same time. You expect them to produce the exact same plan every time.
But sometimes, they don't. One robot says, "Let's go to Mars," and the other says, "No, let's go to the Moon," even though they started with the same data.
This paper investigates why this happens. The authors discovered that the problem isn't that the AI is "confused" or "hallucinating" in the usual sense. Instead, the problem is mathematical chaos caused by the tiny, invisible imperfections in how computers do math.
The Core Problem: The "Rounding Error Avalanche"
Computers can't store numbers with infinite precision the way math on paper assumes. They use "floating-point" numbers, which are like a ruler with a limited number of markings. A number like 3.14159265... might have to be rounded to 3.14159.
In a simple math problem, rounding a tiny bit off doesn't matter. But in a Large Language Model (LLM), the data passes through dozens of layers of "neural networks" (think of these as a long, winding slide).
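You can see this rounding in one line of ordinary Python; nothing here is LLM-specific, it's just how standard double-precision arithmetic works:

```python
# Doubles cannot represent 0.1, 0.2, or 0.3 exactly, so the ORDER in
# which you add them changes which way the result rounds.
left_first = (0.1 + 0.2) + 0.3
right_first = 0.1 + (0.2 + 0.3)

print(left_first)                 # 0.6000000000000001
print(right_first)                # 0.6
print(left_first == right_first)  # False
```

On GPUs, parallel sums are accumulated in whatever order the hardware schedules the work, so this kind of reordering (and therefore a slightly different rounded answer) can happen on every run.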
The Analogy: The Snowball Effect
Imagine a tiny speck of dust (a rounding error) landing on a snowball at the top of a mountain.
- The Avalanche: As the snowball rolls down the mountain (through the AI's layers), that tiny speck of dust causes the snowball to pick up more snow. By the time it reaches the bottom, the speck has turned into a massive avalanche that destroys the village (the AI's output).
- The Flat Road: Sometimes, that same speck of dust lands on a flat patch of road. It just sits there, and the snowball rolls past it without changing direction at all.
The paper found that LLMs are like a mountain with both steep cliffs and flat roads. A microscopic error can either vanish completely or explode into a massive change, depending on exactly where it lands.
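The avalanche side of this picture can be sketched with a toy stand-in for a deep network (random linear layers with ReLU; this is my illustration, not the paper's actual model). Push a rounding-error-sized nudge through it and watch the output change grow by many orders of magnitude:

```python
import numpy as np

rng = np.random.default_rng(0)
depth, dim = 40, 64

# Toy "deep network": a stack of random linear layers with ReLU.
layers = [rng.normal(scale=2.0 / np.sqrt(dim), size=(dim, dim)) for _ in range(depth)]

def forward(x):
    for W in layers:
        x = np.maximum(W @ x, 0.0)  # ReLU
    return x

x = rng.normal(size=dim)
eps = 1e-12 * rng.normal(size=dim)  # a rounding-error-sized nudge

in_size = np.linalg.norm(eps)
out_size = np.linalg.norm(forward(x + eps) - forward(x))
print(f"nudge in: {in_size:.1e}  change out: {out_size:.1e}")
```

Turn the layer scale down (say to `0.5 / np.sqrt(dim)`) and the same nudge shrinks toward zero instead: that's the flat road.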
The Three "Weather Zones" of AI
The authors identified three distinct zones where the AI behaves differently when you tweak the numbers slightly:
1. The "Frozen Lake" (Constant Regime)
- What it is: You poke the AI with a tiny nudge, and nothing happens. The output stays exactly the same.
- Analogy: Imagine pushing a heavy boulder on a frozen lake. You push it, but it doesn't move an inch. The AI is "frozen" in its decision.
- Why it matters: This is good for stability, but it means the AI is ignoring tiny, potentially important details.
2. The "Whirlpool" (Chaotic Regime)
- What it is: This is the dangerous zone. A tiny nudge (so small it's invisible to humans) causes the AI to spin wildly and output a completely different answer.
- Analogy: Imagine dropping a single grain of sand into a swirling whirlpool. That grain doesn't just sink; it triggers a massive change in the water's flow, sending the whole whirlpool spinning in a new direction.
- The Finding: The paper found that near the "decision lines" (where the AI is unsure between two answers, like "Yes" vs. "No"), the AI is incredibly fragile. A microscopic math error can flip the decision.
3. The "Clear Signal" (Signal-Dominated Regime)
- What it is: You make a big change to the input (like changing the question entirely), and the AI responds logically. The "noise" of the math errors is drowned out by the actual meaning of the words.
- Analogy: If you shout a new instruction over a loud radio, the static (math errors) doesn't matter. You hear the new message clearly.
The "Magic Coin" Discovery
The researchers did something fascinating. They poked the AI from different "directions." In linear algebra, a matrix stretches some directions ("strong" ones, where a nudge gets amplified) and squashes others ("weak" ones, where a nudge gets shrunk).
- Old Theory: We thought the AI would only be unstable in the "strong" directions.
- New Discovery: The AI is unstable everywhere. Whether you poke it in a "strong" direction or a "weak" direction, the result is the same: a tiny math error either vanishes or explodes.
The Metaphor: Imagine a house of cards. You might think it's only unstable if you blow on the top card. But this paper found that if you blow on any card, from any angle, the whole house might collapse. The instability is a universal property of the AI's architecture, not just a specific weak spot.
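Here is a hedged linear-algebra sketch of that idea (my toy construction, not the paper's experiment): perturb a deep stack of linear layers along the first layer's strongest and weakest singular directions, and see that after enough depth, both nudges blow up enormously:

```python
import numpy as np

rng = np.random.default_rng(1)
dim, depth = 32, 30

# Toy linear network: the end-to-end map is just a product of matrices.
layers = [rng.normal(scale=2.0 / np.sqrt(dim), size=(dim, dim)) for _ in range(depth)]
M = np.eye(dim)
for W in layers:
    M = W @ M

# "Strong" and "weak" input directions of the FIRST layer only.
_, _, Vt = np.linalg.svd(layers[0])
strong_growth = np.linalg.norm(M @ Vt[0])   # rows of Vt are unit vectors
weak_growth = np.linalg.norm(M @ Vt[-1])
print(f"strong direction amplified {strong_growth:.1e}x")
print(f"weak direction amplified {weak_growth:.1e}x")
```

The two growth factors differ, but both are astronomically larger than 1: once an error is inside the system, depth washes out which direction it arrived from.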
Why Does This Matter for the Real World?
- Multi-Agent Chaos: If you have a team of AI agents talking to each other, one agent might send a message with a tiny rounding error. The next agent receives it, and because of the "avalanche effect," it interprets the message completely differently. This explains why AI teams often fail to agree on plans (the 23-31% failure rates mentioned in the paper).
- Safety Risks: If an AI is controlling a self-driving car or a medical diagnosis system, we need to know if a tiny math glitch could make it swerve into a wall or misdiagnose a patient.
- It's Not Just "Bad Code": You can't fix this just by writing better code or using faster computers. It's a fundamental law of how computers handle numbers. Even if you use super-precise math (like using a ruler with a million markings instead of a thousand), you just push the problem to a smaller scale; the chaos is still there.
The Solution: The "Noise Filter"
The paper proposes a clever fix called Noise Averaging.
The Analogy: Imagine trying to hear a whisper in a windy room. The wind (random math errors) makes it hard to hear.
- The Fix: Instead of listening once, you listen to the same whisper 100 times, with the wind blowing differently each time. Then you average what you heard.
- The Result: The random wind noise cancels itself out, and the true whisper (the actual AI logic) becomes clear.
The authors showed that by running the AI calculation a few times and averaging the results, they could "smooth out" the chaos and get a reliable, stable answer.
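A minimal sketch of the averaging idea, with made-up numbers (a nearly tied two-way decision plus simulated per-run "rounding" noise; the noise model and constants are my assumptions):

```python
import numpy as np

rng = np.random.default_rng(2)

true_logits = np.array([2.000, 2.002])  # answer 1 wins, but only just
noise_scale = 0.01                      # simulated noise, 5x the real gap

def one_run():
    return true_logits + rng.normal(scale=noise_scale, size=2)

# Single runs flip-flop, because the noise swamps the tiny real gap.
singles = [int(np.argmax(one_run())) for _ in range(10)]
print(singles)

# Averaging 10,000 runs shrinks the noise ~100x; the true answer emerges.
averaged = np.mean([one_run() for _ in range(10_000)], axis=0)
print(int(np.argmax(averaged)))
```

The trade-off is compute for stability: averaging N runs buys roughly a factor-of-√N reduction in the noise.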
Summary
Large Language Models are like incredibly sensitive instruments. They are so finely tuned that the tiny, invisible "static" of computer math can sometimes cause them to flip-flop between answers or freeze completely. This isn't a bug; it's a feature of how digital math works at scale. To build reliable AI systems, we need to understand these "chaotic zones" and use tricks like averaging to filter out the noise.