Non-normal spectral signatures of instability in neural… — Plain-Language Explanation

The Big Picture: Why Do AI Models Sometimes "Freak Out"?

Imagine you are teaching a robot to walk. Usually, it learns smoothly. But sometimes, it suddenly trips, flails its arms wildly, loses its balance, and then eventually finds its footing again. In the world of AI (neural networks), these are called training instabilities. You see them as sudden spikes in error (loss) or the model shaking back and forth before settling down.

For a long time, scientists thought they understood why this happened. They believed it was like a car going too fast over a bumpy road: if the bumps (mathematical "sharpness") were too high for the car's speed (learning rate), the car would crash.

This paper argues that this old explanation is incomplete. It says that even if the car is driving at a "safe" speed and the road looks smooth, the car can still flip over. Why? Because the car's steering mechanism is non-normal.

The Core Concept: "Non-Normal" Steering

To understand "non-normal," let's use a swing analogy.

The Old View (Normal Systems): Imagine a simple swing. If you push it, it swings back and forth. If the swing is stable, it eventually stops. If you push it too hard, it goes too high and falls. In this world, you only need to check how fast the swing is moving (the spectral radius) to know if it will crash. If the speed is low enough, you are safe.
The New View (Non-Normal Systems): Now, imagine a swing that is attached to a weird, springy, twisting pole. If you give it a tiny nudge, it doesn't just swing back and forth. Instead, the nudge gets amplified wildly for a few seconds before it finally settles down.
- Even if the swing is technically "stable" (it won't fly off forever), that initial transient amplification can be huge.
- The paper calls this non-normality. It means the system has a hidden "spring" that can temporarily blow up a small mistake into a massive error, even if the long-term math says everything is fine.
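The "amplified before it settles" behavior can be seen in a tiny numerical sketch. The 2x2 matrix below is an illustrative choice, not one taken from the paper: both of its eigenvalues are below 1, so it is stable in the long run, yet repeated application first magnifies a small input many times over.

```python
import numpy as np

def power_norms(J, steps):
    """Return the spectral norm of J^k for k = 0..steps."""
    norms = []
    P = np.eye(J.shape[0])
    for _ in range(steps + 1):
        norms.append(np.linalg.norm(P, 2))
        P = J @ P
    return np.array(norms)

# Illustrative non-normal matrix (an assumption for this sketch):
# eigenvalues are 0.9 and 0.8, so the spectral radius is below 1,
# but the large off-diagonal entry couples the two directions.
J = np.array([[0.9, 5.0],
              [0.0, 0.8]])

rho = max(abs(np.linalg.eigvals(J)))
norms = power_norms(J, 200)

print(f"spectral radius: {rho:.2f}")
print(f"peak amplification: {norms.max():.1f} at step {norms.argmax()}")
print(f"amplification after 200 steps: {norms[-1]:.2e}")
```

The peak is the "temporary blow-up"; the final value shows that the long-term prediction of decay still holds.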

The Two Main Culprits: Adam and Momentum

The paper looks at two popular ways AI learns: Adam and SGD with Momentum. It proves mathematically that both of these methods create this "twisting pole" effect.

Adam: This optimizer tries to adjust its learning speed for every single part of the model individually. The paper shows that because it changes the "rules" for each part differently, it creates a mismatch between the map of the terrain (the Hessian) and the rules of the road (the preconditioner). This mismatch creates the "twisting pole" that causes temporary explosions in error.
SGD with Momentum: This method gives the model "inertia," like a heavy wheel. The paper shows that the way this momentum is stored and used creates a structure where a small push can be magnified before it dies out.

The New Warning System: The "Condition Number"

Since the old way of checking stability (looking at the speed/spectral radius) fails to catch these temporary explosions, the authors propose a new tool.

The Old Tool (Spectral Radius): This is like checking the speedometer. It tells you whether the car will eventually go out of control in the long run, but it misses the fact that the car might flip over right now because of a weird bump.
The New Tool (Eigenvector Condition Number, $\kappa(V)$): The authors introduce a new number they call $\kappa(V)$.
- Analogy: Think of this as a "Sensitivity Meter."
- If the meter is low, the system is like a sturdy boat: a small wave just makes it rock a little.
- If the meter is high, the system is like a house of cards: a tiny breeze (a small error) can cause the whole thing to collapse temporarily.
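As a rough sketch of how such a "Sensitivity Meter" could be read off, the snippet below computes $\kappa(V)$ for two small matrices using NumPy. The matrices and the helper name `eigvec_condition_number` are illustrative assumptions, not the paper's code. Note that `np.linalg.eig` returns unit-length eigenvectors, and the value of $\kappa(V)$ depends on that scaling.

```python
import numpy as np

def eigvec_condition_number(J):
    """kappa(V) = ||V||_2 * ||V^{-1}||_2 for the eigenvector matrix V of J."""
    _, V = np.linalg.eig(J)
    return np.linalg.cond(V, 2)

# Symmetric (normal) matrix: orthogonal eigenvectors, so kappa(V) is about 1.
normal = np.array([[0.9, 0.1],
                   [0.1, 0.8]])

# Similar diagonal, but a lopsided off-diagonal coupling (non-normal).
non_normal = np.array([[0.9, 5.0],
                       [0.0, 0.8]])

kappa_normal = eigvec_condition_number(normal)
kappa_non_normal = eigvec_condition_number(non_normal)

for name, J, kappa in [("normal", normal, kappa_normal),
                       ("non-normal", non_normal, kappa_non_normal)]:
    rho = max(abs(np.linalg.eigvals(J)))
    print(f"{name:>10}: rho = {rho:.3f}, kappa(V) = {kappa:.1f}")
```

Both matrices have a spectral radius below 1, so the "speedometer" looks similar for each, while the meter reading differs by a large factor.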

What the Experiments Showed

The researchers tested this on a simple AI model (a two-layer network) to see if their theory held up.

The "Safe" Speed Trap: They ran the AI with settings that the old math said were "stable" (the speedometer was fine).
The Result: The AI still had massive spikes in error (it tripped and fell).
The New Tool Worked: While the old speedometer stayed calm, the new Sensitivity Meter ($\kappa(V)$) went crazy. It jumped up by roughly 10 times (about an order of magnitude) right before the AI tripped.
The Conclusion: The old tool couldn't tell the difference between a stable run and an unstable one. The new tool could clearly separate them.

Special Cases: The "Tipping Points"

The paper also talks about Exceptional Points. Imagine a tightrope walker. Usually, they are just unsteady. But at a specific point, the rope and the wind align perfectly, and the walker becomes incredibly unstable.

The paper says these "perfect alignment" points are the mathematical limit where the Sensitivity Meter goes to infinity.
While the AI doesn't usually hit these exact points, it often gets close to them, which is why the Sensitivity Meter spikes so high before a crash.
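To make "getting close to a tipping point" concrete, here is a small sketch with a made-up 2x2 matrix whose two eigenvalues are pushed closer and closer together. The matrix family and function names are assumptions chosen for illustration. As the gap shrinks, the two eigenvectors become nearly parallel and $\kappa(V)$ grows without bound.

```python
import numpy as np

def eigvec_condition_number(J):
    """kappa(V) for the unit-norm eigenvector matrix returned by NumPy."""
    _, V = np.linalg.eig(J)
    return np.linalg.cond(V, 2)

def near_exceptional(delta):
    """2x2 matrix whose eigenvalues 0.9 and 0.9 - delta merge as delta -> 0."""
    return np.array([[0.9, 1.0],
                     [0.0, 0.9 - delta]])

deltas = [1e-1, 1e-2, 1e-3, 1e-4]
kappas = [eigvec_condition_number(near_exceptional(d)) for d in deltas]

for d, k in zip(deltas, kappas):
    print(f"eigenvalue gap {d:.0e}: kappa(V) = {k:.1e}")
```

At a gap of exactly zero the matrix has only one eigenvector direction, which is the exceptional point itself; the loop stops short of that because $\kappa(V)$ is not finite there.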

Summary of the Takeaway

The Problem: AI models often crash or spike in error even when they are supposed to be stable according to traditional math.
The Cause: The math behind popular AI optimizers (Adam, Momentum) is "non-normal." This means small errors can get temporarily amplified into huge mistakes before the system corrects itself.
The Solution: We need a new way to measure stability. Instead of just checking the "speed" (spectral radius), we should check the "sensitivity" (the condition number $\kappa(V)$ ).
The Benefit: This new measure acts as an early warning system. It can tell you, "Hey, the system is about to have a temporary explosion of error," even if the long-term math says you are fine.

Note: The authors clarify that this is a diagnostic tool. It explains why the spikes happen and gives a warning, but it doesn't automatically fix them. It's like a smoke detector: it tells you there's a fire, but you still need to know how to put it out (e.g., by adjusting learning rates or clipping gradients).

Technical Summary: Non-normal spectral signatures of instability in neural network training dynamics

Problem Statement
Training instabilities in deep neural networks—manifesting as loss spikes, oscillatory convergence, and gradient pathologies—are empirically common but lack a rigorous operator-theoretic explanation. The standard theoretical framework relies on the eigenspectrum of the Hessian matrix ( $H$ ), assuming that stability is determined solely by the spectral radius $\rho(J) < 1$ of the update operator. This framework implicitly assumes the update operator is normal (i.e., its eigenvectors are orthogonal), a condition that holds for vanilla gradient descent but fails for practically used optimizers like Adam and SGD with momentum. Consequently, the spectral radius criterion may fail to detect transient amplification of perturbations, where errors grow significantly even when all eigenvalues lie strictly within the stability boundary.

Methodology
The paper applies non-normal stability theory, drawing from fluid mechanics and numerical analysis, to the linearized update operators of neural network optimizers.

Operator Formulation: The authors derive the linearized update operators ($J$) for Adam and SGD with momentum.
- For Adam, the operator is $J = I - \eta M^{-1}H$ , where $M$ is the diagonal adaptive preconditioner.
- For SGD with momentum, the operator is defined on an augmented state space $(\theta, v)$ , resulting in a block matrix structure.
Non-Normality Analysis: The authors prove that these operators are generically non-normal ($J^\dagger J \neq J J^\dagger$).
- For Adam, non-normality is controlled by the commutator $[H, M]$ . Since $H$ is generally non-diagonal and $M$ is coordinate-dependent, they do not commute.
- For SGD with momentum, non-normality arises intrinsically from the off-diagonal block structure of the augmented state-space update, independent of the Hessian.
Stability Metrics: Instead of relying solely on the spectral radius $\rho(J)$ , the paper utilizes the eigenvector condition number $\kappa(V) = \|V\| \cdot \|V^{-1}\|$ (where $V$ is the matrix of eigenvectors) and the $\epsilon$ -pseudospectrum. These tools quantify transient growth bounds and spectral sensitivity to perturbations.
Numerical Validation: Experiments were conducted on a two-layer MLP (241 parameters) trained on a synthetic regression task using Adam and SGD with momentum. The study tracked $\kappa(V)$ , $\rho(J)$ , and the Hessian's largest eigenvalue $\lambda_{\max}(H)$ against observed loss spikes.
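The operator forms and the non-normality check described in this section can be sketched numerically. The Adam operator follows the stated form $J = I - \eta M^{-1}H$ with a frozen diagonal $M$. The momentum block matrix uses a standard heavy-ball linearization, which is an assumption here since the summary does not spell out the paper's exact block layout. The matrices, step sizes, and function names are likewise illustrative.

```python
import numpy as np

def normality_defect(J):
    """Frobenius norm of J^T J - J J^T; zero exactly when a real J is normal."""
    return np.linalg.norm(J.T @ J - J @ J.T)

def adam_operator(H, m_diag, eta):
    """J = I - eta * M^{-1} H with a frozen diagonal preconditioner M."""
    return np.eye(H.shape[0]) - eta * (H / m_diag[:, None])

def momentum_operator(H, eta, beta):
    """Assumed heavy-ball linearization on the augmented state (theta, v):
    v' = beta * v - eta * H @ theta, then theta' = theta + v'."""
    I = np.eye(H.shape[0])
    return np.block([[I - eta * H, beta * I],
                     [-eta * H,    beta * I]])

# Small symmetric, non-diagonal Hessian and a non-uniform preconditioner.
H = np.array([[2.0, 1.0],
              [1.0, 3.0]])
m_diag = np.array([1.0, 4.0])
M = np.diag(m_diag)

commutator = H @ M - M @ H                      # [H, M]
J_gd = adam_operator(H, np.ones(2), eta=0.1)    # M = I recovers plain GD
J_adam = adam_operator(H, m_diag, eta=0.1)
J_mom = momentum_operator(H, eta=0.1, beta=0.9)

print("||[H, M]||       :", np.linalg.norm(commutator))
print("defect, plain GD :", normality_defect(J_gd))
print("defect, Adam-like:", normality_defect(J_adam))
print("defect, momentum :", normality_defect(J_mom))
```

With $M = I$ the operator is symmetric and the defect vanishes, which matches the statement that vanilla gradient descent is normal. A non-uniform $M$ or the momentum block structure makes the defect nonzero.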

Key Contributions and Results

Proof of Generic Non-Normality: The paper establishes that the linearized update operators for Adam and SGD with momentum are generically non-normal. For Adam, this is a direct consequence of the non-commutativity between the Hessian and the adaptive preconditioner.
Transient Amplification Bound: The authors derive a conservative precursor bound (Theorem 2) showing that transient amplification can occur for $O(\log \kappa(V) / \log(1/\rho))$ steps even when $\rho(J) < 1$ . This explains how loss spikes can occur despite the spectral radius suggesting stability.
$\kappa(V)$ as an Early-Warning Indicator: Numerical experiments demonstrate that while the spectral radius $\rho(J)$ remains nearly constant (e.g., in the range $[1.00, 1.04]$ ) and fails to distinguish between stable and unstable training phases, the eigenvector condition number $\kappa(V)$ separates these phases by approximately one order of magnitude. High values of $\kappa(V)$ (50–500) correlate with instability phases, while low values (10–30) correlate with stable convergence.
Complementarity with Sharpness: The classical sharpness criterion ( $\lambda_{\max}(H) > 2/\eta$ ) provides a binary threshold signal consistent with the "Edge of Stability" literature. In contrast, $\kappa(V)$ provides a continuous severity measure of non-normal amplification within the unstable regime, offering complementary diagnostic information.
Exceptional Points as Limits: The paper identifies Exceptional Points (EPs)—where eigenvalues and eigenvectors coalesce—as the mathematical limit where $\kappa(V) \to \infty$ . The authors argue that EPs are not the general mechanism for loss spikes but rather represent the extreme limit of the non-normal framework; training trajectories typically pass near EPs, causing large but finite $\kappa(V)$ values.
Quasi-Static Approximation Limits: For Adam, the authors note that the quasi-static approximation (freezing the preconditioner $M$ ) fails in early training, leading to monotonic growth in $\rho(J)$ that does not reflect actual instability. The non-normal precursor framework is most applicable in the late-training regime where the preconditioner has converged.
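The step count in the transient amplification bound above can be motivated by the textbook diagonalization argument. This is a sketch under the assumption that $J$ is diagonalizable, $J = V \Lambda V^{-1}$, and it is not a reproduction of the paper's Theorem 2:

$$\|J^k\| = \|V \Lambda^k V^{-1}\| \le \|V\| \, \|V^{-1}\| \, \|\Lambda^k\| = \kappa(V) \, \rho(J)^k .$$

For $\rho(J) < 1$, the right-hand side falls below 1 only once

$$k \ge \frac{\log \kappa(V)}{\log(1/\rho(J))} ,$$

so growth is not ruled out during the first $O(\log \kappa(V) / \log(1/\rho))$ steps. As an illustrative calculation with made-up values, $\kappa(V) = 100$ and $\rho = 0.9$ give $\log 100 / \log(1/0.9) \approx 44$ steps.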

Significance and Claims
The paper claims to establish non-Hermitian operator theory as a useful and underexplored framework for understanding neural network optimization stability.

It offers a diagnostic language (via $\kappa(V)$ and pseudospectra) to explain phenomena that the standard spectral radius criterion cannot detect.
It provides a proof-of-concept benchmark demonstrating that transient amplification is a structural consequence of adaptive preconditioning and momentum, rather than a specific artifact of loss geometry.
The authors present their main result as a conservative precursor bound; they hypothesize that linearized transient growth corresponds to nonlinear loss spikes but acknowledge that this requires empirical validation rather than theoretical proof.
The paper suggests that practical techniques like gradient clipping and learning rate warmup can be reinterpreted as implicit strategies for navigating the pseudospectral stability boundary, though it does not claim to have designed these techniques based on this theory.

The work concludes that a spectral radius below one is necessary but not sufficient for stability in non-normal systems, and that $\kappa(V)$ serves as a continuous measure of instability severity.
