NerVE: Nonlinear Eigenspectrum Dynamics in LLM Feed-Forward Networks

The paper introduces NerVE, a lightweight eigenspectral framework that reveals how FFN nonlinearities and optimizer geometry govern information flow and latent dimension utilization in LLMs, providing stable spectral signatures that predict generalization and guide architectural design beyond trial-and-error.

Nandan Kumar Jha, Brandon Reagen

Published 2026-03-10

Imagine a Large Language Model (like the AI you're talking to right now) as a massive, multi-story factory. Inside this factory, information flows from one floor to the next.

Most people focus on the "Attention" mechanism—the part that decides which words in a sentence are important to each other. But there's another huge part of the factory called the Feed-Forward Network (FFN). It takes up most of the space and uses most of the electricity, yet nobody really understood how it worked. It was like a black box: you put data in, and magic happened, but the gears inside were a mystery.

This paper introduces a new tool called NerVE (Nonlinear Eigenspectrum Dynamics in FFNs) to peek inside that black box. Think of NerVE as a high-tech spectral X-ray that lets us see how the factory rearranges its internal energy.

Here is the breakdown of what they found, using simple analogies:

1. The Problem: The "Traffic Jam"

Imagine the data flowing into the FFN is like a river of water. Before it hits the FFN, the water is mostly flowing in just one or two main channels. It's a narrow, fast stream. If the factory only used these few channels, it would be very inefficient and could only do simple things.

2. The Magic: The "Variance Re-injection"

The paper discovered that the FFN has a special "nonlinear" switch (an activation function like GELU or ReLU). When the water hits this switch, something amazing happens: it explodes outward.

Instead of staying in those one or two channels, the FFN takes that concentrated energy and sprays it out into hundreds of new, previously empty channels.

  • The Analogy: Imagine a firehose spraying water into a single bucket. The FFN is like a magical nozzle that takes that single stream and turns it into a wide, gentle mist that fills the entire room.
  • Why it matters: This "spraying" wakes up dormant parts of the AI's brain. It allows the model to use its full size and complexity, rather than just a tiny fraction of it.
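The "spraying" effect is easy to reproduce in a toy experiment. The NumPy sketch below is my own illustration (not the paper's code): it feeds nearly rank-one data through a random FFN-style up-projection and a GELU, then measures the participation ratio (effective number of active dimensions) of the feature covariance before and after the nonlinearity. A purely linear projection cannot create new directions, so the spread only appears after the activation.

```python
import numpy as np

rng = np.random.default_rng(0)

def participation_ratio(x):
    """Effective number of active dimensions: (sum lam)^2 / sum(lam^2)
    over the eigenvalues lam of the feature covariance."""
    lam = np.clip(np.linalg.eigvalsh(np.cov(x, rowvar=False)), 0, None)
    return lam.sum() ** 2 / (lam ** 2).sum()

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

# Nearly rank-one input: 1024 "tokens" in 64 dims, dominated by one direction.
x = rng.normal(size=(1024, 1)) @ rng.normal(size=(1, 64))
x += 0.01 * rng.normal(size=(1024, 64))          # a little noise

W = rng.normal(size=(64, 256)) / np.sqrt(64)     # toy FFN up-projection
pre = x @ W            # still ~rank-one: a linear map adds no new directions
post = gelu(x @ W)     # the nonlinearity sprays variance into new directions

print(participation_ratio(pre), participation_ratio(post))
```

Running this shows the participation ratio staying near 1 before the GELU and rising after it, which is exactly the "one stream becomes a mist" picture above, in miniature.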

3. The Four "Gauges" of NerVE

To measure this, the authors created four simple dials (metrics) to read the factory's dashboard:

  • Spectral Entropy (The "Spread" Gauge):

    • Low reading: The energy is stuck in a few channels (boring, inefficient).
    • High reading: The energy is spread out evenly across many channels (efficient, creative).
    • Finding: The FFN consistently turns the dial from "Low" to "High."
  • Participation Ratio (The "Team Size" Gauge):

    • Low reading: Only a few workers are doing all the heavy lifting.
    • High reading: A huge team of workers is all contributing.
    • Finding: The FFN recruits more workers, making the team bigger and more effective.
  • Eigenvalue Early Enrichment (The "Boss" Gauge):

    • High reading: One "Boss" eigenvalue is hogging all the power (Top-heavy).
    • Low reading: Power is shared fairly among everyone.
    • Finding: The FFN dethrones the "Boss" and shares the power, making the system more stable.
  • Jensen-Shannon Divergence (The "Transformation" Gauge):

    • This measures how much the shape of the data changed between entering and leaving the FFN.
    • Finding: A high reading means the FFN did a lot of heavy lifting to reshape the data.
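All four gauges can be read off the eigenvalue spectrum of the feature covariance. The sketch below is my own NumPy rendering of the standard formulas, not the paper's exact implementation; in particular, zero-padding the two spectra to a common length before taking the Jensen-Shannon divergence (since the FFN input and hidden widths differ) is my assumption.

```python
import numpy as np

def spectrum(x):
    """Normalized eigenvalue distribution of the feature covariance of x."""
    lam = np.clip(np.linalg.eigvalsh(np.cov(x, rowvar=False)), 0, None)
    return lam / lam.sum()

def spectral_entropy(p):
    p = p[p > 0]
    return -(p * np.log(p)).sum()     # high = energy spread across channels

def participation_ratio(p):
    return 1.0 / (p ** 2).sum()       # effective "team size" of dimensions

def top_eig_share(p):
    return p.max()                    # how much the "Boss" eigenvalue hogs

def js_divergence(p, q):
    # Zero-pad to a common length (assumption), then standard JSD.
    n = max(len(p), len(q))
    p = np.pad(p, (0, n - len(p)))
    q = np.pad(q, (0, n - len(q)))
    m = 0.5 * (p + q)
    kl = lambda a, b: np.sum(a[a > 0] * np.log(a[a > 0] / b[a > 0]))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)
```

In use, you would compute `spectrum(...)` on the activations entering and leaving the FFN and compare the gauges: the paper's finding is that entropy and participation ratio go up, the top eigenvalue's share goes down, and the JS divergence quantifies how much total reshaping happened.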

4. The Surprising Discoveries

The "Repair Crew" vs. The "Refinement Crew"
The authors found that the Optimizer (the algorithm that teaches the AI) changes how the FFN works:

  • AdamW (The old standard): It's a bit clumsy. It often lets the data get "collapsed" or stuck before it even reaches the FFN. So, the FFN has to work overtime as a Repair Crew, frantically trying to fix the mess and wake up the dormant channels. It works, but it's exhausting.
  • Muon (The new star): It's a smooth operator. It keeps the data flowing nicely before it even hits the FFN. The FFN doesn't need to do emergency repairs; it just acts as a Refinement Crew, gently polishing the data to make it even better. This is why models trained with Muon often perform better.

Normalization Matters
If you remove the "LayerNorm" (a standard stabilizer in AI), the FFN has to work even harder.

  • With GELU (a smooth activation), the factory gets stuck in a traffic jam (spectral inertia).
  • With ReLU (a sharper activation), the FFN goes into overdrive, violently breaking the traffic jam and forcing the data into new channels. It's like a bouncer kicking people out of a VIP lounge to let everyone in.
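The "bouncer" behavior comes from the shape of the two activations, which is easy to see directly. This small sketch (my own illustration) compares them: ReLU hard-zeros every negative input, which is the abrupt gating that can break spectral inertia, while GELU lets small negative values leak through, slightly damped.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)        # hard cut: negatives become exactly zero

def gelu(x):
    # tanh approximation of GELU: a smooth, probabilistic gate
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

x = np.linspace(-3, 3, 7)
print(relu(x))   # [0, 0, 0, 0, 1, 2, 3]: everything negative is discarded
print(gelu(x))   # small negatives survive, gently scaled toward zero
```

That hard zeroing is what makes ReLU the "bouncer": it forcibly redistributes which channels carry signal, whereas GELU's smooth gate changes things more gradually.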

5. Why Should You Care?

This isn't just about math; it's about building better AI.

  • Stop Guessing: Instead of trial-and-error (trying random settings and hoping for the best), engineers can now use NerVE to look at the "gauges" during training. If the "Spread" gauge is low, they know the model isn't using its full brain and can fix the architecture.
  • Better Choices: It tells us that choosing the right optimizer (like Muon) or the right activation function (like GELU vs. ReLU) fundamentally changes how the AI thinks, not just how fast it learns.
  • Universal Truth: They tested this on text models (LLMs) and image models (MLP-Mixers), and the rule held true: Nonlinearity is the key that unlocks the AI's potential by spreading energy across its entire brain.

In a nutshell: The FFN isn't just a passive pipe; it's an active energy distributor. It takes a concentrated beam of information and scatters it across the whole system, allowing the AI to think in high definition. NerVE is the tool that finally lets us see that distribution happening in real-time.