Asymptotic behavior of eigenvalues of large rank perturbations of large random matrices

This paper develops an asymptotic analysis for the eigenvalues of deformed Wigner random matrices with full-rank, highly correlated perturbations, providing a theoretical foundation for understanding the spectrum of trained Deep Neural Networks and enabling novel Random Matrix Theory-based pruning techniques.

Original authors: Ievgenii Afanasiev, Leonid Berlyand, Mariia Kiyashko

Published 2026-04-21

This is an AI-generated explanation of the paper below. It is not written or endorsed by the authors. For technical accuracy, refer to the original paper.

Imagine you are trying to understand the "personality" of a massive, chaotic crowd. In the world of mathematics and machine learning, this crowd is represented by a Deep Neural Network (DNN)—a computer program that learns to recognize images, translate languages, or drive cars.

To understand how this network thinks, mathematicians look at its "weight matrix." Think of this matrix as a giant spreadsheet of numbers that determines how information flows through the network.

This paper is about a specific mathematical puzzle: What happens to the "loud voices" (outliers) in this giant spreadsheet when the network gets huge and the noise gets complicated?

Here is the breakdown using simple analogies:

1. The Setup: The Crowd and the Signal

Imagine a giant stadium filled with N people.

  • The Random Noise (R): Most of the people are just chatting randomly. They don't know each other, and their conversations are pure noise. In math, this is a "Wigner matrix." If you look at the whole crowd, the noise creates a predictable, smooth shape (like a bell curve or a semicircle).
  • The Signal (S): But, hidden in the crowd, there are some people shouting specific, coordinated messages. These are the "outliers." In a trained AI, these represent the actual patterns the AI has learned (like recognizing a cat vs. a dog).

The total matrix W is the sum of the random noise and the signal: W = R + S.
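
To see this setup in action, here is a minimal NumPy sketch. The Gaussian entries, the diagonal signal, and the 1/√N scaling are illustrative assumptions for the demo, not choices taken from the paper: the pure noise fills a semicircle between -2 and 2, and adding the signal pushes a few eigenvalues (the "loud voices") outside.

```python
import numpy as np

# Minimal sketch of the setup (illustrative assumptions, not the
# paper's model): Gaussian Wigner noise plus a simple diagonal signal.
# With the 1/sqrt(N) normalization, the noise bulk fills [-2, 2].
N = 2000
rng = np.random.default_rng(0)

# R: the Wigner "crowd" -- symmetric matrix of independent random entries.
A = rng.normal(size=(N, N)) / np.sqrt(N)
R = (A + A.T) / np.sqrt(2)

# S: a few coordinated "shouters" of strength 3.
S = np.zeros((N, N))
S[np.arange(4), np.arange(4)] = 3.0

noise_eigs = np.linalg.eigvalsh(R)        # ascending order
total_eigs = np.linalg.eigvalsh(R + S)

print("pure noise, largest eigenvalue:", round(noise_eigs[-1], 3))      # ~ 2 (bulk edge)
print("noise + signal, top eigenvalues:", np.round(total_eigs[-5:], 3)) # 4 outliers appear
```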

2. The Old Theory vs. The New Reality

For a long time, mathematicians had a rule for how to predict the "loud voices" (the eigenvalues) in this crowd.

  • The Old Rule: They assumed the "Signal" was very simple. Imagine only 3 or 4 people were shouting specific messages, while everyone else was just random noise. This is called a "low-rank" perturbation. The math was easy: you could predict exactly where those 3 or 4 loud voices would end up.
  • The Real-World Problem: In real, modern AI networks, it's not just 3 people shouting. It's hundreds or thousands of people shouting, and the number of shouters grows as the stadium gets bigger. The "Signal" isn't a few spikes; it's a whole section of the crowd that is slightly different from the noise.

The old math broke down because it couldn't handle a signal that was "full rank" (everywhere) but still had a distinct structure.
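
For the low-rank case, the classical prediction in the simplest Wigner setting is the well-known spiked-matrix formula: a single spike of strength θ > 1 produces an outlier near θ + 1/θ. A quick numerical check, under the same illustrative assumptions as the sketch above:

```python
import numpy as np

# Classical low-rank ("spiked") rule in the simplest Wigner setting:
# one spike of strength theta > 1 yields an outlier near theta + 1/theta.
N = 2000
rng = np.random.default_rng(1)
A = rng.normal(size=(N, N)) / np.sqrt(N)
R = (A + A.T) / np.sqrt(2)

theta = 3.0
v = np.zeros(N)
v[0] = 1.0                               # a single "shouter"
W = R + theta * np.outer(v, v)

observed = np.linalg.eigvalsh(W)[-1]
print(f"observed: {observed:.3f}, predicted: {theta + 1/theta:.3f}")  # both ~ 3.333
```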

3. The Breakthrough: Mapping the Chaos

The authors of this paper (Afanasiev, Berlyand, and Kiyashko) developed a new way to map this chaos.

The Analogy of the "Magic Lens" (Φ):
Imagine you have a special pair of glasses (a mathematical function called Φ).

  • If you look at a specific "shouter" in the Signal (S) through these glasses, the glasses tell you exactly where that voice will appear in the final noisy crowd (W).
  • The Big Discovery: Even when there are thousands of shouters (not just a few), and even when the background noise is complex, this "Magic Lens" still works!
  • The paper proves that if you know where the signal voices are in the pure signal, you can calculate exactly where they will end up in the noisy matrix, provided the number of signal voices grows slowly enough compared to the total size of the crowd (see the sketch after this list).
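
In the simplest Wigner setting, the "Magic Lens" is the same map Φ(θ) = θ + 1/θ, applied voice by voice. The sketch below checks that the prediction keeps tracking the outliers when the number of shouters grows with N; the rank growing like √N is an illustrative choice, not the paper's precise growth condition.

```python
import numpy as np

# Sketch of the "lens" Phi(theta) = theta + 1/theta applied to a
# signal whose rank GROWS with N. The rank ~ sqrt(N) is an
# illustrative choice, not the paper's precise condition.
N = 2000
rng = np.random.default_rng(2)
A = rng.normal(size=(N, N)) / np.sqrt(N)
R = (A + A.T) / np.sqrt(2)

r = int(np.sqrt(N))                      # a growing number of "shouters"
thetas = np.linspace(2.0, 4.0, r)        # their strengths, all > 1
S = np.zeros((N, N))
S[np.arange(r), np.arange(r)] = thetas

observed = np.sort(np.linalg.eigvalsh(R + S)[-r:])   # the r largest eigenvalues
predicted = np.sort(thetas + 1 / thetas)             # Phi applied spike by spike

print("max |observed - Phi(theta)|:", np.abs(observed - predicted).max())
```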

4. Why This Matters for AI (Pruning)

Why do we care about this?

  • Pruning: To make AI faster and cheaper, engineers try to "prune" (cut out) the useless parts of the network. They look at the weight matrix and say, "These numbers are just noise; let's delete them."
  • The Risk: If you use the old math, you might accidentally cut out a "shouter" (a vital pattern) because you thought it was just noise. Or, you might keep noise thinking it's a signal.
  • The Solution: This new math gives engineers a more accurate map. It tells them which parts of the network are the "real signal" and which are "noise," even when the signal is complex and large (a generic sketch of this idea follows below). This helps create AI that is smaller, faster, and doesn't lose its brain power when you trim the fat.
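
Here is one common RMT-inspired pruning recipe, shown as a generic sketch rather than the paper's specific method: keep only the spectral components that stick out beyond the noise bulk edge and rebuild the matrix from them, treating everything inside the bulk as noise.

```python
import numpy as np

# Generic RMT-style pruning sketch -- NOT the paper's algorithm:
# keep only eigencomponents past the noise bulk edge and rebuild
# the matrix from them, discarding the bulk as noise.
N = 1000
rng = np.random.default_rng(3)
A = rng.normal(size=(N, N)) / np.sqrt(N)
R = (A + A.T) / np.sqrt(2)

S = np.zeros((N, N))
S[np.arange(5), np.arange(5)] = 3.0      # the "real signal"
W = R + S

vals, vecs = np.linalg.eigh(W)
keep = np.abs(vals) > 2.1                # bulk edge is 2; small safety buffer
W_pruned = (vecs[:, keep] * vals[keep]) @ vecs[:, keep].T

print("components kept:", int(keep.sum()))                                # 5
print("recovered signal strengths:", np.round(np.diag(W_pruned)[:5], 2))  # ~ 3 each
```

In a real network one would apply the same idea to weight matrices through their singular values, with the bulk edge estimated from the data rather than known in advance.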

Summary

Think of the paper as a new GPS for navigating a noisy city.

  • Old GPS: Only worked if there were a few famous landmarks (spikes) in a sea of fog.
  • New GPS: Works even if the "landmarks" are a whole neighborhood of distinct buildings, as long as you know the rules of the city.

The authors proved that even in a massive, complex, and noisy neural network, the "important" parts of the math can still be predicted with high precision. This bridges the gap between theoretical math and the messy reality of training real-world AI.
