The Big Picture: The "Puzzle" Problem
Imagine you have a giant jigsaw puzzle, but someone has hidden most of the pieces. You only see a few scattered pieces (the observed data). Your job is to guess what the whole picture looks like; in machine learning, this problem is called matrix completion.
In the world of AI, we use "neural networks" to solve this. These networks are like teams of workers trying to figure out the missing picture. The paper asks a very specific question: Does having a "deeper" team (more layers of workers) help them find a simpler, cleaner solution?
The answer is a resounding yes. The paper proves that deeper networks naturally prefer simple, low-rank solutions (like a picture with just a few basic shapes) over complex, messy ones, even if the data they see is sparse.
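To make the puzzle concrete, here is a minimal NumPy sketch of the matrix-completion setup: a simple ("low-rank") picture where most entries are hidden. The sizes and the number of observed entries are illustrative choices, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# A rank-1 "full picture": every entry is u[i] * v[j].
u = rng.standard_normal(5)
v = rng.standard_normal(5)
M = np.outer(u, v)                      # the complete puzzle (rank 1)

# Hide most of the pieces: keep only 6 of the 25 entries.
mask = np.zeros((5, 5), dtype=bool)
idx = rng.choice(25, size=6, replace=False)
mask.flat[idx] = True

observed = np.where(mask, M, np.nan)    # what the learner actually sees
n_hidden = int(np.isnan(observed).sum())
print(n_hidden)                         # 19 entries are missing
```

The learner's task is to fill in the `nan` entries so that the result is as simple (low-rank) as possible while matching the observed ones.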
1. The "Shallow" vs. "Deep" Team
To understand the discovery, let's look at two types of teams:
The Shallow Team (Depth 2): Imagine a team with just two layers of workers: the first layer hands its output straight to the second, with nothing in between.
- The Problem: If the puzzle pieces they see are scattered in a way that doesn't connect (like seeing only the top-left and bottom-right corners), the two layers act like two separate, isolated islands. They don't talk to each other about the missing middle.
- The Result: They often guess a messy, complex picture because they can't coordinate to find the simple pattern.
The Deep Team (Depth 3+): Imagine a team with three or more layers. The message has to pass through a middle layer of workers.
- The Magic: Even if the puzzle pieces are scattered and disconnected, the middle layer acts as a giant hub. Every worker in the middle layer is involved in calculating every part of the final picture.
- The Result: Because everyone in the middle is connected to everything else, the whole team is forced to "couple" their efforts. They naturally align to find the simplest possible solution that fits the data.
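In matrix-factorization terms, the shallow team writes the answer as a product of two factors and the deep team as a product of three, with the middle factor acting as the hub. The sketch below (toy sizes, random weights; my own illustration, not the paper's notation) shows why: nudging a single weight in the hub generically changes every entry of the output, because each output entry sums over the whole hub.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5

# Shallow team (depth 2): the prediction is a product of two factors.
A = rng.standard_normal((n, n))
B = rng.standard_normal((n, n))
X_shallow = A @ B

# Deep team (depth 3): a middle factor sits between the two.
H = rng.standard_normal((n, n))         # the "hub" layer
X_deep = A @ H @ B

# Nudge a single weight in the hub: every entry of the output changes,
# because output entry (i, j) sums A[i, p] * H[p, q] * B[q, j] over
# the entire hub.
H2 = H.copy()
H2[0, 0] += 1.0
changed = (A @ H2 @ B) != X_deep
print(changed.all())                    # True: the hub couples all outputs
```

In the shallow product `A @ B` there is no such shared middle factor, so disconnected blocks of observed entries can be fit by disconnected pieces of `A` and `B`.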
The Analogy:
Think of the Shallow Team as two people trying to build a house by only looking at the front door and the back door. They might build a weird, disjointed structure because they aren't talking about the walls in between.
Think of the Deep Team as a construction crew where every bricklayer is connected to a central scaffolding system. Even if they only see a few bricks, the scaffolding forces them to build a coherent, simple wall because they are all working on the same central structure.
2. The "Coupled" Dance
The paper introduces a concept called "Coupled Dynamics."
- Decoupled (Shallow): The workers move independently. One worker fixes the left side, another fixes the right side, and they never influence each other. This leads to a messy, high-rank solution (a complex, cluttered picture).
- Coupled (Deep): The workers are holding hands. If one moves, they all move. This "dance" forces them to synchronize. The paper proves that in deep networks, this coupling happens naturally, regardless of how the data is scattered. This synchronization is what pushes the network to find the Low-Rank solution (the simplest, most elegant picture).
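The coupled dance can be watched directly by running plain gradient descent on a depth-3 factorization. The following is a toy sketch under my own assumptions (small random initialization, hand-picked learning rate and step count, a random 30% mask), not the paper's experiment: the product ends up fitting the observed entries while its singular-value spectrum decays sharply, i.e. the solution is near low-rank.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 6

# Ground truth: a rank-1 matrix, observed on roughly 30% of entries.
u, v = rng.standard_normal(n), rng.standard_normal(n)
M = np.outer(u, v)
mask = rng.random((n, n)) < 0.3

# Depth-3 factorization X = W0 @ W1 @ W2 with small initialization.
W = [0.1 * rng.standard_normal((n, n)) for _ in range(3)]
init_loss = float(np.sum((mask * (W[0] @ W[1] @ W[2] - M)) ** 2))

lr = 0.05
for _ in range(4000):
    X = W[0] @ W[1] @ W[2]
    R = mask * (X - M)                  # residual on observed entries only
    # Gradients of 0.5 * ||mask * (X - M)||_F^2 w.r.t. each factor.
    g0 = R @ (W[1] @ W[2]).T
    g1 = W[0].T @ R @ W[2].T
    g2 = (W[0] @ W[1]).T @ R
    W[0] -= lr * g0
    W[1] -= lr * g1
    W[2] -= lr * g2

X = W[0] @ W[1] @ W[2]
final_loss = float(np.sum((mask * (X - M)) ** 2))
s = np.linalg.svd(X, compute_uv=False)
print(np.round(s / s[0], 3))            # later values should be much smaller than the first
```

Note that each factor's gradient contains the other factors, which is the coupling in equation form: no factor can move without feeling the others.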
3. The "Loss of Plasticity" (The Frozen Brain)
The second part of the paper tackles a phenomenon called "Loss of Plasticity." This is a fancy way of saying: "Once a neural network learns something, it gets stuck and can't learn new things well."
The Scenario:
Phase 1 (Pre-training): You train a network on a tiny, sparse dataset (like only seeing the corners of the puzzle).
- Shallow Network: Because it's shallow, it gets stuck in a "messy" state. It memorizes the corners in a complicated way.
- Deep Network: Because it's deep, it naturally finds a simple, low-rank solution even with the tiny data.
Phase 2 (Warm-start): Now, you give the network more data (the rest of the puzzle) and ask it to keep learning from where it left off.
- Shallow Network: It fails. It's like a student who memorized the corners of a map in a weird way. When you give them the rest of the map, they can't adjust their brain. They stay stuck in the messy, high-rank solution. They have lost their plasticity (flexibility).
- Deep Network: It succeeds. Because it started with a simple, low-rank solution, it has room to grow. When new data arrives, it can easily adjust its simple structure to fit the new pieces. It stays flexible.
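The two-phase scenario can be sketched in a few lines for the deep case. This is a toy illustration under my own assumptions (sizes, masks, learning rate, and step counts are made up, and only the deep network is shown): pre-train on a tiny mask, then reveal more entries and continue from the same weights.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 6
u, v = rng.standard_normal(n), rng.standard_normal(n)
M = np.outer(u, v)                      # the full rank-1 "puzzle"

def train(W, mask, steps, lr=0.05):
    """Gradient descent on 0.5 * ||mask * (W0 @ W1 @ W2 - M)||_F^2."""
    for _ in range(steps):
        X = W[0] @ W[1] @ W[2]
        R = mask * (X - M)
        W0, W1, W2 = W
        g = [R @ (W1 @ W2).T, W0.T @ R @ W2.T, (W0 @ W1).T @ R]
        for i in range(3):
            W[i] -= lr * g[i]
    return W

def masked_loss(W, mask):
    X = W[0] @ W[1] @ W[2]
    return float(np.sum((mask * (X - M)) ** 2))

W = [0.1 * rng.standard_normal((n, n)) for _ in range(3)]

# Phase 1 (pre-training): only a tiny set of entries is visible.
mask1 = rng.random((n, n)) < 0.2
W = train(W, mask1, 3000)

# Phase 2 (warm-start): reveal more entries and continue training
# from where the pre-trained weights left off.
mask2 = mask1 | (rng.random((n, n)) < 0.4)
loss_before = masked_loss(W, mask2)
W = train(W, mask2, 3000)
loss_after = masked_loss(W, mask2)
print(loss_before, loss_after)
```

Because the deep network's phase-1 solution is already low-rank, phase 2 only has to refine that simple structure; the paper's point is that a shallow network warm-started the same way can stay stuck at its messy phase-1 solution.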
The Analogy:
Imagine trying to learn a new dance.
- The Shallow Network learns the first few moves by flailing its arms wildly (high rank). When you teach it the rest of the dance, it can't stop flailing; it's stuck in that chaotic pattern.
- The Deep Network learns the first few moves by finding the core rhythm (low rank). When you teach it the rest, it easily adds new steps to that rhythm. It stays flexible.
4. Why Does This Matter?
This paper solves a mystery that has confused researchers for years: Why do deep neural networks generalize so well?
It turns out that depth isn't just about having more "brain power." It's about structure. Depth forces the network's internal parts to talk to each other (coupling). This internal conversation acts as a built-in "simplicity filter," pushing the network to ignore noise and find the simplest truth.
Furthermore, it explains why re-training (warm-starting) often fails for shallow models but works for deep ones. If you start with a messy, high-rank solution, you can't easily clean it up later. But if you start with a clean, low-rank solution, you can build upon it.
Summary in One Sentence
Deep neural networks are like a tightly knit team that naturally collaborates to find the simplest answer, whereas shallow networks are like isolated individuals who get stuck in messy habits and can't adapt when new information arrives.