The Big Problem: The "Goldfish" Brain
Imagine you are teaching a student (a neural network) to play different sports. First, you teach them soccer. They get really good. Then, you teach them basketball. Suddenly, they forget how to kick a ball and can't dribble either. This is called Catastrophic Forgetting.
In the world of AI, when a computer learns a new task, it often overwrites the old memories to make room for the new ones. It's like trying to write a new story on a piece of paper that already has a story written on it; if you don't have a special eraser or a new page, you just scribble over the old words, destroying them.
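You can see this "scribbling over the old story" with a toy experiment. The sketch below is not the paper's actual setup, just a minimal illustration: a single linear model is trained on one task ("soccer"), then on a second, conflicting task ("basketball") with plain gradient descent and no protection. The error on the first task, which had dropped to nearly zero, shoots back up.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two "sports": linear tasks that want different weights from one shared model.
X = rng.normal(size=(200, 5))
w_soccer = rng.normal(size=5)
w_basket = rng.normal(size=5)
y_soccer = X @ w_soccer
y_basket = X @ w_basket

def train(w, X, y, lr=0.01, steps=500):
    """Plain gradient descent on mean squared error."""
    for _ in range(steps):
        grad = 2 * X.T @ (X @ w - y) / len(X)
        w = w - lr * grad
    return w

def mse(w, X, y):
    return float(((X @ w - y) ** 2).mean())

w = np.zeros(5)
w = train(w, X, y_soccer)              # learn soccer first
err_soccer_before = mse(w, X, y_soccer)
w = train(w, X, y_basket)              # then learn basketball, nothing protects soccer
err_soccer_after = mse(w, X, y_soccer)

print(err_soccer_before)  # tiny: soccer was learned
print(err_soccer_after)   # much larger: soccer was overwritten
```

Nothing here is malicious: the second round of training simply moves the weights to wherever basketball wants them, and soccer's knowledge lived in those same weights.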
The Hidden Culprit: "The Crumpled Map"
For a long time, scientists thought forgetting happened because the computer's "brain" got confused by conflicting instructions (like trying to walk forward and backward at the same time).
But this paper argues that the real problem is structural collapse.
The Analogy:
Imagine your brain's ability to learn is like a map of a city.
- Healthy Learning: The map is huge, with thousands of streets, alleys, and highways. You can easily find a new route to a new destination without getting lost.
- Catastrophic Forgetting: As you learn more tasks, the map starts to crumple. All the streets get squished together into a tiny, flat ball. Now, there are no new roads to build. Every time you try to learn something new, you have to crush the old roads to make space. The map has "collapsed."
The paper introduces a tool called eRank (Effective Rank) to measure how "crumpled" the map is.
- High eRank: The map is wide open, full of diverse directions. The brain is flexible and plastic.
- Low eRank: The map is a tight ball. The brain is rigid and has lost its ability to adapt.
The Experiment: Testing Different "Schools"
The researchers tested four different types of AI "students" (architectures) to see how they handle this crumpling:
- MLP (The Basic Student): A simple, straight-line thinker. It crumples its map very quickly.
- ResNet-18 (The Veteran): A more complex student with "skip connections" (like secret shortcuts). It holds the map open a bit longer but eventually crumples it anyway.
- ConvGRU (The Time Traveler): A student that remembers the past using a "gated" memory. It keeps the map from crumpling too fast, but its map is smaller to begin with.
- Bi-ConvGRU (The Double Time Traveler): Like the Time Traveler, but it looks at both the past and the future. It's better at holding the map open, but still struggles with complex tasks.
They taught these students using three different study methods:
- SGD (The "Just Do It" Method): Just learn the new thing. Result: The map crumples instantly. Total forgetting.
- LwF - Learning Without Forgetting (The "Don't Change Your Personality" Method): The student is told, "Make sure you still sound like you did before." Result: The student sounds the same (good accuracy), but the internal map is still crumpling. They are faking it until they break.
- ER - Experience Replay (The "Flashcard" Method): The student keeps a small box of flashcards from old lessons and reviews them while learning new ones. Result: The map stays wide open! The student remembers everything.
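The "flashcard box" at the heart of Experience Replay is usually just a small fixed-size buffer. A minimal sketch (my own toy version, not the paper's code) using reservoir sampling, so every example ever seen has an equal chance of surviving in the box:

```python
import random

class ReplayBuffer:
    """A tiny 'flashcard box': keeps a fixed-size random sample of
    everything seen so far, via reservoir sampling."""
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.buffer = []
        self.seen = 0

    def add(self, example):
        self.seen += 1
        if len(self.buffer) < self.capacity:
            self.buffer.append(example)
        else:
            j = random.randrange(self.seen)
            if j < self.capacity:
                self.buffer[j] = example  # replace an old flashcard

    def sample(self, k: int):
        return random.sample(self.buffer, min(k, len(self.buffer)))

random.seed(0)
buf = ReplayBuffer(capacity=50)
for task in range(3):                      # three lessons in sequence
    for step in range(100):
        new_example = (task, step)
        replayed = buf.sample(8)           # review old flashcards...
        batch = [new_example] + replayed   # ...while learning the new thing
        # model.update(batch)              # hypothetical training step
        buf.add(new_example)

print({t for (t, _) in buf.buffer})  # flashcards from earlier tasks survive
```

The key move is in the loop: every training batch mixes the new example with a few old flashcards, so the old "roads" on the map keep getting used.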
The Big Discovery
The paper found a direct link: When the map crumples (low eRank), the student forgets.
- The "Just Do It" (SGD) method causes the map to crumple immediately.
- The "Flashcard" (ER) method is the only one that keeps the map wide and diverse. By constantly reviewing old flashcards, the student is forced to keep many different "roads" open, preventing the map from collapsing.
- The "Don't Change" (LwF) method is a bit of a trick. It keeps the output (the answers) correct, but the internal structure (the map) still collapses. This means the student might pass a test today but will fail a harder test tomorrow because they've lost the flexibility to learn new things.
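Why can LwF "fake it"? Its penalty is a distillation loss that only compares outputs: the new model's softened answers against the old model's. A rough numpy sketch of that LwF-style term (temperatures and details vary by implementation):

```python
import numpy as np

def softmax(z, T=2.0):
    """Temperature-softened softmax (higher T = softer answers)."""
    z = z / T
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def lwf_distillation_loss(old_logits, new_logits, T=2.0):
    """Cross-entropy between the old model's soft answers and the
    new model's: 'make sure you still sound like you did before'."""
    p_old = softmax(old_logits, T)
    p_new = softmax(new_logits, T)
    return float(-(p_old * np.log(p_new + 1e-12)).sum(axis=-1).mean())

rng = np.random.default_rng(0)
old_logits = rng.normal(size=(16, 10))

# The loss only ever sees logits. A student whose internal map has
# collapsed can still produce matching logits and pay no penalty.
loss_same = lwf_distillation_loss(old_logits, old_logits)
loss_diff = lwf_distillation_loss(old_logits, rng.normal(size=(16, 10)))
print(loss_same < loss_diff)  # True: matching outputs minimizes the loss
```

Notice that nothing in this loss touches the hidden features, which is exactly the paper's point: the answers are pinned down, but the map underneath is free to crumple.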
The Takeaway
To stop AI from forgetting, we can't just tell it to "remember the answers." We have to protect the structure of its brain.
Think of it like a gym. If you only lift weights for one specific muscle, that muscle gets huge, but the rest of your body atrophies. To stay fit (plastic), you need to work out the whole body.
- Experience Replay is like doing a full-body workout; it keeps the whole system flexible.
- Forgetting happens when the AI stops working out its "muscles" (feature directions) and just focuses on the immediate task, causing its internal world to shrink and collapse.
In short: Neural networks forget because their internal "maps" get too small to hold new information. The best way to fix this is to keep reviewing old lessons (Experience Replay) to keep the map big and diverse.