Subspace Geometry Governs Catastrophic Forgetting in Low-Rank Adaptation

Imagine you have a brilliant, well-read librarian (the AI model) who has read millions of books. Now, you want to teach this librarian a few new, specific skills without making them forget everything they already know.

This is the challenge of Continual Learning. If you teach them too much too fast, they might suffer "Catastrophic Forgetting": suddenly losing the ability to write a poem because they are now obsessed with coding.

To solve this, researchers use a technique called LoRA (Low-Rank Adaptation). Think of LoRA as giving the librarian a small, specialized notepad to write new notes on, rather than rewriting their entire library. The size of this notepad is called the Rank.

This paper asks a simple question: Does the size of the notepad matter for how much the librarian forgets?

The Big Discovery: It's About the "Angle," Not the Size

The authors found that the size of the notepad (the Rank) matters very little once the tasks are sufficiently different. Instead, what really determines forgetting is the geometric relationship between the new task and the old tasks.

Here is the core concept using a simple analogy:

The "Dance Floor" Analogy

Imagine the librarian's knowledge is a giant dance floor.

Task 1 (Old Knowledge): The librarian is dancing a Waltz.
Task 2 (New Knowledge): You want them to learn a new dance.

There are two scenarios:

The Similar Dance (Low Angle): If the new dance is a slightly different Waltz, the steps overlap heavily. The librarian has to overwrite their old muscle memory to learn the new steps. If you give them a small notepad, they might get confused and forget the old Waltz. If you give them a big notepad, they might get too confident and overwrite the old Waltz even faster. In this case, the size of the notepad matters a lot.
The Totally Different Dance (High Angle): If the new dance is Breakdancing, it has almost nothing in common with the Waltz. The "steps" (gradients) are at a 90-degree angle to each other. Because they are so different, learning to Breakdance doesn't mess up the Waltz at all.
- The Surprise: In this scenario, it doesn't matter if you give the librarian a tiny notepad or a giant one. They will remember the Waltz perfectly either way. The "forgetting" is near zero regardless of the size.

The "Magic Formula"

The authors discovered a mathematical law that predicts forgetting based on how "different" the tasks are. They call it the Geometric Forgetting Law:

Forgetting = Constant × (How different the dances are) + Background Noise

How different the dances are: This is measured by the "Principal Angle." If the angle is wide (dances are very different), forgetting is low. If the angle is narrow (dances are similar), forgetting is high.
The Size of the Notepad (Rank): The paper shows that once the dances are different enough (high angle), changing the size of the notepad has almost zero effect on forgetting.

Why This Matters in Real Life

The paper tested this on real AI models (like those that read text or look at images) and found:

You Don't Need Big Notepads for Diverse Tasks: If you are teaching an AI very different things (e.g., first teaching it to write code, then teaching it to diagnose medical images), you don't need a massive "adapter" to prevent forgetting. A small, efficient one works just as well. This saves money and computing power.
The "Orthogonal" Trick is Overkill: Some researchers try to force the AI to keep tasks separate by using special math tricks (like O-LoRA) to make the tasks "orthogonal" (at 90 degrees). The paper shows that if the tasks are already naturally different (like code vs. medicine), these fancy tricks don't help at all. You only need them if the tasks are very similar.
When Size Does Matter: If you are teaching the AI two very similar things (like two different dialects of the same language), then the size of the notepad does matter. You need to be careful with how you update the model.

The Bottom Line

The paper solves a mystery in AI research: Why do some studies say "bigger adapters are worse" while others say "size doesn't matter"?

The answer is: It depends on the angle.

Similar tasks? Size matters.
Different tasks? Size doesn't matter.

This gives engineers a clear rule of thumb: Don't waste resources making huge adapters for diverse tasks. Just check how "different" your new task is from the old ones, and you'll have a good estimate of how much forgetting to expect.

1. Problem Statement

The paper addresses the challenge of catastrophic forgetting in Continual Learning (CL) when using Low-Rank Adaptation (LoRA) for Large Language Models (LLMs) and vision models.

Context: LoRA is a parameter-efficient fine-tuning (PEFT) method that updates pre-trained weights via low-rank matrices ( $\Delta W = BA$ ). While effective, the theoretical understanding of why and how LoRA forgets previous tasks during sequential training is incomplete.
Gap: Previous empirical studies (e.g., Biderman et al., 2024) suggested that higher LoRA ranks lead to increased forgetting. However, the underlying geometric mechanisms driving this behavior, and the relationship between adapter rank and task similarity, remained unclear.
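As a minimal sketch of the low-rank update $\Delta W = BA$ described above, the NumPy snippet below builds one adapter and checks that its rank is capped by $r$. The layer shape, the rank value, and the random factors are illustrative assumptions, not settings from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical layer shape and LoRA rank, chosen only for illustration.
d_out, d_in, r = 64, 48, 4

W0 = rng.normal(size=(d_out, d_in))  # stand-in for a frozen pre-trained weight
# LoRA normally initialises B to zero; random values are used here only so
# that the rank of the product is visible.
B = rng.normal(size=(d_out, r))
A = rng.normal(size=(r, d_in))

delta_W = B @ A        # low-rank update, Delta W = B A
W = W0 + delta_W       # effective weight after adaptation

update_rank = np.linalg.matrix_rank(delta_W)  # can never exceed r
full_params = d_out * d_in                    # parameters in a dense update
lora_params = r * (d_out + d_in)              # parameters in B and A together

print(update_rank, full_params, lora_params)
```

The nominal rank $r$ bounds the rank of the update from above; how much of that capacity the update actually uses is a separate question.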

2. Methodology & Theoretical Framework

The authors propose a geometric theory characterizing forgetting through the lens of gradient subspace interactions.

Core Concepts

Gradient Subspaces: For each task $t$ , the gradient subspace $G_t$ is defined as the span of gradients $\nabla L_t(\theta)$ .
Principal Angles: The relationship between two task subspaces is quantified by the minimum principal angle ( $\theta_{min}$ ).
- $\theta_{min} \approx 0$ : Tasks are similar (subspaces are aligned).
- $\theta_{min} \approx \pi/2$ : Tasks are diverse/orthogonal.
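The minimum principal angle can be computed with a standard linear-algebra recipe: orthonormalise each gradient matrix, then take the singular values of the product of the two bases, which are the cosines of the principal angles. The sketch below assumes each task's gradients are stacked as the columns of a matrix with full column rank; the function name and the toy matrices are hypothetical and not taken from the paper.

```python
import numpy as np

def min_principal_angle(G1, G2):
    """Smallest principal angle (radians) between the column spans of G1 and G2.

    Assumes both matrices have full column rank.
    """
    Q1, _ = np.linalg.qr(G1)  # orthonormal basis for span(G1)
    Q2, _ = np.linalg.qr(G2)  # orthonormal basis for span(G2)
    # Singular values of Q1^T Q2 are the cosines of the principal angles.
    cosines = np.linalg.svd(Q1.T @ Q2, compute_uv=False)
    cosines = np.clip(cosines, -1.0, 1.0)  # guard against round-off
    # The largest cosine corresponds to the smallest angle.
    return float(np.arccos(cosines.max()))

# Toy subspaces in R^4 (illustrative only).
e = np.eye(4)
G_old = e[:, [0, 1]]    # old-task gradients span e1, e2
G_orth = e[:, [2, 3]]   # new task lives in the complementary directions
phi = 0.3
G_tilt = (np.cos(phi) * e[:, 0] + np.sin(phi) * e[:, 2]).reshape(-1, 1)

theta_orth = min_principal_angle(G_old, G_orth)  # orthogonal subspaces
theta_tilt = min_principal_angle(G_old, G_tilt)  # tilted by phi from e1
theta_same = min_principal_angle(G_old, G_old)   # identical subspaces
print(theta_orth, theta_tilt, theta_same)
```

If a gradient matrix may be rank-deficient, building the basis from an SVD with a tolerance would be a safer choice than plain QR.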

The Geometric Forgetting Law

The central theoretical contribution is an empirically parameterized bound on forgetting ( $F$ ):
$F = \alpha(1 - \cos^2 \theta_{min}) + \beta$
Where:

$1 - \cos^2 \theta_{min} = \sin^2 \theta_{min}$ represents the separation term (interference structure).
$\alpha$ is a scaling factor dependent on learning rate, loss smoothness, and update norms.
$\beta$ is a baseline forgetting term.
Key Insight: Forgetting is driven by the geometric separation of task subspaces, not merely the capacity of the adapter.
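Because the law is linear in the term $1 - \cos^2 \theta_{min}$ , the parameters $\alpha$ and $\beta$ can be estimated by ordinary least squares from measured (angle, forgetting) pairs. The sketch below is one possible way to do this; the function names are hypothetical, and the data points are synthetic placeholders generated from arbitrary values of $\alpha$ and $\beta$, not measurements from the paper.

```python
import math

def interference_term(theta_min):
    """The geometric term 1 - cos^2(theta_min), equal to sin^2(theta_min)."""
    return 1.0 - math.cos(theta_min) ** 2

def fit_forgetting_law(thetas, forgetting):
    """Least-squares estimate of (alpha, beta) in F = alpha * term + beta."""
    xs = [interference_term(t) for t in thetas]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(forgetting) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, forgetting))
    alpha = sxy / sxx
    beta = mean_y - alpha * mean_x
    return alpha, beta

def predict_forgetting(theta_min, alpha, beta):
    """Evaluate the fitted law at a new angle."""
    return alpha * interference_term(theta_min) + beta

# Synthetic placeholder data: arbitrary alpha and beta, no noise.
alpha_true, beta_true = 0.2, 0.05
thetas = [0.1, 0.4, 0.8, 1.2, 1.5]
observed = [predict_forgetting(t, alpha_true, beta_true) for t in thetas]

alpha_hat, beta_hat = fit_forgetting_law(thetas, observed)
print(alpha_hat, beta_hat)
```

With real measurements the fit would not be exact, and the residuals would indicate how well the functional form holds for a given task sequence.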

Rank-Angle Interaction Theory

The paper introduces a unified theory to reconcile conflicting literature:

Low Angle Regime (Similar Tasks): When $\theta_{min} \approx 0$ , the effective rank of updates scales with the nominal rank ( $r$ ). Here, higher ranks lead to more forgetting (consistent with prior work).
High Angle Regime (Diverse Tasks): When $\theta_{min}$ approaches $\pi/2$ (tasks are nearly orthogonal), the effective rank of the gradient updates saturates to a constant (empirically $\approx 1$ ), regardless of the nominal rank $r$ .
Result: In the high-angle regime, forgetting becomes approximately rank-invariant.
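The distinction between nominal and effective rank can be illustrated numerically. This summary does not state which definition of effective rank the paper uses, so the sketch below assumes one common choice: the exponential of the Shannon entropy of the normalised singular values. The matrix sizes and singular values are made up to show an update with nominal rank 8 whose effective rank is close to 1.

```python
import numpy as np

def effective_rank(M, eps=1e-12):
    """Entropy-based effective rank (an assumed definition, see lead-in)."""
    s = np.linalg.svd(M, compute_uv=False)
    p = s / (s.sum() + eps)   # normalise singular values to a distribution
    p = p[p > eps]            # drop numerically zero entries
    return float(np.exp(-(p * np.log(p)).sum()))

rng = np.random.default_rng(0)

# Build an update with prescribed singular values: one dominant direction
# and seven very weak ones (illustrative values only).
U, _ = np.linalg.qr(rng.normal(size=(32, 8)))
V, _ = np.linalg.qr(rng.normal(size=(24, 8)))
s = np.array([1.0] + [1e-4] * 7)
update = U @ np.diag(s) @ V.T

nominal_rank = np.linalg.matrix_rank(update)  # counts all 8 directions
eff_rank = effective_rank(update)             # dominated by one direction

flat_rank = effective_rank(np.eye(8))         # equal singular values
print(nominal_rank, eff_rank, flat_rank)
```

In this toy case, raising the nominal rank by adding more weak directions would barely move the effective rank, which is the kind of saturation the high-angle regime describes.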

3. Key Contributions

Geometric Forgetting Law: Proposed and validated the functional form $F = \alpha(1 - \cos^2 \theta_{min}) + \beta$ , enabling quantitative prediction of forgetting based on subspace geometry.
Approximate Rank-Invariance: Demonstrated that for diverse tasks (high subspace angles), forgetting is largely independent of the LoRA adapter rank.
- Synthetic data: Coefficient of Variation (CV) $\approx 0.8\%$ .
- Real benchmarks: CV $\approx 10\%$ to $19\%$ .
Unified Regime Characterization: Resolved contradictions in existing literature by showing that rank effects are regime-dependent: rank matters for similar tasks (low angles) but becomes negligible for diverse tasks (high angles).
Analysis of Orthogonal Methods: Showed that explicit orthogonalization methods (like O-LoRA) offer minimal benefit when natural task orthogonality is already high, as vanilla LoRA already achieves sufficient separation in these regimes.

4. Experimental Results

The theory was validated across three domains:

Synthetic Tasks:
- Generated tasks with controlled principal angles.
- Result: The interference term $(1 - \cos^2 \theta_{min})$ correlated strongly with forgetting (correlation coefficient $r = 0.994$ ; here $r$ denotes correlation, not LoRA rank).
- Rank Invariance: Varying rank from 1 to 32 resulted in a CV of 0.84%, confirming the theory.
Split-CIFAR100 (Vision):
- Used ViT-Base with LoRA (ranks 4, 8, 16).
- Result: CV of forgetting across ranks was 18.5%.
- Layer-wise Analysis: Resolved a negative aggregate correlation by showing that 6 out of 7 layers exhibited positive interference-forgetting correlation ( $r=0.525$ ). The aggregate negative correlation was due to confounding factors (tasks with similar representations were also easier to transfer).
Sequential GLUE (NLP):
- Used RoBERTa-base with LoRA on 5 sequential NLP tasks.
- Result: CV of forgetting was 9.9%, supporting approximate rank-invariance.
- Orthogonal Methods: O-LoRA showed no statistically significant improvement over vanilla LoRA ( $p=0.73$ ), confirming that natural orthogonality was already high.
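The rank-invariance results above are reported as a Coefficient of Variation (CV), the standard deviation of forgetting across ranks divided by its mean. The sketch below shows the calculation. It assumes the population standard deviation, since this summary does not say which estimator the paper uses, and the forgetting values are hypothetical placeholders rather than reported numbers.

```python
import statistics

def coefficient_of_variation(values):
    """CV = standard deviation / mean, using the population std (assumption)."""
    return statistics.pstdev(values) / statistics.fmean(values)

# Hypothetical forgetting scores measured at three LoRA ranks.
forgetting_by_rank = {4: 0.100, 8: 0.110, 16: 0.090}

cv = coefficient_of_variation(list(forgetting_by_rank.values()))
print(f"CV across ranks: {cv:.1%}")
```

A small CV means forgetting barely changes as the rank varies; a larger CV leaves more room for rank to matter.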

5. Significance and Practical Implications

Reconciling Literature: The paper explains why some studies find rank increases forgetting while others do not: it depends entirely on the diversity (angle) of the task sequence.
Adapter Sizing: Practitioners do not need to reduce LoRA rank to prevent forgetting in diverse task sequences; sufficient rank should be used for task performance without fear of increased forgetting.
Diagnostic Tool: Computing principal angles between accumulated gradient matrices can serve as a diagnostic to predict forgetting and guide intervention.
Method Selection: Explicit orthogonalization methods (O-LoRA) are only necessary when tasks are highly similar (low angles). For diverse sequences, they add computational overhead with diminishing returns.
Future Directions: The work suggests a path toward principled continual learning that relies on geometric properties of gradients rather than heuristic regularization.

Conclusion

The paper establishes that subspace geometry, specifically the minimum principal angle between task gradients, is the primary governing factor of catastrophic forgetting in LoRA. This leads to a surprising rank-invariance property in diverse learning scenarios, providing a theoretical foundation for optimizing parameter-efficient continual learning strategies.