The Big Problem: The "Overloaded Backpack"
Imagine you are training a giant robot (a Large Language Model like Llama) to learn how to speak, write code, or solve math problems. To teach this robot, you use a "teacher" called an Optimizer (like Adam or Muon).
The teacher doesn't just look at the robot's current mistake; it remembers past mistakes to learn from them.
- First-order memory: "I made this specific mistake last time, so I'll adjust slightly."
- Second-order memory: "My mistakes here have been large and erratic, so I'll take smaller, more careful steps."
In technical terms, these memories are called optimizer (momentum) states. The problem is that for a giant robot with billions of parameters, each memory is as large as the robot itself: Adam keeps two of them, roughly tripling the GPU memory needed beyond the weights alone. You can't train the biggest models without buying a supercomputer. It's like trying to hike a mountain with a backpack so heavy it crushes you.
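In Adam's case, the two memories above are concretely two exponential moving averages (EMAs), each the same shape as the weights. A minimal sketch (bias correction omitted for clarity; not a production implementation):

```python
import numpy as np

def adam_step(param, grad, m, v, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One simplified Adam update (bias correction omitted)."""
    # First-order memory: EMA of past gradients ("which way have I been wrong?")
    m = beta1 * m + (1 - beta1) * grad
    # Second-order memory: EMA of squared gradients ("how big are my mistakes?")
    v = beta2 * v + (1 - beta2) * grad**2
    param = param - lr * m / (np.sqrt(v) + eps)
    return param, m, v
```

Note that `m` and `v` each have the same shape as `param`; that is exactly the "overloaded backpack".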
The Old Solution: "The Projector"
Previously, researchers tried to fix this by using Low-Rank methods (like GaLore).
- The Analogy: Imagine you have a high-definition movie (the full memory). To save space, you project it onto a small, low-resolution screen. You do this by calculating a "best fit" projection.
- The Flaw: This projection isn't perfect. Every time you project the movie, you lose a little bit of detail. To keep the picture from drifting too far, you have to stop, re-calculate the projection (an expensive SVD), and swap in the new screen every few hundred training steps. This "stop-and-go" process is slow and introduces projection errors, making the robot learn less efficiently.
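The stop-and-go pattern can be sketched in a few lines. This is a simplified illustration of the periodic-projection idea, not GaLore's exact algorithm, and the sizes are made up:

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, refresh_every = 64, 8, 200  # illustrative sizes, not the paper's settings

P = None
for step in range(400):
    grad = rng.standard_normal((d, d))   # stand-in for a real gradient matrix
    if step % refresh_every == 0:
        # Periodic "stop-and-go": recompute the best-fit rank-r projector via SVD
        U, _, _ = np.linalg.svd(grad)
        P = U[:, :r]                     # (d, r) projection matrix
    low_rank_grad = P.T @ grad           # project onto the small screen: (r, d)
    # ...optimizer states are kept at this reduced (r, d) size...
    back = P @ low_rank_grad             # project back up; detail is lost here
```

The reconstruction `back` is never exactly `grad`, and the projector `P` goes stale between refreshes; those are the two sources of the "blurry picture".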
The New Idea: LoRA-Pre (The "Smart Note-Taker")
The authors of the LoRA-Pre paper had a brilliant realization. They looked at the math behind how the teacher remembers things (Exponential Moving Average, or EMA) and noticed something surprising:
The teacher's memory update is really a simple regression problem in disguise.
The Core Insight: The "Secret Linear Regressor"
The paper proves that when the optimizer updates its memory, it is mathematically identical to a student trying to predict the next test question based on past questions.
- Old View: "I am updating a giant, complex memory matrix."
- New View: "I am training a tiny, simple linear model to guess what the next gradient (mistake) will look like."
Because it's just a prediction model, we don't need to store the entire history. We just need to store the rules the student is using to make predictions.
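One way to see the flavor of this insight in the simplest scalar case: a bias-corrected EMA is exactly the best constant predictor of the gradient history under exponentially decaying weights. The paper's result concerns the full matrix / linear-regressor setting; this toy check only illustrates the equivalence in one dimension:

```python
import numpy as np

beta = 0.9
grads = np.array([1.0, 2.0, 1.5, 3.0, 2.5])  # toy gradient history

# EMA as the optimizer computes it, then bias-corrected
m = 0.0
for g in grads:
    m = beta * m + (1 - beta) * g
m_ema = m / (1 - beta ** len(grads))

# The same value as the closed-form "best constant predictor" under
# exponentially decaying weights: argmin_m sum_i beta^(T-i) * (g_i - m)^2
T = len(grads)
w = beta ** (T - np.arange(1, T + 1))
m_regress = (w * grads).sum() / w.sum()

print(np.isclose(m_ema, m_regress))  # prints True: the two views agree
```

So storing the EMA is the same as storing the fitted parameters of a tiny prediction model, which is exactly what makes the compression below possible.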
The Solution: The "Sketchbook" vs. The "Photo Album"
Instead of keeping a massive photo album of every single mistake (which takes up gigabytes), LoRA-Pre keeps a tiny sketchbook.
- Decomposition: It breaks the giant memory down into two small, thin matrices (think of them as two thin sheets of paper).
- Continuous Learning: Instead of stopping to re-calculate the projection every few hundred steps (like the old method), LoRA-Pre updates these two sheets at every single training step.
- The Result: The sketchbook is so small it fits in your pocket, but because it updates instantly, it captures the "essence" of the giant photo album perfectly.
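The paper's actual factor-update rules are not reproduced here; the sketch below only illustrates the shape of the idea, assuming a simple per-step gradient nudge (a hypothetical update rule, chosen for clarity) that keeps two thin factors A and B tracking the full EMA target:

```python
import numpy as np

rng = np.random.default_rng(1)
d, r, beta, eta = 64, 8, 0.9, 0.5       # illustrative sizes and rates

A = rng.standard_normal((d, r)) * 0.01  # two thin "sheets of paper"
B = rng.standard_normal((r, d)) * 0.01

for step in range(100):
    g = rng.standard_normal((d, d))      # stand-in gradient
    target = beta * (A @ B) + (1 - beta) * g  # what a full EMA would store
    # Every step, take a small gradient step on ||A @ B - target||^2
    # with respect to A and B -- no periodic SVD, no stop-and-go:
    err = A @ B - target
    A, B = A - eta * err @ B.T, B - eta * A.T @ err
```

The state kept between steps is just A (d x r) and B (r x d): 2 * 64 * 8 = 1,024 numbers instead of 64 * 64 = 4,096 for the full matrix.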
Why is this better? (The "Rank Efficiency" Metaphor)
The paper introduces a concept called Rank. Think of "Rank" as the resolution of your sketchbook.
- Old methods (GaLore): To get a clear picture, you need a high-resolution sketchbook (Rank 256). If you try to use a low-resolution one (Rank 16), the picture is blurry, and the robot learns poorly.
- LoRA-Pre: Because it updates its sketchbook constantly rather than periodically, it can use a tiny, low-resolution sketchbook (Rank 16 or 32) and still see the picture clearly.
The Magic Stat: LoRA-Pre achieves the same (or better) results using 1/8th of the memory required by previous methods. It's like getting a 4K TV experience from a tiny, pocket-sized projector.
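The memory arithmetic behind that stat is easy to check. For one illustrative 4096 x 4096 weight matrix (a size assumed for this example, not taken from the paper), a rank-32 pair of factors is 8x smaller than a rank-256 pair:

```python
d = 4096                  # illustrative hidden size, not from the paper
full = d * d              # one full-size momentum matrix: ~16.8M numbers
rank_256 = 2 * d * 256    # two thin factors at the rank old methods need
rank_32 = 2 * d * 32      # two thin factors at the rank LoRA-Pre needs

print(full / rank_32)     # 64.0  -> vs. the full momentum matrix
print(rank_256 / rank_32) # 8.0   -> the "1/8th of the memory" figure
```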
Real-World Results
The researchers tested this on the Llama family of models (from tiny 60M parameter models up to 1B).
- Pre-training: When training from scratch, LoRA-Pre beat all other memory-saving methods. It learned faster and made fewer mistakes.
- Fine-tuning: When adapting a pre-trained model to a new task (like math), it crushed the competition, beating standard LoRA by a huge margin (e.g., +6 points on math tests).
Summary in One Sentence
LoRA-Pre realizes that an AI's memory is just a simple prediction game, allowing us to shrink the massive "backpack" of memory down to a tiny "notebook" that updates instantly, saving huge amounts of computer power without sacrificing intelligence.
Who is this for?
- Researchers: It's a new, faster way to train giant AI models on cheaper hardware.
- Developers: It means you can run better AI models on your own servers without needing a million-dollar data center.
- Everyone: It's a step toward making powerful AI more accessible and less energy-hungry.