Imagine you are trying to find the perfect spot to set up a campfire in a vast, foggy forest. You want the spot to be safe (generalizable) and not too close to any trees (avoiding overfitting). You have a compass (the optimizer) that tells you which way is "downhill" toward the best spot.
For a long time, researchers thought everyone used the same compass: Gradient Descent. They knew this compass had a secret habit (an "implicit bias"): it tended to lead you to a spot that was not just safe, but specifically the safest spot when distance to the nearest tree is measured with a standard "straight-line" ruler (the Euclidean norm).
But in recent years, people started using fancier, more complex compasses like Adam and Muon. These are popular because they get to the campfire faster and handle tricky terrain better. However, nobody was sure if these fancy compasses were leading you to the same kind of safe spot, or if they had their own secret habits that might lead you to a different, potentially less safe, location.
This paper is like a detective story where the authors investigate the "personality" of these new compasses. They ask: When these fancy optimizers finish their journey, where do they actually end up, and why?
Here is the breakdown of their findings using simple analogies:
1. The "Smooth" Forest vs. The "Rough" Forest
The authors focus on a specific type of forest called Smooth Homogeneous Models.
- Homogeneous means the terrain scales predictably: if you multiply all the model's weights by a constant, the output scales by a fixed power of that constant. In map terms, if you double the size of your map, the terrain rescales in lockstep.
- Smooth means the ground is continuous with no kinks, like a gentle hill rather than jagged rocks (ReLU networks, with their sharp corners, are homogeneous but not smooth).
They proved that if you use a "steepest descent" compass (the basic template the fancy ones follow, each measuring distance with its own norm) on this smooth forest, it will always lead you to the spot that maximizes the margin.
- The Margin Analogy: Imagine the trees are obstacles. The "margin" is the width of the clear path you have between you and the nearest tree. Maximizing the margin means finding the path with the widest possible buffer zone. The wider the buffer, the safer you are if a tree suddenly grows a bit.
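To make the "buffer zone" concrete, here is a toy illustration of our own (not code from the paper): for a linear classifier, the margin is the smallest signed distance from any data point to the decision boundary, and dividing by the weight norm makes it scale-invariant, mirroring the normalization used for homogeneous models.

```python
import numpy as np

def normalized_margin(w, X, y):
    # y_i * (w . x_i) is positive when point i is correctly classified;
    # dividing by ||w|| makes the margin invariant to rescaling w.
    return np.min(y * (X @ w)) / np.linalg.norm(w)

X = np.array([[2.0, 1.0], [1.0, 3.0], [-1.0, -2.0]])
y = np.array([1.0, 1.0, -1.0])

w_narrow = np.array([1.0, 0.0])  # a separator with a thin buffer
w_wide = np.array([1.0, 1.0])    # a separator with a wider buffer
print(normalized_margin(w_narrow, X, y))  # smaller margin
print(normalized_margin(w_wide, X, y))    # larger margin: the "safer" spot
```

Both separators classify every point correctly; the margin tells them apart.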
2. The Secret Habits of the New Compasses
The paper reveals that the fancy compasses (Adam and Muon) aren't just random wanderers. They are actually "disguised" versions of the basic steepest descent compass, each measuring distance with a different ruler (a different norm).
Muon (The Spectral Walker):
- What it does: Muon is designed to handle large blocks of data (like matrices) very efficiently.
- Its Bias: It acts like a hiker who measures distance using the Spectral Norm: the largest amount a matrix stretches any single direction (its top singular value). Imagine a hiker who doesn't care about the total distance walked, but only about the single widest stride they ever took in any one direction.
- The Result: Muon leads you to the spot that maximizes the margin based on this "widest step" rule. It's like finding the path where your biggest single stride is as far from the trees as possible.
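A minimal sketch of the idea behind Muon's matrix update, using an exact SVD for clarity (an assumption of this sketch; in practice Muon approximates the orthogonalization with a cheaper Newton-Schulz iteration):

```python
import numpy as np

def spectral_norm(W):
    # Largest singular value: the biggest factor by which W stretches
    # any single direction -- the "widest step" in the analogy.
    return np.linalg.svd(W, compute_uv=False)[0]

def muon_direction(G):
    # Replace the gradient matrix by U V^T from its SVD: keep the
    # directions, flatten all singular-value magnitudes to 1.
    U, _, Vt = np.linalg.svd(G, full_matrices=False)
    return U @ Vt

G = np.array([[3.0, 0.0], [0.0, 0.1]])  # one strong direction, one weak
D = muon_direction(G)
print(spectral_norm(G))  # top singular value of G
print(D)                 # every direction now gets an equally sized step
```

Note how the weak 0.1 direction gets just as large a step as the strong 3.0 direction, which is exactly the spectral-geometry behavior described above.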
Adam (The "Sign" Walker):
- What it does: Adam is famous for adjusting its step size based on how steep the hill is.
- Its Bias: The authors found that when Adam runs without its "safety net" (a tiny constant usually added to prevent division by zero), it behaves almost exactly like Signum (Sign Gradient Descent).
- The Analogy: Imagine a hiker who only cares about the direction of the wind, not how hard it blows. If the wind pushes you left, they go left, regardless of whether it's a gentle breeze or a hurricane.
- The Result: This "direction-only" hiker maximizes the margin. In plain English, this means Adam finds the spot where the single most critical distance to a tree is maximized. It ignores the average distance and focuses entirely on the one "weakest link" or the most dangerous tree.
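A small demo of our own construction (not the paper's code) showing the reduction described above: with the epsilon safety net set to 0 and momentum switched off, a single Adam step divides the gradient by its own magnitude, leaving only its sign, which is exactly the Signum update.

```python
import numpy as np

def adam_step_no_eps(g, beta1=0.0, beta2=0.0, eps=0.0, m=0.0, v=0.0):
    m = beta1 * m + (1 - beta1) * g      # first-moment estimate
    v = beta2 * v + (1 - beta2) * g**2   # second-moment estimate
    return m / (np.sqrt(v) + eps)        # with eps=0, beta=0: g/|g| = sign(g)

for g in [0.001, -7.0, 42.0]:
    # Gentle breeze or hurricane, the step magnitude is identical.
    print(g, "->", adam_step_no_eps(np.array(g)))
```

With momentum turned back on, the correspondence is no longer exact at every step, which is why the paper's analysis of Adam's limiting behavior is the interesting part.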
3. The "Hybrid" Compasses
The paper also looked at Muon-Adam, a combination where you use Muon for the big weight matrices and Adam for the smaller parameters.
- The Result: This hybrid compass creates a "hybrid margin." It finds a spot that is safe according to both rules simultaneously. It's like finding a campsite that satisfies the "widest step" rule for the big trees and the "single most dangerous tree" rule for the small bushes.
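The routing rule can be sketched as follows; the helper names are hypothetical and the dispatch-by-shape rule is a simplification (real implementations route by parameter role, e.g. weight matrices vs. biases and norms):

```python
import numpy as np

def hybrid_step(param, grad, lr):
    if grad.ndim == 2:
        # Big weight matrix -> Muon-style orthogonalized step (exact SVD
        # here for clarity; Muon itself uses an approximation).
        U, _, Vt = np.linalg.svd(grad, full_matrices=False)
        return param - lr * (U @ Vt)
    # Everything else (vectors, scalars) -> Adam/Signum-style sign step.
    return param - lr * np.sign(grad)

W = np.zeros((2, 2)); gW = np.array([[3.0, 0.0], [0.0, 0.1]])
b = np.zeros(2);      gb = np.array([0.5, -4.0])
print(hybrid_step(W, gW, 0.1))  # spectral rule for the matrix
print(hybrid_step(b, gb, 0.1))  # sign rule for the vector
```

Each parameter group lives in its own geometry, which is why the resulting margin is a "hybrid" of the two.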
4. The "Decaying Learning Rate" Secret
A crucial part of the story is the Learning Rate.
- Imagine you are walking down a hill. At first, you take giant, confident strides. As you get closer to the bottom (the solution), you take smaller and smaller steps to avoid overshooting.
- The authors proved that this "slowing down" (decaying learning rate) is the magic ingredient. It forces these fancy compasses to eventually align with the steepest descent path, revealing their true bias. Without slowing down, they might just spin in circles or get stuck in a weird spot.
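The kind of schedule this requires can be sketched numerically; the choice of eta_t = 1/sqrt(t) below is our illustrative example (the paper's exact decay conditions may differ). The two properties that matter: individual steps shrink toward zero, so you stop overshooting, yet the steps sum without bound, so you never stall before reaching the bottom.

```python
import numpy as np

T = 100_000
etas = 1.0 / np.sqrt(np.arange(1, T + 1))  # eta_t = 1 / sqrt(t)

print(etas[-1])    # the last step is tiny: fine-grained near the solution
print(etas.sum())  # yet the total distance available keeps growing with T
```

A constant rate fails the first property (it keeps overshooting), while a too-fast decay like 1/t^2 fails the second (its sum converges, so the hiker can stall short of the destination).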
Why Does This Matter?
Think of Generalization as the ability of your AI to handle new, unseen data (like finding a campfire spot in a forest you've never visited before).
- The Old View: We thought all optimizers just wanted to maximize the "standard" safety margin.
- The New View: Different optimizers maximize different kinds of safety margins.
- If you use Adam, you are implicitly telling your model: "Prioritize the safety of the single most critical data point."
- If you use Muon, you are saying: "Prioritize the safety based on the largest structural feature of the data."
The Takeaway:
Choosing an optimizer isn't just about speed; it's about choosing which kind of safety you want your AI to prioritize. The paper gives us the map to understand exactly where each compass will lead us, allowing us to pick the right tool for the specific terrain of our problem.
In short: The optimizer you choose secretly decides the shape of the "safety zone" your AI learns to live in.