The Big Picture: Finding the Flattest Valley
Imagine you are trying to find the best spot to set up a campsite in a vast, mountainous landscape. Your goal is to find a spot that is safe and stable.
- Gradient Descent (GD) is like a hiker who always walks straight down the steepest slope. They are efficient, but they can end up camped at the bottom of a narrow, sharp ravine, where even a small tremor (noise in the data) knocks the camp around.
- Sharpness-Aware Minimization (SAM) is a smarter hiker. Before taking a step, they look around in a small circle to see how "bumpy" the ground is. They prefer to set up camp in a wide, flat valley rather than a narrow, sharp ravine. This usually leads to better generalization (the camp stays safe even if the weather changes).
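In code, the "look around before stepping" idea is just two gradient calls per step: take a small step uphill (of size ρ, along the normalized gradient), then descend using the gradient measured at that perturbed point. Here is a minimal sketch on a toy loss with one sharp direction and one flat one; the loss, learning rate, and ρ below are illustrative choices, not anything from the paper:

```python
import numpy as np

# Toy loss f(w) = 0.5 * sum(curv * w**2): one sharp ravine, one flat valley.
# Constants (curv, lr, rho) are made up for illustration.

def grad(w, curv):
    return curv * w  # gradient of the toy quadratic loss

def sam_step(w, curv, lr=0.05, rho=0.05):
    g = grad(w, curv)
    eps = rho * g / (np.linalg.norm(g) + 1e-12)  # "look around": small uphill step
    g_adv = grad(w + eps, curv)                  # gradient at the perturbed point
    return w - lr * g_adv                        # descend using that gradient

curv = np.array([10.0, 0.1])  # sharp vs. flat direction
w = np.array([1.0, 1.0])
for _ in range(100):
    w = sam_step(w, curv)
loss = 0.5 * np.sum(curv * w**2)
```

Because the descent direction is evaluated at the worst nearby point rather than the current one, SAM is penalized for sitting anywhere that gets much worse a small step away, which is exactly the "avoid the sharp ravine" preference.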
This paper asks: Does this "smart hiker" (SAM) behave differently depending on how many layers of equipment (depth) they are carrying?
The Discovery: Depth Changes the Rules
The researchers found that for simple, one-layer models, SAM and the standard hiker (GD) end up at the same place. But once you add depth (more layers), SAM starts acting strangely. It develops a unique "personality" that depends heavily on how much gear it started with (initialization).
They discovered a phenomenon they call "Sequential Feature Amplification."
The Analogy: The "Minor First, Major Last" Strategy
Imagine you are a detective trying to solve a crime. You have a list of suspects (features). Some are obvious "Major Suspects" (loud, obvious clues), and some are "Minor Suspects" (quiet, subtle clues).
- The Standard Hiker (GD): Immediately ignores the quiet clues and focuses entirely on the loud, obvious Major Suspects. They solve the case by chasing the biggest lead.
- The Smart Hiker (SAM) with Medium Gear: This is where it gets weird.
- Phase 1 (Minor First): At the start of the investigation, SAM ignores the loud suspects. Instead, it obsessively focuses on the Minor Suspects (the quiet, subtle clues). It amplifies these tiny details, treating them as if they are the most important thing in the world.
- Phase 2 (The Shift): As the investigation continues (or if the hiker started with slightly more gear), it slowly realizes, "Oh, wait, the loud suspects actually matter more." It gradually shifts its attention from the minor clues to the major ones.
- Phase 3 (Major Last): Eventually, it settles on the Major Suspects, just like the standard hiker.
The Catch: If you only look at the end of the investigation (the final result), you might think SAM and GD did the same thing. But if you watch the process, SAM spent a long time obsessing over the wrong (minor) clues before correcting course.
Why Does This Happen? (The "Noise" Factor)
Why does SAM get distracted by the minor clues?
Think of the "perturbation" in SAM as a shaking hand.
- When the hiker is small (small initialization), the shaking hand is very sensitive.
- The math in the paper shows that this shaking hand accidentally magnifies the tiny, weak signals (minor features) much more than the strong ones in the early stages.
- It's like turning up the gain on a microphone to hear a whisper. At low volume (small initialization), the faint sounds (minor features) are the first thing the amplifier boosts. Only when the volume is much higher (larger initialization, or more training time) do the loud voices (major features) finally take over.
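One concrete way to see why the "small hiker" feels the shaking so strongly: SAM's perturbation always has fixed length ρ (it is the normalized gradient scaled by ρ), so its size *relative to the weights* is ρ/‖w‖, which blows up as the initialization shrinks. A quick back-of-the-envelope check (ρ and the scales here are arbitrary choices):

```python
import numpy as np

# SAM's ascent step has fixed norm rho, so its *relative* size is rho / ||w||:
# tiny weights get shaken violently, large weights barely notice.
rho = 0.05
ratios = {}
for scale in [0.01, 0.1, 1.0]:            # initialization scales (arbitrary)
    w = scale * np.ones(2)
    g = w                                 # any nonzero gradient direction works here
    eps = rho * g / np.linalg.norm(g)     # SAM's perturbation: norm is always rho
    ratios[scale] = np.linalg.norm(eps) / np.linalg.norm(w)
```

At scale 0.01 the perturbation is many times larger than the weights themselves, while at scale 1.0 it is a few percent; this is the "sensitive shaking hand" of the analogy in a single ratio.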
The Three Regimes (The "Gear" Settings)
The paper identifies three distinct behaviors based on how much "gear" (initialization scale) the model starts with:
- Too Little Gear (Regime 1): The hiker is so small and shaky that they get stuck in the mud and never make it anywhere; the model collapses to zero.
- Just Right Gear (Regime 2 - The Magic Zone): This is where the "Minor First, Major Last" magic happens. The model starts by amplifying the minor features, creating a long "plateau" where progress seems slow, before suddenly snapping to the correct solution.
- Too Much Gear (Regime 3): The hiker is so heavy and stable that they ignore the shaking hand entirely. They behave exactly like the standard hiker (GD) and go straight for the Major Suspects.
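A two-parameter toy can reproduce the two extreme regimes (it is too simple to show the middle "plateau" regime, which needs the paper's multi-feature setting). Here SAM trains f(w) = w[0]·w[1] toward a target of 1; all constants (ρ, learning rate, the two initialization scales) are made-up values for illustration, not the paper's thresholds:

```python
import numpy as np

# Toy two-layer model f(w) = w[0] * w[1] trained toward target 1.0 with SAM.
# A made-up illustration, not the paper's setting; rho, lr, and the
# initialization scales below are arbitrary choices.

def sam_train(init_scale, steps=300, lr=0.1, rho=0.05, target=1.0):
    w = np.array([init_scale, init_scale], dtype=float)
    for _ in range(steps):
        r = w[0] * w[1] - target                        # residual
        g = np.array([w[1] * r, w[0] * r])              # gradient of 0.5 * r**2
        norm = np.linalg.norm(g)
        if norm < 1e-12:                                # at a critical point, stop
            break
        w_adv = w + rho * g / norm                      # SAM ascent step
        r_adv = w_adv[0] * w_adv[1] - target
        g_adv = np.array([w_adv[1] * r_adv, w_adv[0] * r_adv])
        w -= lr * g_adv                                 # descend from perturbed point
    return abs(w[0] * w[1] - target)                    # final fitting error

collapsed = sam_train(init_scale=0.01)  # tiny "gear": perturbation overwhelms the weights
fitted = sam_train(init_scale=1.5)      # heavy "gear": behaves like plain GD and fits
```

With the tiny initialization, the fixed-size perturbation repeatedly overshoots past zero and the weights never escape its neighborhood (the fitting error stays near 1), while the large initialization barely notices the perturbation and fits the target, matching Regimes 1 and 3 in this toy.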
Real-World Proof: The "Background" Effect
To prove this isn't just math on paper, the researchers trained AI models on real images (like handwritten digits from MNIST).
- GD looked at the white digits (the bright, obvious parts).
- SAM (in the "Just Right" zone) looked at the black background (the dark, quiet parts).
It turns out SAM was paying attention to the "minor" background pixels first, treating them as the most important clues, before eventually focusing on the digits. This explains why SAM often generalizes better: by looking at the subtle background details, it learns a more robust understanding of the image, rather than just memorizing the bright shapes.
The Takeaway
"Don't judge a book by its cover (or its final destination)."
The paper teaches us that looking only at the final result of an AI model can be misleading. Even if two models end up at the same solution, the journey matters.
- Gradient Descent takes a direct path, heading straight for the major features.
- SAM takes a winding path, exploring the "minor" details of the data first.
This "Minor First, Major Last" behavior is a hidden superpower of SAM that only appears when the model is deep enough. It suggests that to truly understand how AI learns, we need to watch the training process in real-time, not just look at the final score.