The Big Question: Is the AI "Thinking" or Just "Guessing"?
Imagine you are trying to teach a robot to recognize cats. You use an algorithm called Stochastic Gradient Descent (SGD). Think of SGD as a blind hiker trying to find the lowest point in a massive, foggy mountain range (the "loss landscape"). The hiker takes small steps downhill based on the slope they feel under their feet.
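The hiker picture can be sketched in a few lines of code. This is a minimal, illustrative toy (a 1-D quadratic "valley", not any real model): each step uses a noisy estimate of the slope, which plays the role of the fog.

```python
import random

# A minimal sketch of SGD on a 1-D toy loss L(w) = (w - 3)^2.
# The Gaussian noise stands in for mini-batch noise: the hiker feels
# a noisy slope, not the true one. All numbers here are illustrative.

def noisy_gradient(w, noise_scale=1.0):
    true_grad = 2 * (w - 3.0)                        # exact slope of (w - 3)^2
    return true_grad + random.gauss(0, noise_scale)  # fog: mini-batch noise

def sgd(w0=0.0, lr=0.05, steps=2000):
    w = w0
    for _ in range(steps):
        w -= lr * noisy_gradient(w)  # small step downhill along the felt slope
    return w

random.seed(0)
w_final = sgd()
print(w_final)  # hovers near the minimum at w = 3, jittering with the noise
```

Even blindfolded, the hiker ends up circling the bottom of the valley; the interesting question is *how* they circle it when the valley is not a simple bowl.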
For a long time, scientists have wondered: Is this hiker just stumbling around randomly, or is there a hidden mathematical rule that makes them act like a super-smart Bayesian statistician?
A "Bayesian" approach is like a detective who considers every possible clue and calculates the exact probability of every scenario. The paper asks: Does our blind hiker (SGD) accidentally end up doing the same thing as the super-smart detective?
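The detective's approach can also be sketched on the same toy valley. This is a hedged illustration, not the paper's setup: we treat the loss as a negative log-likelihood and compute the exact posterior on a grid, assigning every candidate parameter a probability.

```python
import math

# The "detective": exact Bayesian inference on a grid for a toy model.
# Posterior weight ~ exp(-n * L(w)) under a flat prior, where n is the
# number of data points "seen". Purely illustrative numbers.

n = 50
grid = [i / 100 for i in range(-200, 801)]  # candidate parameters w in [-2, 8]

def loss(w):
    return (w - 3.0) ** 2  # the same valley the hiker walks

weights = [math.exp(-n * loss(w)) for w in grid]
total = sum(weights)
posterior = [wt / total for wt in weights]

# The posterior mass concentrates sharply around the true minimum:
w_mean = sum(w * p for w, p in zip(grid, posterior))
print(round(w_mean, 2))  # ≈ 3.0
```

The detective weighs *every* spot at once; the hiker only ever sees the ground underfoot. The paper's question is when these two end up agreeing.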
The Answer: It's Like Hiking Through a Porous Cave
The authors say: Yes, but with a twist. The hiker isn't walking on a smooth, flat plain. They are walking through a porous cave system (like a sponge or a coral reef).
Here is the breakdown of their discovery:
1. The Terrain is Weird (Singular Learning Theory)
In classical statistics, we assume the bottom of the mountain is a perfect bowl (a single, non-degenerate minimum). But in deep learning, the bottom is often a flat, degenerate valley.
- The Analogy: Imagine a valley that isn't just a bowl, but a vast, flat swamp. In some parts of the swamp, the ground is solid rock (easy to walk on). In other parts, it's deep, sticky mud (hard to move).
- The Science: The paper uses Singular Learning Theory to measure how "sticky" or "porous" different parts of the valley are. They call this the Local Learning Coefficient (LLC).
- Low LLC: A wide, open, easy-to-walk area (a "flat" minimum).
- High LLC: A narrow, tight, difficult-to-navigate area (a "sharp" minimum).
2. The Hiker Moves Like Diffusion in a Sponge
Usually, we think of random movement (diffusion) like a drop of ink spreading evenly in water. But in a sponge, the ink spreads differently depending on the holes.
- The Discovery: The hiker (SGD) doesn't spread out the way ink does in plain water. Their spread is anomalous: it follows a fractal-like pattern shaped by the landscape.
- The Metaphor: Imagine the hiker is a drop of water trying to soak through a sponge.
- If the sponge has big holes (low LLC), the water spreads fast.
- If the sponge has tiny, winding tunnels (high LLC), the water gets stuck and moves slowly.
- The paper proves that the hiker's movement is governed by the geometry of these holes.
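A toy simulation makes the sponge picture concrete. This is a hedged illustration under strong simplifying assumptions (it is not the paper's dynamics): the same noisy update rule spreads very differently depending on the local geometry, because in the degenerate valley the restoring pull vanishes along the valley floor.

```python
import random

# Hedged "ink in a sponge" simulation, purely illustrative.
# Same noise, same learning rate; only the loss geometry differs.

random.seed(2)
lr, noise, steps, walkers = 0.05, 0.3, 500, 200

def spread(grad):
    dists = []
    for _ in range(walkers):
        w1 = w2 = 0.0
        for _ in range(steps):
            g1, g2 = grad(w1, w2)
            w1 -= lr * g1 + noise * lr * random.gauss(0, 1)
            w2 -= lr * g2 + noise * lr * random.gauss(0, 1)
        dists.append((w1**2 + w2**2) ** 0.5)
    return sum(dists) / walkers  # average distance roamed from the start

bowl_spread = spread(lambda a, b: (2 * a, 2 * b))          # L = a^2 + b^2
valley_spread = spread(lambda a, b: (2*a*b*b, 2*a*a*b))    # L = (a*b)^2

print(bowl_spread, valley_spread)  # the walker roams much further in the valley
```

In the bowl, every direction pushes the walker back, so the spread saturates quickly; along the degenerate valley floor the gradient is nearly zero and the walker diffuses almost freely, like water finding the big holes in the sponge.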
3. The "Tempered" Bayesian Posterior
This is the most exciting part. The paper shows that after a long time, the hiker settles into a specific pattern.
- The Bayesian View: A perfect detective would visit every spot in the valley, but they would spend more time in the "best" spots (low loss) and less time in the "bad" spots.
- The SGD Reality: The hiker wants to visit the best spots, but they are physically constrained by the "porous" nature of the terrain.
- The Result: The hiker ends up in a distribution that matches the Bayesian detective's map, but "tempered" (reweighted) by how hard each region is to reach.
- If a great solution is in a deep, narrow cave (hard to reach), the hiker might not find it as often as the detective predicts.
- If a good solution is in a wide, open field (easy to reach), the hiker will find it very often.
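"Tempering" has a simple concrete reading: the same loss, sampled at an effective inverse temperature set by the dynamics instead of the ideal value of 1. The sketch below (illustrative names and numbers, not the paper's notation) shows how a "hotter" temperature widens the distribution over the toy valley.

```python
import math

# Hedged sketch of a tempered posterior: weight ~ exp(-beta * n * L(w)).
# beta = 1 is the ideal Bayesian detective; beta < 1 is a "hotter"
# sampler that spreads its time more widely. Illustrative setup only.

grid = [i / 100 for i in range(-200, 801)]

def loss(w):
    return (w - 3.0) ** 2

def tempered_posterior(beta, n=50):
    wts = [math.exp(-beta * n * loss(w)) for w in grid]
    z = sum(wts)
    return [x / z for x in wts]

def spread(post):
    mean = sum(w * p for w, p in zip(grid, post))
    var = sum((w - mean) ** 2 * p for w, p in zip(grid, post))
    return var ** 0.5

sd_cold = spread(tempered_posterior(1.0))  # the detective's map
sd_hot = spread(tempered_posterior(0.2))   # a tempered, "hotter" version
print(sd_cold, sd_hot)  # the hot sampler spends time over a wider region
```

Both distributions peak at the same best spot; tempering changes how much time is spent in the surrounding terrain, which is exactly the adjustment the porous geometry imposes on the hiker.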
Why Does This Matter? (The "So What?")
- It Explains Why AI Generalizes: It tells us that AI models don't just find any solution; they find solutions that are accessible. They get stuck in the "wide, flat valleys" because it's easier to walk there. These wide valleys happen to be the ones that generalize well (work well on new data).
- It Connects Two Worlds: It bridges the gap between Physics (how particles move through porous materials) and Statistics (Bayesian inference). It says: "The way the AI learns is physically determined by the shape of the math."
- It's Not Perfect (Yet): The paper admits this works best for standard SGD. If you use fancy, adaptive tools (like Adam), the "terrain" changes shape, and the rules get more complicated. But for the standard hiker, the map is now drawn.
Summary Analogy: The Treasure Hunt
Imagine you are looking for gold (the best AI model) in a giant, complex cave system.
- The Bayesian Detective has a perfect map and knows exactly where the gold is. They calculate the probability of finding gold in every cave.
- The SGD Hiker is blindfolded and just walks downhill.
- The Paper's Insight: Even though the hiker is blind, the shape of the cave (the porous geometry) forces them to spend the most time in the areas where the gold is likely to be. The hiker's path naturally mimics the detective's map, adjusted for the fact that some gold-filled caves are too narrow to enter.
In short: The paper proves that the "blind" process of training AI is actually a very structured, physics-driven journey that naturally leads to smart, generalizable solutions, provided you understand the "porous" nature of the mathematical landscape.