Imagine you are trying to find the deepest point in a vast, foggy valley (the "target distribution"). This valley represents a complex problem in machine learning, like training an AI to recognize cats or predicting stock prices. The deeper you go, the better your solution.
To find this bottom, you use a method called Langevin Monte Carlo. Think of this as a hiker trying to find the valley floor.
The Two Types of Hikers
- The Overdamped Hiker (OLD, overdamped Langevin dynamics): This hiker is very heavy and moves slowly. Every time they take a step, they are immediately stopped by thick mud (friction). They only look at the slope right under their feet and shuffle forward. They are safe and steady, but in a huge, high-dimensional valley (with thousands of dimensions), they get stuck or take forever to find the bottom. Their speed depends heavily on how wide the valley is (the dimension d).
- The Underdamped Hiker (ULD, underdamped Langevin dynamics): This hiker has momentum. When they step, they don't stop immediately; they glide a bit. They can bounce over small bumps and roll down slopes faster. In practice, this "momentum" hiker is much faster. However, for a long time, mathematicians couldn't prove exactly how fast it was in the worst-case scenarios, especially when the valley was massive.
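The two hikers can be sketched as two sampling loops. This is a minimal illustration, not the paper's algorithm: it assumes a simple quadratic potential (so the target distribution is a standard Gaussian), and the step size, friction value gamma, and the semi-implicit velocity-then-position update are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def grad_U(x):
    # Gradient of the toy potential U(x) = ||x||^2 / 2,
    # whose target distribution is a standard Gaussian.
    return x

def overdamped_step(x, h):
    # OLD: one Euler-Maruyama step; motion is gradient descent plus noise,
    # with no memory of previous steps ("thick mud").
    return x - h * grad_U(x) + np.sqrt(2 * h) * rng.standard_normal(x.shape)

def underdamped_step(x, v, h, gamma=2.0):
    # ULD: the hiker carries a velocity v (momentum) that is only
    # gradually damped by friction gamma.
    v = v - h * (gamma * v + grad_U(x)) \
        + np.sqrt(2 * gamma * h) * rng.standard_normal(v.shape)
    x = x + h * v
    return x, v

# Start both hikers far from the valley floor and let them mix.
x, v = np.full(2, 5.0), np.zeros(2)
y = np.full(2, 5.0)
for _ in range(2000):
    y = overdamped_step(y, 0.01)
    x, v = underdamped_step(x, v, 0.01)
# After mixing, both samples hover near the origin (the valley floor).
print(x, y)
```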
The Problem: The "Dimension Curse"
For years, the mathematical proof for how fast the Underdamped Hiker converges (finds the bottom) had a fatal flaw: it depended on the size of the valley.
The guaranteed number of steps grew with the dimension of the valley: manageable for a 10-dimensional valley, but astronomical for the 1,000,000-dimensional valleys common in modern AI. This made the math useless for real-world AI, because the "guarantee" became so large it was meaningless (a "vacuous bound").
It was like saying, "This car can drive 100 miles per hour, but only if the road is 1 inch wide. If the road is 100 miles wide, the car might take a billion years."
The Breakthrough: Ignoring the Width, Focusing on the Shape
This paper solves that problem. The authors, Zhang, Di, Li, and Gu, prove that the Underdamped Hiker's speed does not actually depend on the width of the valley (the dimension d). Instead, it depends on the shape of the hills inside the valley.
Here is the creative analogy:
- The Old Way (Dimension Dependent): Imagine trying to count every single grain of sand on a beach to know how long it will take to walk across it. If the beach is huge, the count is huge.
- The New Way (Dimension Independent): The authors realized you don't need to count every grain of sand. You only need to know the total weight of the sand (represented by a mathematical value called tr(H), the trace of the Hessian of the potential).
- If the beach is wide but the sand is very light (low "trace"), you can walk across it quickly.
- If the beach is narrow but the sand is incredibly heavy (high "trace"), it will be slow.
They proved that the hiker's speed is determined by the total weight of the terrain, not the number of dimensions. This is a massive improvement because in many real-world AI problems, the "weight" of the terrain is much smaller than the number of dimensions.
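The gap between "counting grains" and "weighing the sand" is easy to see numerically. In this sketch, the Hessian spectrum decaying like 1/k^2 is a hypothetical example, chosen only to show that the trace can stay tiny even when the dimension is enormous.

```python
import numpy as np

d = 1_000_000  # "width of the valley": the dimension
# Hypothetical Hessian eigenvalues decaying like 1/k^2:
eigvals = 1.0 / np.arange(1, d + 1) ** 2
trace_H = eigvals.sum()          # "total weight of the sand": tr(H)
worst_case = d * eigvals.max()   # old-style bound: dimension times largest curvature
# The trace stays below 2 while the dimension-dependent quantity is a million.
print(trace_H, worst_case)
```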
The Tools They Used
To prove this, they used two clever tricks:
- The "Smart Step" (Randomized Midpoint): Instead of just looking at the slope right under their feet, the hiker occasionally takes a "guess" at a random spot halfway through their step to see what the slope looks like there. This is like peeking ahead in the fog to get a better sense of the path. This technique (called Randomized Midpoint Discretization) makes the hiker even more efficient.
- The "Change of Clothes" (Change-of-Measure): In math, to compare two different paths, you sometimes have to pretend the hiker is wearing different shoes or walking on a different surface to make the comparison fair. The authors refined this technique so they didn't accidentally add "extra weight" (dimension dependence) to the calculation. They managed to keep the math "light" and independent of the valley's size.
Why This Matters
- Faster AI Training: This means we can theoretically guarantee that AI models will train faster and more reliably, even when they have millions of parameters (dimensions).
- Better Guarantees: Before this, we had to hope the Underdamped Hiker was fast. Now, we have a mathematical proof that says, "As long as the hills aren't too heavy, you will get to the bottom quickly, regardless of how wide the valley is."
- KL Divergence: They measured the success using a specific metric called "KL Divergence" (think of it as a measure of how "confused" the hiker is about where they are). They proved the hiker becomes less confused much faster than previously thought possible.
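For concreteness, KL divergence has a simple closed form when both distributions are one-dimensional Gaussians; this helper is a standard textbook formula, not anything specific to the paper.

```python
import numpy as np

def kl_gauss(mu1, s1, mu2, s2):
    # KL divergence KL(N(mu1, s1^2) || N(mu2, s2^2)): how "confused"
    # the first distribution is relative to the second. Zero means
    # the hiker's belief matches the target exactly.
    return np.log(s2 / s1) + (s1**2 + (mu1 - mu2)**2) / (2 * s2**2) - 0.5

print(kl_gauss(0.0, 1.0, 0.0, 1.0))  # identical distributions -> 0.0
print(kl_gauss(3.0, 1.0, 0.0, 1.0))  # far-apart means -> 4.5
```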
The Bottom Line
This paper is like discovering a new law of physics for hikers in high-dimensional valleys. It tells us that momentum (Underdamped Langevin) is not just a heuristic trick that works well in practice; it is mathematically proven to be incredibly efficient, provided the "weight" of the problem isn't too heavy. It removes the fear that high-dimensional problems are unsolvable, showing us that the complexity lies in the structure of the problem, not just its size.