Imagine you are trying to solve a massive puzzle where you have to find the right combination of ingredients (numbers) to bake a perfect cake (solve a math equation). The problem is, you have thousands of ingredients, but only a few are actually needed to make the cake taste right. You want to find the simplest recipe that works—using as few ingredients as possible. This is what mathematicians call finding a "sparse" solution.
This paper is about a specific, clever way of solving these puzzles, called Entropic Mirror Descent, and how to make it run faster and more reliably.
Here is the breakdown of the paper's ideas using everyday analogies:
1. The Problem: The "Infinite Shelf"
Usually, when you try to solve these math puzzles, you use a method called Gradient Descent. Imagine you are walking down a hill to find the lowest point (the best solution). Gradient Descent is like taking small steps downhill.
However, the method in this paper is different. It's like walking on a special, slippery surface (the "Entropy" surface) where your steps are multiplied rather than just added.
- The Catch: This surface is weird. It stretches out infinitely. If you take a step that is too big, you might fly off the edge of the world (the math breaks down).
- The Old Way: Previous researchers said, "To stay safe, you must take tiny, tiny steps, or you need to check your path constantly." This made the process very slow.
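The additive-versus-multiplicative contrast can be sketched in a few lines. This is a toy illustration, not the paper's algorithm; the point, the gradient, and the step size are made-up numbers:

```python
import numpy as np

x = np.array([0.5, 0.3, 0.2])     # current point (positive coordinates)
grad = np.array([1.0, -2.0, 0.5]) # gradient of the loss at x
eta = 0.1                          # step size (illustrative)

# Plain gradient descent: steps are ADDED. Coordinates can go negative.
gd_step = x - eta * grad

# Entropic mirror descent: steps are MULTIPLIED in. Coordinates stay
# positive, but a too-large eta makes exp() blow up -- the "flying off
# the edge" problem the text describes.
emd_step = x * np.exp(-eta * grad)
```

Note how the multiplicative update keeps every coordinate strictly positive, which is exactly why it lives on that "entropy surface" of positive vectors.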
2. The Solution: The "Smart Step" (Polyak's Stepsize)
The authors introduce a new rule for how big your steps should be, built on a classic idea known as Polyak's stepsize.
Think of it like driving a car when you always know exactly how far you are from your destination (in math terms, you know the best possible value of the objective, the "perfect cake" score, even though you don't yet know the recipe that achieves it).
- Old Rule: "Drive at a constant, slow speed just to be safe."
- New Rule (Polyak): "Look at how far you are from the destination. If you are far away, floor the gas! If you are close, gently tap the brakes."
The paper proves that if you use this "Smart Step" rule, you can zoom toward the solution without flying off the edge, even on that slippery, infinite surface. It's like having a GPS that automatically adjusts your speed based on how close you are to the goal.
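As a rough sketch of the rule itself: the Polyak stepsize divides the current gap to the optimal value by the squared gradient norm. The snippet below plugs it into plain gradient descent on a simple quadratic (not the paper's entropic variant; the test function and iteration budget are illustrative choices):

```python
import numpy as np

def polyak_gd(f, grad_f, f_star, x0, n_steps=100):
    """Gradient descent with the classic Polyak stepsize:
    eta_k = (f(x_k) - f*) / ||grad f(x_k)||^2.
    Far from the optimum the gap is large -> big steps ("floor the gas");
    near the optimum the gap shrinks -> tiny steps ("tap the brakes")."""
    x = x0.astype(float)
    for _ in range(n_steps):
        g = grad_f(x)
        gap = f(x) - f_star          # requires knowing the optimal value f*
        if gap <= 0:
            break
        eta = gap / (np.dot(g, g) + 1e-12)  # small constant guards div-by-zero
        x = x - eta * g
    return x

# Toy problem: f(x) = 0.5 * ||x||^2, whose minimum value is 0 at the origin.
f = lambda x: 0.5 * np.dot(x, x)
grad_f = lambda x: x
x_final = polyak_gd(f, grad_f, f_star=0.0, x0=np.array([4.0, -3.0]))
```

Notice the rule needs `f_star`, the optimal value. For "perfect cake" problems where a solution exactly fits the data, that value is simply zero, which is what makes the rule practical there.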
3. The Hidden Superpower: "Implicit Bias"
Here is the most interesting part. The authors discovered that this specific way of walking (Entropic Mirror Descent) has a personality.
- The Analogy: Imagine two people walking down a hill to find the lowest point.
- Person A (Standard Gradient Descent): Walks in a straight line. They end up at the lowest point, but they might be carrying a heavy backpack full of unnecessary stuff. (In math terms, gradient descent started from zero tends to settle on the minimum-norm solution, which spreads small weights across many coordinates.)
- Person B (Entropic Mirror Descent): Walks in a zig-zag pattern. Because of the way they walk, they naturally drop heavy items along the way. By the time they reach the bottom, they are carrying almost nothing.
In math terms, this "dropping heavy items" means the algorithm naturally finds the sparsest solution (the one with the fewest non-zero numbers). This is called Implicit Bias. The algorithm doesn't need to be told to be simple; it just becomes simple by the nature of how it moves.
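This bias can be seen in a tiny experiment. The sketch below is illustrative only (the system, step size, initialization, and iteration count are made-up choices, not the paper's setup): it runs the multiplicative update on a one-equation, two-unknown least-squares problem that has a sparse solution.

```python
import numpy as np

# Underdetermined system: one equation, two unknowns, x1 + 2*x2 = 2.
# The sparsest non-negative solution is x = (0, 1).
A = np.array([[1.0, 2.0]])
b = np.array([2.0])

# Entropic mirror descent (multiplicative updates) on 0.5*||Ax - b||^2,
# started from a small positive point.
x = np.array([0.01, 0.01])
eta = 0.1
for _ in range(2000):
    r = A @ x - b                  # residual
    grad = A.T @ r                 # gradient of the least-squares loss
    x = x * np.exp(-eta * grad)    # multiplicative step

# x lands near the sparse solution (0, 1), not the minimum-norm
# solution (0.4, 0.8) that plain gradient descent from zero would find.
```

The first coordinate is "dropped along the way": the smaller the initialization, the closer the final point gets to the exactly sparse answer, with no sparsity penalty ever written into the loss.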
4. The "Magic Trick" (Avoiding Exponentials)
The standard update requires evaluating the exponential function for every coordinate at every step, which is computationally expensive compared with plain multiplication and addition.
The authors also propose a workaround: a way to mimic the same multiplicative "walking style" using only simple multiplications and additions, with no exponentials.
- Analogy: It's like realizing you can get the same delicious cake taste by mixing ingredients in a bowl (simple math) instead of needing a high-tech molecular gastronomy machine (exponentiation). It's faster and easier, but still guarantees you get the right result.
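One standard way to dodge the exponential, which may or may not match the authors' exact construction, is a first-order surrogate: for small steps, exp(-t) is close to 1 - t, so each multiplicative update costs one multiply and one subtract per coordinate. A minimal sketch under that assumption:

```python
import numpy as np

def eg_step_exact(x, grad, eta):
    # Exact exponentiated update: one exp() call per coordinate.
    return x * np.exp(-eta * grad)

def eg_step_approx(x, grad, eta):
    # First-order surrogate: exp(-t) ~ 1 - t for small t.
    # Only multiplication and addition; requires eta * grad_i < 1
    # so every coordinate stays positive.
    return x * (1.0 - eta * grad)

# Illustrative numbers, not from the paper.
x = np.array([0.5, 0.3, 0.2])
g = np.array([1.0, -2.0, 0.5])
exact = eg_step_exact(x, g, eta=0.01)
approx = eg_step_approx(x, g, eta=0.01)
```

For small step sizes the two updates agree to within a term of order eta squared per step, which is the sense in which the "simple mixing bowl" reproduces the machine's result.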
5. The Results: Faster and Smarter
The paper backs up the theory with simulations (computer experiments).
- Speed: Their "Smart Step" method is significantly faster than the old "tiny steps" or "check constantly" methods.
- Reliability: It works even when the puzzle is very hard or the starting point is weird.
- Sparsity: It consistently finds the simplest solutions, which is exactly what we want in fields like AI and data science.
Summary
In short, this paper says:
"We found a way to walk down a slippery, infinite hill to find the simplest solution. Instead of taking tiny, cautious steps, we use a 'Smart Step' rule that speeds us up when we are far away and slows us down when we are close. This method naturally leads us to the simplest answer, and we even found a shortcut to do the math faster without losing accuracy."
This is a big deal for anyone building AI, because it means we can solve complex problems faster and get cleaner, simpler results.