Last-Iterate Convergence of Randomized Kaczmarz and SGD… — Plain-Language Explanation

This is an AI-generated explanation of the paper below. It is not written or endorsed by the authors. For technical accuracy, refer to the original paper.

Imagine you are trying to find the perfect spot to park your car in a massive, crowded parking lot. You can't see the whole lot at once; you can only look at one row at a time.

This is the problem Stochastic Gradient Descent (SGD) solves. It's the engine behind how computers learn from data (like training AI to recognize cats or solve math problems). Instead of looking at the entire dataset to make a move, it takes a tiny, random peek, makes a guess, and adjusts its position.

The specific algorithm this paper focuses on is called Randomized Kaczmarz. Think of it as a game of "Hot and Cold" where you are trying to solve a giant puzzle of linear equations. You pick one clue (equation) at random, adjust your guess to satisfy that clue, and then move on to the next random clue.
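The "pick one clue, adjust your guess" loop can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function names, the tiny example system, and the choice to sample rows in proportion to their squared norm (importance sampling) are assumptions made here for demonstration.

```python
import random

def randomized_kaczmarz(A, b, x0, num_steps, seed=0):
    """Run randomized Kaczmarz on A x = b and return the last iterate."""
    rng = random.Random(seed)
    # Importance sampling: row i is picked with probability ||a_i||^2 / ||A||_F^2.
    row_norms_sq = [sum(v * v for v in row) for row in A]
    x = list(x0)
    for _ in range(num_steps):
        i = rng.choices(range(len(A)), weights=row_norms_sq)[0]
        # How badly the current guess violates equation i.
        residual = sum(a * xj for a, xj in zip(A[i], x)) - b[i]
        scale = residual / row_norms_sq[i]
        # Project x onto the hyperplane {z : a_i . z = b_i}.
        x = [xj - scale * a for xj, a in zip(x, A[i])]
    return x

def residual_norm_sq(A, b, x):
    """Squared residual ||A x - b||^2."""
    return sum(
        (sum(a * xj for a, xj in zip(row, x)) - bi) ** 2
        for row, bi in zip(A, b)
    )

# Small consistent system built from a known solution (illustrative data).
A = [[2.0, 1.0], [1.0, 3.0], [1.0, -1.0]]
x_star = [1.0, -2.0]
b = [sum(a * xj for a, xj in zip(row, x_star)) for row in A]

x_last = randomized_kaczmarz(A, b, x0=[0.0, 0.0], num_steps=200)
print(residual_norm_sq(A, b, x_last))
```

The function returns only the final guess, with no averaging over earlier guesses, which is exactly the "last iterate" the rest of this explanation is about.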

The Big Question: "Last-Iterate" vs. "Average"

For decades, mathematicians have known that if you take the average of all your guesses over time, you will eventually get very close to the perfect answer. It's like saying, "If I take 1,000 guesses at the parking spot and average them out, I'll be right in the middle of the spot."

But in the real world, we don't want to wait until the end to average everything out. We want to know: Is the very last guess I made (the "last iterate") actually good?

Imagine you are walking down a hallway trying to find a specific door.

The Old Way (Average): You walk back and forth, leaving a trail of footprints. At the end, you measure the center of all your footprints. That's where the door is.
The New Way (Last Iterate): You want to know if the spot where you are standing right now (after your last step) is already close to the door.

For a long time, we didn't know if the "Last Step" was good enough when using a specific, aggressive strategy called the "Greedy Step Size." This is like taking the biggest possible step you can without overshooting the target. It's the most efficient way to move, but it's also the riskiest. The best previous guarantee only showed the last guess improving at a slow pace (like $1/\sqrt{t}$).

The Breakthrough: A Faster Finish Line

The authors of this paper, Michał Dereziński and Xiaoyu Dong, proved that you don't need to average your guesses. If you use this "Greedy Step Size," your very last guess is actually much better than we thought.

They showed that the error shrinks at a rate of roughly $1/t^{3/4}$.

To put that in perspective:

Under the old $1/\sqrt{t}$ guarantee, doubling the number of steps shrinks the error bound by a factor of about 1.41 (that is, $2^{1/2}$).
Under the new $1/t^{3/4}$ guarantee, doubling the number of steps shrinks it by a factor of about 1.68 (that is, $2^{3/4}$).

The Secret Sauce: The "Stochastic Contraction Process"

How did they figure this out? They invented a new way of looking at the problem, which they call a Stochastic Contraction Process.

The Analogy of the Stretchy Rubber Sheet:
Imagine your current guess is a point on a giant, stretchy rubber sheet. Every time you pick a random clue (equation), you pull the sheet in a specific direction to snap your point closer to the truth.

Sometimes you pull hard.
Sometimes you pull gently.
Sometimes the pull is in a direction that makes the point wobble a bit before settling.

The authors realized that instead of tracking the messy, random wobbles of the point, they could track the shape of the rubber sheet itself. They turned this chaotic, random process into a deterministic equation (a predictable math formula).

They found that the "wobbles" of the rubber sheet actually follow a hidden rhythm. Some parts of the sheet oscillate wildly (like a guitar string being plucked), while others move smoothly. By mathematically "unifying" these two behaviors, they could predict exactly how fast the point would settle down.

The "Discrete-to-Continuous" Magic Trick

The hardest part of their proof was bridging the gap between discrete steps (taking one step at a time, like counting 1, 2, 3) and continuous flow (like water flowing in a river).

Think of it like watching a movie.

Discrete: You see individual frames (1, 2, 3...).
Continuous: You see the smooth motion of the actor.

The authors developed a clever mathematical trick to turn their "frame-by-frame" analysis into a smooth "movie." They translated their problem into a Differential Equation (the math used to describe how things change smoothly over time, like the speed of a car). By solving this smooth equation, they could prove exactly how fast the "last step" converges.

Why Does This Matter?

It Tightens the Guarantee: For problems like solving massive systems of equations (used in engineering, physics, and AI), the faster rate means fewer iterations are needed before the last guess is provably accurate.
It's More Realistic: In real-world machine learning, we often use the "last guess" (the final model) rather than an average. For the quadratic, exactly solvable problems studied here, this paper shows that the "greedy" approach used in practice comes with a solid last-guess guarantee.
It Makes Progress on a Mystery: It addresses a question that has puzzled researchers for years: "How good is the last step of the Kaczmarz algorithm?" The answer is that it is provably better than the previously known $1/\sqrt{t}$ rate, though the exact best rate is still unknown.

Summary

The paper takes a classic, slightly chaotic method for solving math problems (Randomized Kaczmarz) and proves that if you take big, bold steps, the error of your final answer shrinks at a rate of roughly $1/t^{3/4}$, faster than the $1/\sqrt{t}$ previously guaranteed. They did this by inventing a new mathematical lens that turns random chaos into a predictable pattern, showing that the "last step" can be trusted on its own, without averaging.

1. Problem Statement

The paper addresses a fundamental open question in optimization theory: the last-iterate convergence rate of Stochastic Gradient Descent (SGD) with a greedy step size (specifically $\eta = 1/\beta$) in the interpolation regime for smooth quadratic functions.

Context: The interpolation regime assumes the existence of a common minimizer for all component functions (e.g., consistent linear systems). This setting is crucial for understanding modern over-parameterized deep learning models and classical solvers like the Randomized Kaczmarz (RK) algorithm.
The Gap: While the convergence of averaged iterates in this setting is well understood ($O(1/t)$), the behavior of the last iterate with the canonical "greedy" step size ($\eta = 1/\beta$) has been elusive.
- Previous work by Attia et al. (2025) established an $O(1/t^{1/2})$ rate.
- It was an open question whether this rate was optimal or if a faster rate could be achieved without shrinking the step size below $1/\beta$.
Specific Challenge: The greedy step size $\eta = 1/\beta$ is empirically the most effective but theoretically difficult to analyze because the contraction operators involved can range from zero to the identity matrix, lacking strict bounds away from these extremes.

2. Methodology

The authors introduce a novel framework based on Stochastic Contraction Processes to analyze the convergence.

A. Stochastic Contraction Process

They define a process $\{\Delta_t\}$ where $\Delta_{t+1} = (I - M_t)\Delta_t$. Here, the $M_t$ are independent random positive semidefinite (PSD) contraction matrices ($0 \preceq M_t \preceq I$) with a common mean $\bar{M} = \mathbb{E}[M_t]$.

This abstraction captures SGD with greedy step sizes and Randomized Kaczmarz without imposing restrictive bounds on $M_t$ (e.g., requiring $M_t \succeq cI$).
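To see concretely why Randomized Kaczmarz fits this abstraction, the sketch below writes one RK step in contraction form. It assumes a consistent system $Ax^* = b$ with rows $a_i^\top$, sampled with probability $p_i = \|a_i\|^2 / \|A\|_F^2$, and uses the convention $\|v\|_{M}^2 = v^\top M v$. The symbols $a_i$, $p_i$ and the identification $\Delta_t = x_t - x^*$ are notation introduced here for illustration and may differ from the paper's.

```latex
\begin{aligned}
x_{t+1} &= x_t - \frac{a_i^\top x_t - b_i}{\|a_i\|^2}\, a_i
  && \text{(RK step on row } i\text{)} \\
x_{t+1} - x^* &= \Bigl(I - \frac{a_i a_i^\top}{\|a_i\|^2}\Bigr)(x_t - x^*)
  && \text{(using } b_i = a_i^\top x^*\text{)} \\
M_t &= \frac{a_i a_i^\top}{\|a_i\|^2},
  \qquad 0 \preceq M_t \preceq I \\
\bar{M} &= \sum_i p_i\, \frac{a_i a_i^\top}{\|a_i\|^2}
  = \frac{A^\top A}{\|A\|_F^2},
  \qquad \|\Delta_t\|_{\bar{M}}^2 = \frac{\|A x_t - b\|^2}{\|A\|_F^2}
\end{aligned}
```

Under these assumptions each $M_t$ is a rank-one projection, with eigenvalue 1 along $a_i$ and 0 in every orthogonal direction, so in dimension two or higher no bound of the form $M_t \succeq cI$ with $c > 0$ can hold.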

B. Reduction to Deterministic Matrix Recursion

The core technical innovation is characterizing the expected norm of the stochastic process via a deterministic matrix recursion.

Lemma 10: They prove that $\mathbb{E}[\|\Delta_t\|_{\bar{M}}^2] \leq \|\Delta_0\|_{N_t}^2$, where $N_t$ follows the recursion:
$N_0 = \bar{M}, \quad N_{t+1} = N_t(I - 2\bar{M}) + \|N_t\| \cdot \bar{M}$
This reduces the stochastic problem to analyzing the spectral evolution of a deterministic sequence of matrices.

C. Spectral Analysis and Regimes

By analyzing the eigenvalues $\lambda_{k,t}$ of $N_t$, the authors identify two distinct regimes based on the eigenvalues $\rho_k$ of $\bar{M}$:

$\rho_k \leq 1/2$: The eigenvalues follow a smooth trajectory.
$\rho_k > 1/2$: The term $(1-2\rho_k)$ becomes negative, causing the eigenvalues to oscillate wildly between even and odd iterations.

The proof unifies these regimes by reducing the analysis to a single summation bound involving a parameter $\alpha$ .
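The two regimes can be observed numerically with a small sketch. It relies on assumptions that are not spelled out above: $\bar{M}$ is taken to be diagonal with eigenvalues $\rho_k \in (0, 1]$, and $\|N_t\|$ is read as the spectral norm. Since $N_0 = \bar{M}$ and the recursion only multiplies and adds $\bar{M}$, every $N_t$ is then diagonal too, and the matrix recursion reduces to the scalar update $\lambda_{k,t+1} = (1 - 2\rho_k)\lambda_{k,t} + \rho_k \max_j \lambda_{j,t}$. The function name and the sample spectrum are hypothetical.

```python
import math

def evolve_spectrum(rhos, num_steps):
    """Iterate the eigenvalues of N_t for a diagonal M-bar with eigenvalues rhos.

    Returns (norms, lams): the spectral norms ||N_t|| for t = 0..num_steps
    and the eigenvalues of the final matrix.
    """
    lams = list(rhos)          # N_0 = M-bar, so lambda_{k,0} = rho_k
    norms = [max(lams)]
    for _ in range(num_steps):
        top = max(lams)        # ||N_t||, since the eigenvalues stay nonnegative
        lams = [lam * (1.0 - 2.0 * rho) + top * rho
                for lam, rho in zip(lams, rhos)]
        norms.append(max(lams))
    return norms, lams

# Illustrative spectrum mixing both regimes (rho > 1/2 and rho <= 1/2).
rhos = [0.9, 0.5, 0.1, 0.01, 0.001]
norms, lams = evolve_spectrum(rhos, num_steps=2000)

# Local log-log slope of ||N_t|| between t = 1000 and t = 2000.
slope = math.log(norms[1000] / norms[2000]) / math.log(2.0)
print(f"||N_2000|| = {norms[2000]:.3e}, local decay exponent ~ {slope:.2f}")
```

The printed slope is only a local diagnostic for this particular spectrum. With finitely many eigenvalues bounded away from zero the decay eventually becomes geometric, so this sketch does not reproduce the worst-case $t^{-\alpha}$ behavior the proof is about.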

D. Discrete-to-Continuous Reduction (ODE Analysis)

The most delicate part of the proof (Section 4) involves bounding a specific summation:
$\rho(1-2\rho)^t + K\rho \sum_{i=1}^t \frac{(1-2\rho)^{t-i}}{i^\alpha} \leq \frac{K}{(t+2)^\alpha}$

To prove this, the authors perform a discrete-to-continuous reduction, approximating the sum with an integral.
This leads to the analysis of a function $L_\alpha(\theta)$ defined by an integral, which satisfies a specific Ordinary Differential Equation (ODE):
$L'_\alpha(\theta) = 1 - \left(2 - \frac{\alpha}{\theta}\right)L_\alpha(\theta)$
Using a "one-point criterion" based on the ODE properties, they establish the bound for $\alpha = 3/4 + 0.001$.
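As a sanity check on the ODE, one function that satisfies it can be written in integral form. This is a sketch under an assumption: the integral below is chosen because it is the natural continuous analogue of the sum (with $\theta \approx \rho t$ and $(1-2\rho)^{t-i} \approx e^{-2\rho(t-i)}$), and it may differ from the paper's exact definition of $L_\alpha$.

```latex
\begin{aligned}
L_\alpha(\theta)
  &= \int_0^\theta e^{-2(\theta - s)} \Bigl(\frac{\theta}{s}\Bigr)^{\alpha} ds
   = \theta^{\alpha} e^{-2\theta} \int_0^\theta e^{2s} s^{-\alpha}\, ds,
   \qquad 0 < \alpha < 1, \\
L'_\alpha(\theta)
  &= \Bigl(\frac{\alpha}{\theta} - 2\Bigr) L_\alpha(\theta)
   + \theta^{\alpha} e^{-2\theta} \cdot e^{2\theta} \theta^{-\alpha}
   = 1 - \Bigl(2 - \frac{\alpha}{\theta}\Bigr) L_\alpha(\theta).
\end{aligned}
```

The first term of the derivative comes from differentiating the prefactor $\theta^{\alpha} e^{-2\theta}$, and the second from differentiating the upper limit of the integral. The condition $\alpha < 1$ keeps the integral finite near $s = 0$.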

3. Key Contributions and Results

Main Theoretical Result

The paper proves that for SGD over $\beta$-smooth quadratics in the interpolation regime with step size $\eta = 1/\beta$:
$\mathbb{E}[f(x_t) - f(x^*)] = O\left(\frac{1}{t^{3/4 + \theta}}\right)$
where $\theta \geq 0.001$.

Improvement: This improves upon the previous best-known bound of $O(1/t^{1/2})$ by Attia et al. (2025).
Optimality Barrier: The authors show that their analysis framework hits a fundamental barrier around $3/4 + 0.003$. They construct a lower bound (Theorem 12) demonstrating that the exponent cannot be improved beyond $3/4 + 0.003$ within this specific matrix recursion framework.

Specific Algorithmic Implications

Randomized Kaczmarz (RK):
- For a linear system $Ax=b$, RK with importance sampling converges as:
  $\mathbb{E}[\|Ax_t - b\|^2] = O\left(\frac{\|A\|_F^2 \|x_0 - x^*\|^2}{t^{3/4+\theta}}\right)$
- This is the first worst-case, condition-number-free last-iterate guarantee for RK that improves on $O(1/t^{1/2})$.
Randomized Coordinate Descent (RCD):
- Similar $O(1/t^{3/4+\theta})$ rates are established for RCD on PSD systems.
Block Kaczmarz with Preprocessing:
- The authors show that Block Kaczmarz, when preprocessed with a Randomized Hadamard Transform (RHT), achieves a stronger bound.
- If the block size is proportional to the stable rank of $A$, the convergence rate becomes:
  $O\left(\frac{\|A\|^2 \|x_0 - x^*\|^2}{t^{3/4+\theta}}\right)$
- This replaces the Frobenius norm $\|A\|_F^2$ with the spectral norm $\|A\|^2$, matching the dependence of Full Gradient Descent up to the convergence exponent.
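For readers less familiar with Randomized Coordinate Descent, the sketch below shows the standard form of the method on a small symmetric positive definite system. It is an illustration only: sampling coordinate $i$ with probability proportional to the diagonal entry $A_{ii}$ is a common choice assumed here, and the function name and example data are hypothetical rather than taken from the paper.

```python
import random

def randomized_coordinate_descent(A, b, x0, num_steps, seed=0):
    """Run RCD on a symmetric PSD system A x = b and return the last iterate."""
    rng = random.Random(seed)
    n = len(A)
    diag = [A[i][i] for i in range(n)]  # sampling weights, assumed positive
    x = list(x0)
    for _ in range(num_steps):
        i = rng.choices(range(n), weights=diag)[0]
        # Partial derivative of 0.5 * x^T A x - b^T x with respect to x_i.
        grad_i = sum(A[i][j] * x[j] for j in range(n)) - b[i]
        # Exact minimization along coordinate i.
        x[i] -= grad_i / diag[i]
    return x

# Small positive definite system built from a known solution (illustrative data).
A = [[4.0, 1.0], [1.0, 3.0]]
x_star = [1.0, -1.0]
b = [sum(A[i][j] * x_star[j] for j in range(2)) for i in range(2)]

x_last = randomized_coordinate_descent(A, b, x0=[0.0, 0.0], num_steps=200)
print(x_last)
```

As with the Kaczmarz setting, the quantity of interest is the final iterate `x_last` itself rather than an average of the iterates.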

4. Significance and Impact

Closing the Theory-Practice Gap: The "greedy" step size ($\eta = 1/\beta$) is widely used in practice (e.g., in RK and deep learning) because it is empirically effective, yet theoretical guarantees were lagging. This work provides a rigorous justification for its use in the interpolation regime.
New Analytical Framework: The introduction of Stochastic Contraction Processes and the reduction to deterministic matrix recursions offers a powerful new tool for analyzing iterative solvers where standard averaging arguments fail.
Catastrophic Forgetting: The results have direct implications for Continual Learning. Since SGD with greedy step sizes is used to model continual learning in realizable settings, the improved convergence rates suggest better bounds on catastrophic forgetting in linear regression tasks.
Limitations and Future Directions: The paper identifies a "fundamental barrier" at $\approx 3/4 + 0.003$ for the current proof technique. Determining if the true optimal rate is $O(1/t)$, $O(1/t^{0.9})$, or something else remains an open problem, but this work significantly narrows the gap between the known lower bounds and the achievable upper bounds.

In summary, this paper provides a breakthrough in understanding the last-iterate behavior of classical and modern stochastic solvers, moving the theoretical convergence rate from $O(1/t^{1/2})$ to slightly better than $O(1/t^{3/4})$ for the most practically relevant step size choice.

Last-Iterate Convergence of Randomized Kaczmarz and SGD with Greedy Step Size