Imagine you are trying to find the absolute lowest point in a vast, foggy valley (this is the "loss" in machine learning, representing how wrong your model is). Your goal is to get to the bottom as fast as possible.
For decades, the rule of thumb for finding this bottom was: "Take small, careful steps."
If you took a step that was too big, you might overshoot the bottom, bounce up the other side, and start oscillating wildly, never settling down. This is the "stability" rule. However, in modern AI, people noticed something weird: sometimes, taking huge, reckless steps actually gets you to the bottom faster, even if you wobble a bit at first. But nobody could explain why this worked without the math getting incredibly messy or the algorithm crashing.
This paper is like a new map that says: "You don't need to be reckless to be fast. You just need to know how to grow your steps wisely."
Here is the breakdown of their discovery using simple analogies:
1. The Old Way vs. The "Edge of Stability"
- The Old Way: You take tiny steps. It's safe, but it takes forever to reach the bottom.
- The "Edge of Stability" (The Reckless Way): Recent research found that if you take massive steps, you might jump over the bottom, land on the other side, and then bounce back and forth. Eventually, you settle down, but you had to endure a chaotic, unstable phase first. It's like a surfer trying to catch a wave by jumping off a cliff; it can work, but it's dangerous and hard to predict.
2. The New Discovery: The "Growing Stride"
The authors of this paper found a smarter way. They realized you don't need to jump off a cliff. Instead, you can start with a normal stride and slowly, steadily increase your step size as you go.
- The Analogy: Imagine you are walking down a hill.
- At the top, the ground is flat and slippery, so you take small steps.
- As you get lower, the hill gets steeper and the path becomes clearer.
- Instead of stopping to measure the slope every time (which is slow), you just decide: "Every time I take a step, I'll make my next step slightly longer."
- The Magic: Because of the specific shape of the "hill" they are studying (logistic regression with separable data), this simple rule of "growing steps" keeps you perfectly on track. You never wobble, you never crash, and you reach the bottom exponentially faster than the old "tiny step" method.
3. The Two Main Characters
Character A: Gradient Descent (The Solo Hiker)
This is the standard method where the hiker sees the whole path at once.
- The Problem: Previous fast methods required the hiker to jump wildly (unstable) or use complex, custom-made maps (adaptive step sizes).
- The Solution: The authors gave the hiker a simple, pre-written schedule: "Step 1: size X. Step 2: size X + a little bit. Step 3: size X + a little more."
- The Result: The hiker walks smoothly, never losing balance, but speeds up dramatically as they go. They proved mathematically that this works for any amount of time (it's "anytime"), meaning you don't need to know how long the hike will be beforehand.
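The pre-written schedule above can be sketched in a few lines of code. This is only a hypothetical illustration of the growing-step idea, not the paper's exact schedule: it runs plain gradient descent on the logistic loss over a tiny separable toy dataset, with a step size that grows linearly each iteration (eta_t = eta0 * (t + 1) is one simple choice of growing schedule).

```python
import numpy as np

def logistic_loss_grad(w, X, y):
    """Gradient of the average logistic loss; labels y are in {-1, +1}."""
    margins = y * (X @ w)
    # Each example is weighted by sigmoid(-margin): confident points
    # contribute almost nothing, so gradients shrink as margins grow.
    coeffs = -y / (1.0 + np.exp(margins))
    return (X.T @ coeffs) / len(y)

def gd_growing_steps(X, y, eta0=0.1, T=200):
    w = np.zeros(X.shape[1])
    for t in range(T):
        eta_t = eta0 * (t + 1)  # the step size grows every iteration
        w -= eta_t * logistic_loss_grad(w, X, y)
    return w

# Tiny linearly separable toy data (hypothetical, for illustration)
X = np.array([[2.0, 1.0], [1.5, 2.0], [-2.0, -1.0], [-1.0, -2.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w = gd_growing_steps(X, y)
print(np.all(y * (X @ w) > 0))  # prints True: every point correctly classified
```

The reason the growing schedule doesn't explode here is exactly the "shape of the hill": on separable data the logistic gradient decays exponentially as the margins grow, so it shrinks faster than the step size grows.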
Character B: Stochastic Gradient Descent (The Blindfolded Hiker)
This is the method used in real-world AI, where the hiker is blindfolded and can only see one small patch of the ground at a time (random data points).
- The Problem: Taking big steps when you are blindfolded is usually a recipe for disaster. You might step off a cliff.
- The Solution: The authors created a "smart blindfold." The hiker looks at the ground right in front of them. If the ground looks steep (high error), they take a big step. If it looks flat (low error), they take a smaller step.
- The Twist: They proved that even with this randomness, if the hiker follows this specific "look-and-adjust" rule, they will still reach the bottom exponentially fast.
- Why it's special: Previous methods required the hiker to stop and ask, "How close are we to the bottom?" (a technique called "line search"). This new method doesn't need to stop and ask; it just keeps moving based on what it sees, making it much faster and simpler.
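The "look-and-adjust" rule can be sketched the same way. This is a hypothetical illustration, not the paper's exact rule: each SGD step looks at one random data point and scales the step size by that point's current loss, so high error means a big step and low error means a small one, with no line search and no stopping to ask how close the bottom is.

```python
import numpy as np

rng = np.random.default_rng(0)

def sgd_adaptive(X, y, eta0=1.0, T=500):
    w = np.zeros(X.shape[1])
    n = len(y)
    for t in range(T):
        i = rng.integers(n)                  # glimpse one random data point
        margin = y[i] * (X[i] @ w)
        loss_i = np.log1p(np.exp(-margin))   # logistic loss on this point
        grad_i = -y[i] * X[i] / (1.0 + np.exp(margin))
        eta_t = eta0 * (1.0 + loss_i)        # steep ground => bigger step
        w -= eta_t * grad_i
    return w

# Same tiny separable toy data as before (hypothetical)
X = np.array([[2.0, 1.0], [1.5, 2.0], [-2.0, -1.0], [-1.0, -2.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w = sgd_adaptive(X, y)
print(np.all(y * (X @ w) > 0))  # prints True: every point correctly classified
```

The key design choice is that the step size depends only on what the hiker can currently see (one sample's loss), never on a global measurement of the whole landscape.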
4. Why This Matters
- No More Chaos: You don't need to rely on "unstable" phases where the AI acts crazy before it gets smart. You can get fast results while staying stable.
- Simplicity: The rules are simple. You don't need a supercomputer to calculate complex step sizes; you just need a simple formula that grows over time.
- Speed: It turns a slow, polynomial crawl (error shrinking like 1/t) into an exponential sprint (error shrinking like e^-t).
The Bottom Line
Think of this paper as discovering a new way to drive a car.
- Old Theory: Drive at a constant, safe speed to avoid crashing.
- Old "Fast" Theory: Floor the gas pedal, hope you don't crash, and pray you stabilize eventually.
- This Paper: "Press the gas pedal gently, but increase the pressure smoothly as the road straightens out. You will arrive at the destination faster than anyone else, and you won't even swerve."
They proved that instability is not a requirement for speed. With the right, simple rhythm, you can go fast and stay safe at the same time.