Random Scaling and Momentum for Non-smooth Non-convex Optimization

This paper shows that adding a simple random scaling factor to Stochastic Gradient Descent with Momentum (SGDM) yields optimal convergence guarantees for non-smooth, non-convex optimization problems. The result comes from a general framework that converts online convex optimization algorithms into non-convex solvers.

Qinzi Zhang, Ashok Cutkosky

Published 2026-03-17

Imagine you are trying to find the lowest point in a vast, foggy, and incredibly rugged mountain range. This mountain range represents the "loss function" of a neural network. Your goal is to get to the bottom (the best possible model) as quickly as possible.

In the past, scientists assumed this mountain was smooth, like a gentle hill. They had a perfect map and a reliable compass (algorithms like Stochastic Gradient Descent with Momentum, or SGDM) that worked great on smooth hills. But in modern AI, the terrain is actually jagged, full of cliffs, sharp rocks, and sudden drops (non-smooth and non-convex). The old compass often gets stuck or spins wildly because the rules of "smoothness" no longer apply.

This paper introduces a clever, slightly magical tweak to the compass that allows it to navigate this jagged terrain perfectly, without needing to change the fundamental way we walk.

Here is the breakdown of their solution using simple analogies:

1. The Problem: The "Jagged" Mountain

Most AI models today use parts like ReLU (which acts like a switch that turns off) or quantization (which snaps numbers to specific steps). These create "jagged" edges in the landscape.

  • The Old Way: Traditional math says, "If the ground is bumpy, you can't guarantee you'll find the bottom." You might get stuck in a tiny valley that isn't the lowest point.
  • The New Goal: Instead of demanding a perfect "flat" bottom, the authors ask for a "good enough" spot where the ground isn't too steep in any direction, even if the ground itself is bumpy. They call this a (c, ε)-stationary point. Think of it as finding a spot where, if you look around in a small circle, you aren't going to fall off a cliff.

2. The Secret Sauce: The "Magic Dice" (Random Scaling)

The authors found that the standard way of walking (SGDM) is actually almost perfect, but it needs one tiny, weird modification: Every time you take a step, roll a special die to decide how big that step is.

  • The Standard Step: You calculate a direction to walk and take a step of some fixed size.
  • The New Step: You calculate the direction, but before you walk, you roll a die that gives you a random number drawn from an exponential distribution with mean 1. If the die says 0.5, you take half a step; if it says 2.0, you take a double step. On average the die reads 1, so your typical step length is unchanged.
  • Why it works: It sounds chaotic, but this randomness is actually a superpower. In the jagged terrain, if you always take the exact same step size, you might get stuck on a sharp rock. By randomly varying the step size, you effectively "smooth out" the jagged rocks mathematically. It's like shaking a box of marbles to settle them into the lowest holes; the shaking (randomness) helps them find the bottom better than just rolling them slowly.
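To make the "magic die" concrete, here is a minimal sketch of one randomly scaled gradient step in NumPy. The function name `random_scaled_step` and the learning rate `lr` are illustrative choices, not names from the paper; the key idea is only that the step is multiplied by an Exp(1) draw.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_scaled_step(x, grad, lr=0.1):
    """One gradient step with random scaling: the learning rate is
    multiplied by a draw s ~ Exp(1), so E[s] = 1 and the *average*
    step matches plain gradient descent, but individual steps vary."""
    s = rng.exponential(scale=1.0)  # the "magic die": mean-1 exponential draw
    return x - s * lr * grad
```

Because the exponential distribution has mean 1, the randomness changes individual steps without changing the expected step, which is what lets the analysis "smooth out" the jagged landscape.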

3. The Framework: "Exponentiated O2NC"

The authors built a new "translation machine" called Exponentiated O2NC (O2NC stands for Online-to-Non-Convex Conversion).

  • The Analogy: Imagine you are a translator trying to convert a book written in "Online Learning" (a game where you play against an opponent who changes the rules every turn) into a book about "Mountain Climbing."
  • The Old Translator: The previous version of this machine was clunky. It required the climber to stay inside a tiny, safe bubble at all times and check their position constantly. It was slow and rigid.
  • The New Translator: This new machine is much freer. It lets the climber take big, bold steps when they are far from the bottom, and it uses the "Magic Dice" (random scaling) to handle the bumps. It translates the complex math of the game directly into the simple, familiar steps of the standard SGDM algorithm.

4. The Result: It's Just SGDM (With a Twist)

The most surprising part of the paper is that when they run this new "translation machine" with the simplest possible settings, it produces the exact same algorithm that engineers have been using for years (SGDM), with one tiny exception: the step size is multiplied by that random die roll.

  • Before: "SGDM is great for smooth hills, but we don't know why it works on jagged mountains."
  • Now: "SGDM works on jagged mountains because of this hidden randomness. If we make the randomness explicit, we can prove mathematically that it will reach a (c, ε)-stationary point, that 'good enough' spot from before."
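The bullets above can be sketched end to end: ordinary SGDM plus the one-line random-scaling twist, run on a deliberately non-smooth toy objective f(x) = |x|. The hyperparameters (`lr`, `beta`, `steps`) and the toy objective are illustrative assumptions for this sketch, not values from the paper.

```python
import numpy as np

def sgdm_random_scaling(grad_fn, x0, lr=0.01, beta=0.9, steps=1000, seed=0):
    """Standard SGDM, except each step is multiplied by an Exp(1) draw."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    m = np.zeros_like(x)
    for _ in range(steps):
        g = grad_fn(x)
        m = beta * m + (1 - beta) * g   # standard momentum (EMA of gradients)
        s = rng.exponential(scale=1.0)  # the one-line twist: random scaling
        x = x - s * lr * m
    return x

# Non-smooth, kinked objective: f(x) = |x|, with subgradient sign(x).
x_final = sgdm_random_scaling(lambda x: np.sign(x), x0=[2.0])
```

Note that removing the single `s = rng.exponential(...)` line recovers textbook SGDM exactly, which is the paper's point: the twist is tiny.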

5. The Proof: It's the Fastest Possible Way

The authors didn't just say "it works"; they proved it is the theoretical limit of speed.

  • They showed that no algorithm can find such a "good enough" spot on this jagged mountain faster than their method: its convergence rate matches the known lower bound.
  • They also proved that if the mountain happens to be smooth (like in older problems), their method automatically matches the standard, optimal speed for smooth hills. It adapts to the terrain without you having to change the settings.

Summary

The paper is like discovering that the "chaos" in your daily routine (randomly varying your step size) is actually the secret to navigating a difficult, rocky path.

They took the standard, popular tool used by AI engineers (SGDM), added a tiny, random "shake" to the steps, and proved that this tiny shake is the missing key that allows the tool to work perfectly on the messy, non-smooth problems that define modern AI. It turns a heuristic (a guess that works in practice) into a mathematically guaranteed solution.