A Proximal Stochastic Gradient Method with Adaptive Step Size and Variance Reduction for Convex Composite Optimization

Imagine you are trying to find the absolute lowest point in a vast, foggy valley (this is your optimization problem). You want to reach the bottom as quickly as possible to solve a complex data puzzle, like predicting house prices or diagnosing diseases.

The valley has two parts:

The Smooth Slope: A gentle, predictable hillside (this is the smooth function).
The Rocky Patch: A jagged, bumpy area with hidden traps (this is the non-smooth function, often used to keep the solution simple or sparse).

The paper introduces a new method called PSGA (Proximal Stochastic Gradient Algorithm with Adaptive Step Size and Variance Reduction) to help you navigate this valley. Here is how it works, explained through simple analogies:

1. The Problem with the Old Way (The "Random Walker")

Traditionally, people used a method called Stochastic Gradient Descent (SGD). Imagine you are blindfolded and trying to find the bottom of the valley. You take a step, feel the ground, and guess which way is down.

The Issue: Because you are only feeling a tiny patch of ground at a time (random sampling), your guess is often wrong. You might step left when you should have stepped right. This "noise" or variance makes you zigzag wildly, taking a very long time to reach the bottom.

2. The "Variance Reduction" Trick (The "Memory Keeper")

To fix the zigzagging, previous methods tried to remember the whole map.

The Old Fix: Some methods (like SAGA) carried a giant notebook recording the most recent slope (gradient) measured at every single data point.
The Problem: If the valley is the size of a continent (big data), carrying that notebook is impossible. It's too heavy and takes up too much memory.
The PSGA Solution: Instead of carrying the whole notebook, PSGA uses a clever "smart memory." It remembers a few key recent steps and uses them to correct your current guess. It's like having a GPS that updates your position based on your last few moves, rather than needing a map of the entire world. This keeps you moving straighter without the heavy baggage.

3. The "Adaptive Step Size" (The "Smart Pacer")

This is the paper's biggest innovation.

The Old Way: Imagine a runner who must decide their stride length before the race starts.
- If they pick a long stride, they might trip over a rock (diverge) if the ground gets tricky.
- If they pick a short stride, they will never finish the race because they are moving too slowly.
- Most old algorithms forced you to pick a fixed stride or slowly shrink it, which is inefficient.
The PSGA Way: PSGA is like a runner with smart shoes that adjust their stride in real-time.
- If the ground is smooth and safe: The shoes say, "Great! Take a big, confident step!" (Increasing the step size).
- If the ground is bumpy or you are wobbling: The shoes say, "Whoa, slow down! Take a smaller, safer step." (Decreasing the step size).
- This prevents the algorithm from crashing (diverging) while ensuring it doesn't crawl. It finds the "Goldilocks" stride instantly.

4. The Result: Faster and Smarter

The authors tested this new method on two famous challenges:

Logistic Regression: Like sorting emails into "Spam" or "Not Spam."
Lasso Regression: Like picking the most important ingredients for a recipe while ignoring the rest.

The Outcome:
In the authors' experiments, PSGA was like a Formula 1 car, while the other methods were like bicycles.

It reached the solution (the bottom of the valley) much faster.
It used less computer memory (it didn't need the giant notebook).
It handled the "rocky patches" (non-smooth parts) directly, through its proximal step.
Most importantly, the authors proved mathematically that even if the valley isn't perfectly shaped (not "strongly convex"), this method will still find the best solution without getting lost.

Summary

The paper presents a new algorithm that combines smart memory (to stop zigzagging) with self-adjusting steps (to run fast but stay safe). It allows computers to solve massive data problems much faster and more efficiently than before, without needing to store huge amounts of data in memory. It's the difference between stumbling through the fog and gliding straight to the finish line.

Here is a detailed technical summary of the paper "A Proximal Stochastic Gradient Method with Adaptive Step Size and Variance Reduction for Convex Composite Optimization" by Fang, Yang, and Chen.

1. Problem Formulation

The paper addresses the composite convex optimization problem, widely used in machine learning (e.g., Logistic and Lasso regression), signal processing, and statistical modeling. The objective is to minimize:
$\min_{x \in \mathbb{R}^n} F(x) = f(x) + r(x)$
where:

$f(x) = \mathbb{E}_{\xi \sim P}[\Lambda(x; \xi)]$ is a smooth convex function (often an expectation over a large dataset).
$r(x)$ is a non-smooth convex regularization term (e.g., the $\ell_1$-norm).
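To make the split into $f$ and $r$ concrete, here is a minimal sketch of one instance of this problem, a Lasso-style objective. The function names, the synthetic data, and the value of `lam` are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def smooth_part(x, A, b):
    """f(x): half the mean squared residual, a smooth convex term."""
    residual = A @ x - b
    return 0.5 * np.mean(residual ** 2)

def smooth_grad(x, A, b):
    """Full gradient of f; needs a pass over every row of A."""
    return A.T @ (A @ x - b) / A.shape[0]

def regularizer(x, lam):
    """r(x) = lam * ||x||_1: convex, but not differentiable at 0."""
    return lam * np.sum(np.abs(x))

def objective(x, A, b, lam):
    """F(x) = f(x) + r(x)."""
    return smooth_part(x, A, b) + regularizer(x, lam)

# Synthetic data, for illustration only.
rng = np.random.default_rng(0)
A = rng.standard_normal((200, 10))
b = rng.standard_normal(200)
x0 = np.zeros(10)
F0 = objective(x0, A, b, lam=0.1)
```

The full gradient in `smooth_grad` touches all rows of `A`, which is the cost that stochastic methods try to avoid on large datasets.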

Challenges:

Large-scale data: Computing the full gradient $\nabla f(x)$ is computationally expensive.
Variance: Standard Stochastic Gradient Descent (SGD) uses random sampling, leading to high variance and slow convergence.
Step Size Sensitivity: Existing variance-reduced methods (like ProxSVRG, SAGA) often require fixed or diminishing step sizes, or assume strong convexity, which limits their applicability to general convex problems.

2. Methodology: The PSGA Algorithm

The authors propose the Proximal Stochastic Gradient Algorithm (PSGA), which integrates variance reduction, proximal mapping, and an adaptive step-size strategy based on the Barzilai-Borwein (BB) method.

Key Components:

Variance Reduction:
Instead of computing full gradients every epoch (like SVRG) or storing all historical gradients (like SAGA), PSGA uses a recursive estimator. With probability $1/m$, it computes a fresh gradient; otherwise, it updates the gradient estimate using the difference between current and previous mini-batch gradients:
$\tilde{\nabla}f(x_k) = \begin{cases} \mu_k & \text{with prob } 1/m \\ \mu_k + (1-\theta_k)(\tilde{\nabla}f(x_{k-1}) - \nu_k) & \text{with prob } 1-1/m \end{cases}$
where $\mu_k$ and $\nu_k$ are mini-batch gradients at $x_k$ and $x_{k-1}$, respectively.
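The two-case estimator above can be sketched as a single function. This is only an illustration of the formula as summarized here: the function name and argument layout are assumptions, and how the paper chooses $\theta_k$, $m$, and the mini-batches is not covered.

```python
import numpy as np

def recursive_estimate(mu_k, nu_k, prev_estimate, theta_k, m, rng):
    """One update of the recursive gradient estimate.

    mu_k, nu_k    : mini-batch gradients at x_k and x_{k-1}
    prev_estimate : the estimate that was used at x_{k-1}
    theta_k       : mixing weight, taken here as a given scalar
    m             : a fresh gradient is used with probability 1/m
    rng           : a numpy random Generator
    """
    if rng.random() < 1.0 / m:
        # Restart branch: discard the history and use the fresh gradient.
        return mu_k.copy()
    # Recursive branch: correct the fresh gradient with the previous error.
    return mu_k + (1.0 - theta_k) * (prev_estimate - nu_k)

# Illustrative call with made-up vectors.
rng = np.random.default_rng(0)
mu = np.array([1.0, 2.0])
nu = np.array([0.5, 0.5])
prev = np.array([3.0, 3.0])
estimate = recursive_estimate(mu, nu, prev, theta_k=0.25, m=5, rng=rng)
```

Only the previous estimate is kept between iterations, so the memory cost is one vector of dimension $n$ rather than a table with one gradient per sample.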
Adaptive Step Size (BB-based):
The paper introduces a novel adaptive step size $\eta_k$ derived from the BB2 step size (short step size) but with a stabilization mechanism to prevent divergence in general convex settings.
- It calculates a ratio $\tau_k = \frac{\langle \mu_k - \nu_k, x_k - x_{k-1} \rangle}{\|\mu_k - \nu_k\|^2}$.
- Update Rule:
  - If $\tau_k \ge \eta_{k-1}$: Increase step size ($\eta_k = (1 + 1/\tau_k)\eta_{k-1}$).
  - If $\eta_{k-1}/2 < \tau_k < \eta_{k-1}$: Set $\eta_k = \tau_k$.
  - If $\tau_k \le \eta_{k-1}/2$: Decrease step size ($\eta_k = \eta_{k-1}/\sqrt{2}$).
- This strategy avoids the need for line searches and prevents the step size from becoming too aggressive (divergence) or too small (slow convergence).
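The three-case rule above translates almost line for line into code. The sketch below follows the rule exactly as summarized here; the function names are illustrative, and the handling of a zero denominator is an assumption that the summary does not address.

```python
import numpy as np

def bb_ratio(mu_k, nu_k, x_k, x_prev):
    """tau_k = <mu_k - nu_k, x_k - x_prev> / ||mu_k - nu_k||^2."""
    diff_g = mu_k - nu_k
    denom = float(diff_g @ diff_g)
    if denom == 0.0:
        # Assumption (not from the source): with identical gradients the
        # ratio is undefined; returning inf leaves the step size unchanged.
        return float("inf")
    return float(diff_g @ (x_k - x_prev)) / denom

def next_step_size(tau_k, eta_prev):
    """Apply the three-case update to get eta_k from eta_{k-1}."""
    if tau_k >= eta_prev:
        return (1.0 + 1.0 / tau_k) * eta_prev   # increase
    if tau_k > eta_prev / 2.0:
        return tau_k                            # accept the BB ratio
    return eta_prev / np.sqrt(2.0)              # decrease

# For f(x) = (c/2)||x||^2 the gradient is c*x, so tau_k equals 1/c.
c = 4.0
x_prev, x_k = np.zeros(2), np.array([1.0, 2.0])
tau = bb_ratio(c * x_k, c * x_prev, x_k, x_prev)
eta = next_step_size(tau, eta_prev=1.0)
```

In the quadratic example the ratio is 0.25, which is below half of the previous step size 1.0, so the rule shrinks the step by a factor of $\sqrt{2}$ instead of jumping straight to 0.25.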
Proximal Update:
The algorithm performs a proximal update:
$y_k = \text{prox}_{\eta_k D} (x_k - \eta_k \tilde{\nabla}f(x_k))$
$x_{k+1} = x_k + \delta_k \theta_k (y_k - x_k)$
where $D$ is a surrogate function for $r(x)$.
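As an illustration of the two update lines above, the sketch below assumes the simplest case $D(x) = \lambda \|x\|_1$, for which the proximal operator is the well-known soft-thresholding map. The summary does not say how the paper defines $D$, $\delta_k$, or $\theta_k$, so they appear here as plain inputs, and the function names are hypothetical.

```python
import numpy as np

def soft_threshold(v, t):
    """prox of t * ||.||_1: shrink each entry toward zero by t."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def proximal_step(x_k, grad_est, eta_k, lam, delta_k, theta_k):
    """Compute y_k and x_{k+1}, assuming D(x) = lam * ||x||_1."""
    # y_k = prox_{eta_k D}(x_k - eta_k * grad_est)
    y_k = soft_threshold(x_k - eta_k * grad_est, eta_k * lam)
    # x_{k+1} = x_k + delta_k * theta_k * (y_k - x_k)
    x_next = x_k + delta_k * theta_k * (y_k - x_k)
    return y_k, x_next

# Illustrative call with made-up values.
x = np.array([1.0, -1.0, 0.2])
g = np.zeros(3)
y, x_new = proximal_step(x, g, eta_k=0.5, lam=1.0, delta_k=1.0, theta_k=0.5)
```

Entries whose magnitude falls below the threshold `eta_k * lam` are set exactly to zero in `y`, which is how an $\ell_1$ term produces sparse solutions.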

3. Key Contributions

The authors make four primary theoretical and practical contributions:

Relaxed Convexity Assumption: Unlike previous works (e.g., S-PStorm, SVRG-BB) that require the objective function to be strongly convex, PSGA only requires convexity. This significantly broadens the applicability of the method.
Adaptive Step Size without Storage: The method avoids the high memory cost of SAGA (which stores one gradient per sample, i.e. $N \times n$ values) and the cost of the full gradient that SVRG computes every epoch. The step size is adaptive, removing the need for manual tuning or fixed schedules.
Strong Convergence and Error Analysis:
- Proves that the gradient estimation error converges to zero almost surely ($\lim_{k \to \infty} \|\tilde{\nabla}f(x_k) - \nabla f(x_k)\| = 0$ a.s.).
- Establishes strong convergence of the sequence $\{x_k\}$ to an optimal point.
Improved Convergence Rate:
- Achieves a convergence rate of $O(\sqrt{1/k})$ for the expected distance to the optimal set.
- This improves upon the $O(\sqrt{\log k / k})$ rate of the S-PStorm method [12].
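To see how large the gap between the two rates is, one can take the ratio of the bounds. This ignores the constants hidden in the $O(\cdot)$ notation and assumes a natural logarithm, so it is a rough reading rather than a statement from the paper:

$\frac{\sqrt{\log k / k}}{\sqrt{1/k}} = \sqrt{\log k}, \qquad \text{e.g. } \sqrt{\log 10^6} \approx 3.7 \text{ at } k = 10^6$

The improvement therefore grows with $k$, but only slowly.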

4. Experimental Results

The authors validated PSGA on Logistic Regression and Lasso Regression using standard datasets (a9a, covtype, phishing, rcv1, real-sim, news20, w8a) from LIBSVM.

Comparison: PSGA was compared against S-PStorm, SAGA, RDA, Prox-SVRG, and PStorm.
Performance Metrics:
- Convergence Speed: PSGA consistently reached the optimal objective value ($f^*$) faster in terms of both iterations and CPU time.
- Gradient Accuracy: PSGA demonstrated lower gradient estimation errors compared to competitors, particularly on large datasets like rcv1 and news20.
- Memory Efficiency: On large datasets (news20, real-sim), SAGA failed to run due to Out of Memory (OOM) errors caused by its gradient storage requirement, whereas PSGA ran successfully.
Specific Findings:
- On the a9a dataset, PSGA converged in 6 iterations (1.27s) compared to S-PStorm's 217 iterations (21.90s).
- On news20, PSGA achieved a better objective value (0.2724) in 162 iterations (5327s) compared to S-PStorm (0.2729 in 982 iterations, 35990s).
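The figures quoted above can be turned into speedup ratios with a few lines of arithmetic. The numbers are the ones reported in this summary; the dictionary layout and function name are only illustrative.

```python
# (iterations, seconds) as quoted in the findings above.
reported = {
    "a9a": {"PSGA": (6, 1.27), "S-PStorm": (217, 21.90)},
    "news20": {"PSGA": (162, 5327.0), "S-PStorm": (982, 35990.0)},
}

def speedups(results):
    """Return {dataset: (iteration ratio, time ratio)} of S-PStorm over PSGA."""
    ratios = {}
    for dataset, runs in results.items():
        iters_psga, secs_psga = runs["PSGA"]
        iters_other, secs_other = runs["S-PStorm"]
        ratios[dataset] = (iters_other / iters_psga, secs_other / secs_psga)
    return ratios

for dataset, (iter_ratio, time_ratio) in speedups(reported).items():
    print(f"{dataset}: {iter_ratio:.1f}x fewer iterations, {time_ratio:.1f}x less time")
```

On a9a the ratio in iterations (about 36x) is roughly twice the ratio in time (about 17x), which suggests that a single PSGA iteration costs more than a single S-PStorm iteration there, even though the total run is shorter.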

5. Significance

This paper presents a robust solution for large-scale composite optimization problems where:

Strong convexity cannot be assumed (a common limitation in real-world data).
Memory is constrained (making SAGA infeasible).
Step size tuning is difficult (making fixed-step methods inefficient).

By combining variance reduction with a stabilized, adaptive Barzilai-Borwein step size, PSGA offers a memory-efficient method with theoretical guarantees and no hand-tuned step size. In the reported experiments, it outperforms the compared algorithms in both speed and accuracy on general convex problems.
