Convex and Non-convex Federated Learning with Stale Stochastic Gradients: Diminishing Step Size is All You Need

This paper proposes a general framework for federated learning with delayed stochastic gradients, demonstrating that a pre-chosen diminishing step size is sufficient to achieve optimal convergence rates for both convex and non-convex objectives, thereby eliminating the need for complex delay-adaptive schemes.

Xinran Zheng, Tara Javidi, Behrouz Touri

Published 2026-03-04

Imagine a massive, global cooking competition where a head chef (the Central Server) wants to create the perfect dish. However, the chef doesn't have all the ingredients or the skills to do it alone. Instead, they have n assistants (the Agents) scattered all over the world, each with their own unique pantry and cooking style.

The goal is for everyone to work together to minimize the "badness" of the final dish (the Global Objective).

Here is the problem: The assistants are far away. They send their advice back to the chef, but the advice is often:

  1. Noisy: They might guess the taste based on a tiny sample (Stochastic).
  2. Biased: They might be using a weird measuring cup that always adds a little too much salt (Biased).
  3. Late: Because of bad internet or traffic, the advice the chef receives today was actually written down yesterday, or even last week (Stale/Delayed).

The Old Way vs. The New Way

The Old Way (Adaptive Step Sizes):
In the past, when the chef received late or messy advice, they thought, "Oh no, this is tricky! I need to be super smart. I need to constantly change my cooking speed based on how late the advice is." If the advice was very old, they would slow down. If it was fresh, they would speed up. This required a complex, constantly adjusting algorithm.

The New Way (This Paper's Discovery):
The authors of this paper, Xinran Zheng, Tara Javidi, and Behrouz Touri, discovered something surprisingly simple: You don't need to be that smart.

They found that the chef doesn't need to constantly adjust their speed based on the delays. Instead, they just need to follow a simple rule: "Start fast, but slow down gradually over time."

Think of it like driving a car down a foggy, winding road where the GPS signal is lagging.

  • The Old Idea: "The GPS is lagging by 5 seconds! I must brake hard! Now it's lagging by 2 seconds, I can accelerate!" (This is the Adaptive Step Size).
  • The New Idea: "I'll just drive fast at first, but I'll slowly take my foot off the gas as I get closer to the destination, regardless of the GPS lag." (This is the Diminishing Step Size).

The "Secret Sauce": The Diminishing Step Size

The paper proves that if you simply start with a big step and make your steps smaller and smaller as time goes on (mathematically, a "diminishing step size"), you will reach the perfect dish just as fast as the complex, adaptive method.

Why does this work?

  • Early on: You need big steps to get moving quickly. The delays don't matter much because you are far from the solution anyway.
  • Later on: As you get close to the perfect dish, you need to be precise. By making your steps tiny, you stop overshooting the target, even if the advice you are using is slightly old or slightly wrong.
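The intuition above can be seen in a toy simulation. The sketch below is an illustration of the general idea, not the paper's actual federated algorithm: a single worker minimizes f(x) = x², every stochastic gradient arrives a fixed number of steps late, and the server applies a pre-chosen 1/t step size that never looks at the delay. The function, delay model, and constants are all assumptions made for the example.

```python
import random

def run_delayed_sgd(T=5000, delay=5, x0=5.0, seed=0):
    """Toy SGD on f(x) = x**2 where every gradient arrives `delay` steps late.

    The step size follows the pre-chosen schedule 1/t and never inspects
    the delay. (A minimal illustration of the paper's message -- that a
    delay-oblivious diminishing schedule still converges -- not its
    exact algorithm or constants.)
    """
    random.seed(seed)
    x = x0
    buf = []                                        # gradients still "in flight"
    for t in range(1, T + 1):
        buf.append(2 * x + random.gauss(0, 0.5))    # noisy gradient computed now...
        if len(buf) > delay:
            g = buf.pop(0)                          # ...but only applied `delay` steps later
            x -= (1.0 / t) * g                      # diminishing, delay-oblivious step
    return x
```

Early on, the big 1/t steps cause some overshoot from the stale gradients, but as the steps shrink the iterate settles near the minimum at 0 even though no step ever accounted for the lag.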

The Three Scenarios They Tested

The authors tested this "slow down gradually" rule on three types of cooking challenges:

  1. The "Messy Kitchen" (Non-Convex):
    Imagine a kitchen with many hidden traps and dead ends. The goal is just to find any spot where the dish can't be improved by small tweaks (a stationary point, or local optimum).

    • Result: The simple "slow down" rule works just as well as the complex adaptive rule. It finds a good spot efficiently.
  2. The "Perfect Bowl" (Strongly Convex):
    Imagine a smooth, bowl-shaped valley where there is only one perfect bottom point.

    • Result: The simple rule finds the bottom of the bowl at the optimal rate, matching the best known theoretical guarantees, even with the delays and bad data.
  3. The "Flat Plateau" (General Convex):
    Imagine a wide, flat area where the ground is mostly level, but you want to find the absolute lowest point.

    • Result: The simple rule gets you there almost as fast as the complex adaptive method. It might be a tiny bit slower (by a logarithmic factor, which is math-speak for "a little bit of extra time"), but it's practically the same.
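For readers curious what "diminishing step size" typically looks like in each scenario, the helper below sketches the schedules most commonly used in the SGD literature for these problem classes. The specific constants and exponents are standard textbook choices, not necessarily the paper's exact schedules.

```python
import math

def step_size(t, regime, eta0=1.0, mu=1.0):
    """Typical pre-chosen diminishing schedules at step t >= 1.

    Standard choices from the stochastic-optimization literature
    (illustrative; the paper's exact constants may differ):
      - strongly convex ("perfect bowl"):  eta_t ~ eta0 / (mu * t)
      - general convex / non-convex:       eta_t ~ eta0 / sqrt(t)
    """
    if regime == "strongly_convex":
        return eta0 / (mu * t)
    return eta0 / math.sqrt(t)
```

Both schedules are fixed in advance: the step at time t depends only on t (and problem constants like the strong-convexity parameter mu), never on how stale the incoming gradients are.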

Why This Matters

In the real world of Federated Learning (like training AI on your phone without sending your photos to a server), data is messy, devices are slow, and connections drop.

  • Before: Engineers had to build complex systems to detect delays and adjust learning rates on the fly. This is hard to code and hard to maintain.
  • Now: This paper says, "Just use a simple timer that tells the system to slow down over time."

The Takeaway:
You don't need a fancy, adaptive GPS to navigate a foggy road with laggy signals. You just need to know that patience pays off. If you start strong and gradually slow down, you will arrive at the destination just as effectively as the person who is frantically checking their watch and adjusting their speed every second.

In short: "Diminishing step size is all you need."
