Convex and Non-convex Federated Learning with Stale Stochastic Gradients: Diminishing Step Size is All You Need

This paper proposes a general framework for federated learning with delayed stochastic gradients, demonstrating that a pre-chosen diminishing step size is sufficient to achieve optimal convergence rates for both convex and non-convex objectives, thereby eliminating the need for complex delay-adaptive schemes.

Xinran Zheng, Tara Javidi, Behrouz Touri

Published 2026-03-04

Imagine a massive, global cooking competition where a head chef (the Central Server) wants to create the perfect dish. However, the chef doesn't have all the ingredients or the skills to do it alone. Instead, they have n assistants (the Agents) scattered all over the world, each with their own unique pantry and cooking style.

The goal is for everyone to work together to minimize the "badness" of the final dish (the Global Objective).

Here is the problem: The assistants are far away. They send their advice back to the chef, but the advice is often:

  1. Noisy: They might guess the taste based on a tiny sample (Stochastic).
  2. Biased: They might be using a weird measuring cup that always adds a little too much salt (Biased).
  3. Late: Because of bad internet or traffic, the advice the chef receives today was actually written down yesterday, or even last week (Stale/Delayed).

The Old Way vs. The New Way

The Old Way (Adaptive Step Sizes):
In the past, when the chef received late or messy advice, they thought, "Oh no, this is tricky! I need to be super smart. I need to constantly change my cooking speed based on how late the advice is." If the advice was very old, they would slow down. If it was fresh, they would speed up. This required a complex, constantly adjusting algorithm.

The New Way (This Paper's Discovery):
The authors of this paper, Xinran Zheng, Tara Javidi, and Behrouz Touri, discovered something surprisingly simple: You don't need to be that smart.

They found that the chef doesn't need to constantly adjust their speed based on the delays. Instead, they just need to follow a simple rule: "Start fast, but slow down gradually over time."

Think of it like driving a car down a foggy, winding road where the GPS signal is lagging.

  • The Old Idea: "The GPS is lagging by 5 seconds! I must brake hard! Now it's lagging by 2 seconds, I can accelerate!" (This is the Adaptive Step Size).
  • The New Idea: "I'll just drive fast at first, but I'll slowly take my foot off the gas as I get closer to the destination, regardless of the GPS lag." (This is the Diminishing Step Size).

The "Secret Sauce": The Diminishing Step Size

The paper proves that if you simply start with a big step and make your steps smaller and smaller as time goes on (mathematically, a "diminishing step size"), you will reach the perfect dish just as fast as the complex, adaptive method.

Why does this work?

  • Early on: You need big steps to get moving quickly. The delays don't matter much because you are far from the solution anyway.
  • Later on: As you get close to the perfect dish, you need to be precise. By making your steps tiny, you stop overshooting the target, even if the advice you are using is slightly old or slightly wrong.
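The intuition above can be seen in a toy simulation. The sketch below is an illustration of the general idea, not the paper's actual federated algorithm: a single worker minimizes f(x) = x², every stochastic gradient arrives a fixed number of steps late, and the server applies a pre-chosen 1/t step size that never looks at the delay. The function, delay model, and constants are all assumptions made for the example.

```python
import random

def run_delayed_sgd(T=5000, delay=5, x0=5.0, seed=0):
    """Toy SGD on f(x) = x**2 where every gradient arrives `delay` steps late.

    The step size follows the pre-chosen schedule 1/t and never inspects
    the delay. (A minimal illustration of the paper's message -- that a
    delay-oblivious diminishing schedule still converges -- not its
    exact algorithm or constants.)
    """
    random.seed(seed)
    x = x0
    buf = []                                        # gradients still "in flight"
    for t in range(1, T + 1):
        buf.append(2 * x + random.gauss(0, 0.5))    # noisy gradient computed now...
        if len(buf) > delay:
            g = buf.pop(0)                          # ...but only applied `delay` steps later
            x -= (1.0 / t) * g                      # diminishing, delay-oblivious step
    return x
```

Early on, the big 1/t steps cause some overshoot from the stale gradients, but as the steps shrink the iterate settles near the minimum at 0 even though no step ever accounted for the lag.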

The Three Scenarios They Tested

The authors tested this "slow down gradually" rule on three types of cooking challenges:

  1. The "Messy Kitchen" (Non-Convex):
    Imagine a kitchen with many hidden traps and dead ends. The goal is just to find any spot where the dish can't be improved by small tweaks (a stationary point, or local optimum).

    • Result: The simple "slow down" rule works just as well as the complex adaptive rule. It finds a good spot efficiently.
  2. The "Perfect Bowl" (Strongly Convex):
    Imagine a smooth, bowl-shaped valley where there is only one perfect bottom point.

    • Result: The simple rule finds the bottom of the bowl at the optimal rate, matching the best known theoretical guarantees, even with the delays and bad data.
  3. The "Flat Plateau" (General Convex):
    Imagine a wide, flat area where the ground is mostly level, but you want to find the absolute lowest point.

    • Result: The simple rule gets you there almost as fast as the complex adaptive method. It might be a tiny bit slower (by a logarithmic factor, which is math-speak for "a little bit of extra time"), but it's practically the same.
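For readers curious what "diminishing step size" typically looks like in each scenario, the helper below sketches the schedules most commonly used in the SGD literature for these problem classes. The specific constants and exponents are standard textbook choices, not necessarily the paper's exact schedules.

```python
import math

def step_size(t, regime, eta0=1.0, mu=1.0):
    """Typical pre-chosen diminishing schedules at step t >= 1.

    Standard choices from the stochastic-optimization literature
    (illustrative; the paper's exact constants may differ):
      - strongly convex ("perfect bowl"):  eta_t ~ eta0 / (mu * t)
      - general convex / non-convex:       eta_t ~ eta0 / sqrt(t)
    """
    if regime == "strongly_convex":
        return eta0 / (mu * t)
    return eta0 / math.sqrt(t)
```

Both schedules are fixed in advance: the step at time t depends only on t (and problem constants like the strong-convexity parameter mu), never on how stale the incoming gradients are.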

Why This Matters

In the real world of Federated Learning (like training AI on your phone without sending your photos to a server), data is messy, devices are slow, and connections drop.

  • Before: Engineers had to build complex systems to detect delays and adjust learning rates on the fly. This is hard to code and hard to maintain.
  • Now: This paper says, "Just use a simple timer that tells the system to slow down over time."

The Takeaway:
You don't need a fancy, adaptive GPS to navigate a foggy road with laggy signals. You just need to know that patience pays off. If you start strong and gradually slow down, you will arrive at the destination just as effectively as the person who is frantically checking their watch and adjusting their speed every second.

In short: "Diminishing step size is all you need."
