Effectiveness of Distributed Gradient Descent with Local Steps for Overparameterized Models

This paper demonstrates that in the interpolation regime with linearly separable data, Distributed Gradient Descent with local steps (Local-GD) converges in direction to the centralized model regardless of the number of local steps, thereby explaining its effectiveness even under data heterogeneity.

Heng Zhu, Harsh Vardhan, Arya Mazumdar

Published 2026-03-24

Imagine you are trying to teach a massive class of students (the AI model) how to recognize cats in photos. But instead of having all the photos in one big classroom, the photos are scattered across 100 different houses (the distributed nodes).

In the old days, to teach the class, a teacher would have to run to every house, pick up a photo, bring it back, show it to the class, update the lesson plan, and repeat. This is slow and exhausting because of all the running back and forth (communication cost).

To speed things up, we invented a new method called Local-GD (or Federated Averaging). Here's how it works:

  1. The teacher sends the current lesson plan to all 100 houses.
  2. Each house stays home and studies their own photos for a while, making small updates to the lesson plan on their own (Local Steps).
  3. After studying for a bit, they send their updated plans back to the teacher.
  4. The teacher averages all 100 plans to create a new "Global Plan" and sends it out again.
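The four steps above can be sketched in a few lines of code. This is a minimal illustrative sketch, not the paper's implementation: the function name, the linear model with logistic loss, and all parameter values are assumptions made here for clarity.

```python
import numpy as np

def local_gd(client_data, rounds=50, local_steps=10, lr=0.1):
    """Minimal sketch of Local-GD (FedAvg) for a linear model with
    logistic loss. `client_data` is a list of (X, y) pairs per client,
    with labels y in {-1, +1}. Illustrative only."""
    dim = client_data[0][0].shape[1]
    w = np.zeros(dim)                      # the global "lesson plan"
    for _ in range(rounds):
        local_models = []
        for X, y in client_data:           # each "house" works alone
            w_local = w.copy()
            for _ in range(local_steps):   # local steps on local data
                # gradient ascent on correct-classification margins
                # (i.e., descent on the logistic loss)
                w_local += lr * X.T @ (y / (1 + np.exp(y * (X @ w_local)))) / len(y)
            local_models.append(w_local)
        w = np.mean(local_models, axis=0)  # the teacher averages the plans
    return w
```

Note that the averaging step is the only communication: each round costs one download and one upload per client, no matter how many local steps are taken.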

The Big Problem:
If the houses have very different types of photos (some have only kittens, some have only tigers, some have no cats at all), and the students study for too long on their own before sending the plan back, the teacher worries: "Will the final plan be a mess? Will it converge to a solution that only works for House #3 and fails for House #7?"

For years, theory suggested that if students studied too much locally, the group would get lost. But in practice, engineers noticed something weird: Even with huge amounts of local study, the group still learns incredibly well.

This paper answers the question: Why does this work, and exactly what solution are we actually finding?

The Core Discovery: The "North Star" Effect

The authors discovered that in the world of modern AI (where models are "overparameterized," meaning they are huge and have more parameters than training data points), there isn't just one perfect solution. There are millions of ways to get a 100% score on the training data.

The paper proves that, for linearly separable data, Local-GD converges in direction to the very same solution as if everyone had met in one room.

The Analogy: The Hiking Group

Imagine a group of hikers trying to find the highest peak in a foggy mountain range (the Global Optimal Solution).

  • Centralized Training: Everyone stands in one spot, looks at the whole map, and walks straight to the peak.
  • Local-GD: The group splits up. Each subgroup walks in their own direction for a while, then they all meet, compare notes, and walk together again.

The paper proves that even if the subgroups walk for a very long time on their own (even if the terrain is different for each group), as long as they keep meeting up and averaging their directions, they will all eventually point toward the exact same peak.

It doesn't matter if they walk 10 steps or 1,000 steps locally; the "direction" of the group always aligns with the "North Star" (the Max-Margin Solution: the classifier that separates the classes with the widest possible safety gap).
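We can illustrate this alignment numerically. The sketch below runs plain centralized gradient descent and Local-GD with two very different local-step budgets on the same separable toy data, then compares the final weight directions. All names, data, and hyperparameters are assumptions of this sketch, not taken from the paper.

```python
import numpy as np

def run_gd(chunks, rounds, local_steps, lr=0.1):
    """Run Local-GD (linear model, logistic loss) on a list of (X, y)
    chunks and return the final weight *direction* (unit vector).
    With a single chunk and local_steps=1 this reduces to plain
    centralized gradient descent. Illustrative sketch only."""
    w = np.zeros(chunks[0][0].shape[1])
    for _ in range(rounds):
        local_models = []
        for X, y in chunks:
            wl = w.copy()
            for _ in range(local_steps):
                wl += lr * X.T @ (y / (1 + np.exp(y * (X @ wl)))) / len(y)
            local_models.append(wl)
        w = np.mean(local_models, axis=0)
    return w / np.linalg.norm(w)

rng = np.random.default_rng(1)
Xp = rng.normal(1.0, 0.2, (30, 2)); Xn = rng.normal(-1.0, 0.2, (30, 2))
X = np.vstack([Xp, Xn]); y = np.hstack([np.ones(30), -np.ones(30)])

# One "room" vs. two heterogeneous clients (one has only positives,
# the other only negatives), with 10 vs. 100 local steps per round.
central = run_gd([(X, y)], rounds=2000, local_steps=1)
few     = run_gd([(Xp, np.ones(30)), (Xn, -np.ones(30))], 200, 10)
many    = run_gd([(Xp, np.ones(30)), (Xn, -np.ones(30))], 200, 100)

# Cosine similarities between the directions; all should be close to 1.
print(central @ few, central @ many)
```

Despite extreme heterogeneity (each client sees only one class) and a 10x difference in local-step count, the three directions nearly coincide, which is exactly the behavior the paper proves.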

Key Takeaways in Plain English

1. The "Direction" Matters More Than the "Distance"
In these huge AI models, the model's "size" (the norm of its weights, i.e., how big the numbers get) can grow without bound as training continues. But the direction it points in is what determines its predictions, and therefore its accuracy. The paper shows that Local-GD points in the exact same direction as the centralized method. It's like two arrows flying through the air: even if one is longer than the other, if they point at the same target, they hit the same bullseye.
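The arrow analogy is literal for a linear classifier: rescaling the weight vector by any positive constant leaves every prediction unchanged, because only the sign of the dot product matters. A tiny sketch (all values here are made up for illustration):

```python
import numpy as np

w = np.array([3.0, -1.0])          # one "arrow"
w_long = 50.0 * w                  # same direction, much larger norm
X = np.array([[1.0, 0.5], [-2.0, 1.0], [0.2, -4.0]])

# sign(w . x) is invariant to positive rescaling of w, so both
# models assign identical labels to every point.
print(np.sign(X @ w) == np.sign(X @ w_long))   # all True
```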

2. Why "Too Much" Local Study is Actually Good
Usually, we think "too much local study" leads to students getting stuck in their own echo chambers. But the paper shows that for these specific types of problems, taking more local steps actually helps the group converge to the best possible solution faster. It's like letting each student think deeply about their specific problem before sharing; the final group decision is smarter, not dumber.

3. The "Modified" Secret Sauce
The authors also found a small tweak to the algorithm (changing how the teacher averages the plans) that guarantees the group finds the perfect solution even if the step size (the learning rate) isn't perfectly tuned. It's like adding a small "compass correction" to ensure the hikers never drift off course, no matter how long they walk alone.

Why Should You Care?

  • It explains the magic: We've been using this method (Federated Learning) to train AI on phones and hospitals without sharing private data. This paper explains why it works so well even when data is messy and different across devices.
  • It saves money: Because we know we can take more local steps without breaking the model, we can reduce the number of times devices need to talk to the server. This saves battery life and internet bandwidth.
  • It builds trust: It gives us a mathematical guarantee that this distributed method isn't just a "hack" that works by luck; it is mathematically proven to find the same high-quality solution as a centralized supercomputer.

In summary: This paper is the "user manual" for a popular AI training method. It tells us that even if we let the students work alone for a long time in different rooms, as long as they check in with each other, they will all end up solving the puzzle in the exact same brilliant way.
