Adaptive Polyak Stepsize with Level-value Adjustment for Distributed Optimization

Imagine a group of friends trying to find the lowest point in a vast, foggy valley. They can't see the bottom, and they can't talk to everyone at once; they can only whisper to their immediate neighbors. This is the essence of distributed optimization: many agents working together to solve a problem without a central boss.

The paper you shared introduces a clever new way for these friends to decide how big of a step to take toward the bottom. Here is the breakdown in simple terms:

The Problem: The "Step Size" Dilemma

In this valley, every friend has a map (an algorithm) that tells them which way is "down." But there's a catch: how big of a step should they take?

Too big: They might overshoot the bottom, bounce back up, and start running in circles (divergence).
Too small: They will eventually get there, but it will take a million years (slow convergence).

Usually, to pick the perfect step size, you need to know exactly how deep the valley is at the very bottom (the "optimal value"). But in this scenario, no one knows how deep the bottom is yet. They are all guessing.

The Old Solution: The "Polyak Step"

There is a famous, brilliant method called the Polyak Stepsize. It's like a magic compass that says, "The further you are from the bottom, the bigger your step should be. As you get closer, take smaller steps."

The problem? This magic compass requires you to know the exact height of the valley floor ($f^*$) to work. Since the friends don't know the bottom, they can't use this compass. If they try to guess the bottom and use the compass anyway, they often end up running in circles and crashing (diverging), as shown in the paper's Figure 1.

The New Solution: "DPS-LA" (The Smart Guessers)

The authors created a new algorithm called DPS-LA (Distributed Polyak Stepsize with Level-value Adjustment). Think of it as a team of friends who are smart guessers rather than know-it-alls.

Here is how they do it, using a creative analogy:

1. The "Sliding Window" (The Level-Value Adjustment)

Instead of guessing the bottom once and sticking with it, the friends use a sliding window of their recent history.

Imagine every friend keeps a notebook of their last few steps.
They ask: "If I assume the bottom is at height X, does my recent path make sense?"
They draw a "safety zone" (a half-space) based on their last few moves. If their assumption about the bottom height is wrong, their recent path will look like it's breaking the rules of physics (the math becomes "infeasible").

2. The "Self-Correcting Mechanism"

When the math breaks (the path looks impossible), the friend realizes: "Ah! My guess about the bottom height was too low. I need to raise my guess."

They update their guess to be a little higher (closer to the truth).
They keep doing this. Every time they hit a contradiction, they refine their guess.
Over time, their guess of the "bottom height" gets tighter and tighter, almost as if they are slowly uncovering the true bottom without ever seeing it directly.

3. The "Team Huddle" (Consensus)

In a distributed network, everyone is also trying to agree on where they are standing.

The algorithm forces everyone to average their positions with their neighbors (like a team huddle).
This ensures that even though they are guessing the bottom height individually, they are all moving toward the same spot.
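
The "team huddle" can be sketched in a few lines of Python. This is only an illustration of neighbor averaging, not the paper's code: the 4-agent ring, the weight matrix `W`, and the starting positions are all made-up assumptions.

```python
def mix(positions, weights):
    """One huddle: every agent replaces its position by a weighted
    average of its own position and its neighbors' positions."""
    n = len(positions)
    return [sum(weights[i][j] * positions[j] for j in range(n)) for i in range(n)]

# Hypothetical ring of 4 agents: each one listens to itself and its two neighbors.
# Every row and every column sums to 1, so the group average is preserved.
W = [
    [0.50, 0.25, 0.00, 0.25],
    [0.25, 0.50, 0.25, 0.00],
    [0.00, 0.25, 0.50, 0.25],
    [0.25, 0.00, 0.25, 0.50],
]

x = [0.0, 4.0, 8.0, 12.0]          # made-up starting positions
for _ in range(50):
    x = mix(x, W)

spread = max(x) - min(x)           # how far apart the agents still are
mean = sum(x) / len(x)             # stays at the initial average, 6.0
```

After repeated huddles the spread shrinks toward zero while the average stays put, which is why averaging alone produces agreement but not progress toward the bottom; the gradient step supplies the progress.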

Why is this a Big Deal?

No Magic Knowledge Needed: You don't need to know the answer beforehand. The algorithm figures it out on the fly.
No Manual Tuning: Usually, you have to spend hours tweaking the "step size" settings. This algorithm adjusts itself automatically.
Linear Speedup: This is the coolest part. If you double the number of friends (agents) helping to find the bottom, you cut the time in half. The paper proves mathematically that the more people you add, the faster the whole group solves the problem.

The Real-World Test

The authors tested this with a computer simulation of 4 agents trying to solve a math puzzle.

The Old Way (DGD, short for Distributed Gradient Descent): The friends walked slowly, taking tiny, cautious steps. It took a long time to get close to the answer.
The New Way (DPS-LA): The friends took big, confident steps at first, then slowed down perfectly as they got closer. They found the answer much faster and more accurately.

Summary

Think of DPS-LA as a group of hikers in a foggy mountain range. Instead of waiting for a guide to tell them the elevation of the valley floor, they constantly check their recent footsteps. If their path looks inconsistent, they adjust their mental estimate of how low the floor is. By doing this together, they find the lowest point much faster than if they were walking alone or following a rigid, pre-set plan.

It turns a difficult, "guess-the-answer" problem into a self-correcting, collaborative journey.

Here is a detailed technical summary of the paper "Adaptive Polyak Stepsize with Level-value Adjustment for Distributed Optimization."

1. Problem Statement

The paper addresses the challenge of stepsize selection in distributed constrained optimization problems involving multi-agent systems.

Context: Agents collaboratively minimize a global objective function $f(x) = \frac{1}{n}\sum_{i=1}^n f_i(x)$, where each agent $i$ holds a local convex function $f_i$ and communicates only with neighbors over a network.
The Core Challenge:
- Standard Distributed Gradient Descent (DGD): Typically relies on diminishing stepsizes (slow convergence) or constant stepsizes (convergence to a neighborhood with steady-state error).
- Polyak Stepsize: In centralized settings, the Polyak stepsize $\gamma_k = \frac{f(x_k) - f^*}{\|g_k\|^2}$, where $g_k$ is a (sub)gradient of $f$ at $x_k$, offers parameter-free adaptability and fast convergence. However, it requires knowledge of the global optimal value $f^*$ (or, per agent, the local function values $f_i(x^*)$ at the global optimum $x^*$), which is unavailable in distributed settings.
- Direct Application Failure: The authors demonstrate that naively applying the standard Polyak stepsize to Distributed Gradient Descent (DGD) leads to algorithmic divergence. This occurs because local function-value gaps do not accurately reflect global progress, causing misalignment between local descent and network consensus.
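
For reference, the centralized Polyak stepsize from the list above can be sketched as follows. This is a minimal one-dimensional illustration that assumes $f^*$ is known exactly, which is precisely what the distributed setting lacks; the function `polyak_descent` and the test problem $f(x) = (x-3)^2$ are hypothetical choices, not taken from the paper.

```python
def polyak_descent(f, grad, x0, f_star, iters=50):
    """Gradient descent with the classical Polyak stepsize
    gamma_k = (f(x_k) - f_star) / g_k^2, for a scalar variable."""
    x = x0
    for _ in range(iters):
        g = grad(x)
        g_sq = g * g
        if g_sq == 0.0:            # zero gradient: already at a minimizer
            break
        gamma = (f(x) - f_star) / g_sq
        x = x - gamma * g
    return x

# Hypothetical test problem: f(x) = (x - 3)^2, so f_star = 0 at x = 3.
f = lambda x: (x - 3.0) ** 2
grad = lambda x: 2.0 * (x - 3.0)

x_final = polyak_descent(f, grad, x0=13.0, f_star=0.0)
```

On this quadratic the stepsize works out to a constant 1/4, so the distance to the minimizer halves at every iteration without any tuning. Swapping in a wrong `f_star` changes that behavior, which is the gap the level-value mechanism is meant to close.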

2. Methodology: DPS-LA Algorithm

The authors propose a novel algorithm called DPS-LA (Distributed Polyak Stepsize with Level-value Adjustment). The methodology consists of three key components:

A. Distributed Update with Aggregated State

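Instead of evaluating the gradient at its local state $x_{i,k}$, each agent first computes an aggregated state $z_{i,k} = \sum_{j=1}^n w_{ij} x_{j,k}$ (a weighted average of its own and its neighbors' states, with $w_{ij} = 0$ for agents that are not neighbors). The update rule is: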
$x_{i,k+1} = P_X \left( z_{i,k} - \alpha_{i,k} \nabla f_i(z_{i,k}) \right)$
where $P_X$ is the projection onto the constraint set $X$ and $\alpha_{i,k}$ is the stepsize of agent $i$ at iteration $k$.
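
A single round of this update might look like the sketch below. Several pieces are assumptions made purely for illustration: scalar states, a box constraint $X = [\text{lo}, \text{hi}]$, local quadratics $f_i(x) = \frac{1}{2}(x - a_i)^2$, a ring weight matrix, and fixed placeholder stepsizes (the adaptive rule is described later in the text).

```python
def project_box(x, lo, hi):
    """Projection of a scalar onto the interval [lo, hi]."""
    return min(max(x, lo), hi)

def aggregated_step(x, W, grads, alphas, lo, hi):
    """x_{i,k+1} = P_X(z_{i,k} - alpha_{i,k} * grad f_i(z_{i,k})),
    with z_{i,k} = sum_j w_ij * x_{j,k}."""
    n = len(x)
    z = [sum(W[i][j] * x[j] for j in range(n)) for i in range(n)]
    return [project_box(z[i] - alphas[i] * grads[i](z[i]), lo, hi) for i in range(n)]

# Hypothetical setup: 4 agents on a ring, f_i(x) = 0.5 * (x - a_i)^2.
W = [
    [0.50, 0.25, 0.00, 0.25],
    [0.25, 0.50, 0.25, 0.00],
    [0.00, 0.25, 0.50, 0.25],
    [0.25, 0.00, 0.25, 0.50],
]
targets = [1.0, 2.0, 3.0, 4.0]
grads = [lambda x, a=a: x - a for a in targets]   # gradient of 0.5 * (x - a)^2
alphas = [0.5] * 4                                 # placeholder stepsizes

x_next = aggregated_step([0.0] * 4, W, grads, alphas, lo=0.0, hi=3.0)
x_clipped = aggregated_step([10.0] * 4, W, grads, alphas, lo=0.0, hi=3.0)
```

Note that the gradient is evaluated at the aggregated state `z[i]`, not at the agent's own previous state, which is the distinction the paragraph above draws.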

B. Level-Value Adjustment (The Core Innovation)

To replace the unknown global optimal value $f^*$, the algorithm employs a Level-Value Adjustment mechanism inspired by the Polyak Stepsize Violation Detector (PSVD):

Local Estimation: Each agent maintains a level-value estimate $\bar{f}_{i,k}$ intended to approximate $f_i(x^*)$.
Feasibility Check: At each iteration, the agent solves a lightweight linear feasibility problem over a sliding window of $\eta$ iterations. The constraints are derived from the condition that the optimal solution must lie within specific half-spaces defined by the gradients and the current stepsize.
$(\nabla f_i(z_{i,k}))^T (x - z_{i,k}) \leq -\frac{1}{\bar{\gamma}} \beta_{i,k} \|\nabla f_i(z_{i,k})\|^2$
Update Rule: If the system of inequalities becomes infeasible, it implies the current level estimate $\bar{f}_{i,k}$ is too low (an underestimate). The agent updates the level value using a convex combination of the previous estimate and the minimum observed function value in the window:
$\bar{f}'_i = \frac{\gamma}{\bar{\gamma}} \bar{f}_{i,k} + \left(1 - \frac{\gamma}{\bar{\gamma}}\right) \min_{t \in \text{window}} f_i(z_{i,t})$
This ensures the estimate monotonically tightens toward the true $f_i(x^*)$.
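
The check-then-adjust loop can be illustrated in one dimension, where each half-space constraint reduces to a one-sided bound on $x$ and feasibility is just an interval intersection. This sketch rests on assumptions that the summary does not spell out: the form $\beta = \gamma (f - \bar{f}) / \|\nabla f\|^2$ for the Polyak-type stepsize, the requirement $0 < \gamma < \bar{\gamma}$, and a toy function $f(x) = x^2$. In higher dimensions the feasibility check would need a linear-programming solver instead.

```python
def polyak_beta(f_val, level, g, gamma):
    """ASSUMED form of beta: Polyak stepsize with the level value in place of f*."""
    return gamma * (f_val - level) / (g * g)

def window_is_feasible(window, gamma_bar):
    """window holds (z, g, beta) scalars. Returns True if some x satisfies
    g * (x - z) <= -(beta / gamma_bar) * g^2 for every entry."""
    lower, upper = float("-inf"), float("inf")
    for z, g, beta in window:
        if g == 0.0:
            continue                      # constraint reads 0 <= 0
        bound = z - (beta / gamma_bar) * g
        if g > 0:
            upper = min(upper, bound)     # x <= bound
        else:
            lower = max(lower, bound)     # x >= bound (dividing by g < 0 flips the sign)
    return lower <= upper

def adjust_level(level, window_f_values, gamma, gamma_bar):
    """Convex combination of the old level and the best value seen in the window."""
    ratio = gamma / gamma_bar
    return ratio * level + (1.0 - ratio) * min(window_f_values)

# Toy problem (hypothetical): f(x) = x^2 with true optimal value 0.
f = lambda x: x * x
grad = lambda x: 2.0 * x
gamma, gamma_bar = 1.0, 2.0
points = [1.0, -1.0]                      # recent iterates in the window

def build_window(level):
    return [(z, grad(z), polyak_beta(f(z), level, grad(z), gamma)) for z in points]

too_low_ok = window_is_feasible(build_window(-10.0), gamma_bar)   # False: level far too low
exact_ok = window_is_feasible(build_window(0.0), gamma_bar)       # True: level equals f*
raised = adjust_level(-10.0, [f(z) for z in points], gamma, gamma_bar)  # -4.5
```

With the level at -10 the two constraints demand $x \le -1.75$ and $x \ge 1.75$ at once, so the window is infeasible and the level is raised to -4.5, still below the smallest observed value of 1.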

C. Decaying Mechanism

To guarantee exact convergence to the optimum (rather than a neighborhood), the algorithm scales the stepsize by $1/c_k$, where $c_k = \sqrt{k+1}$ is a growing sequence, so the stepsize decays over time:
$\alpha_{i,k} = \frac{1}{c_k} \min \left\{ \max \left\{ \beta_{i,k}, \frac{c_0 \alpha_0}{2} \right\}, c_{k-1} \alpha_{i,k-1} \right\}$
This allows the stepsize to adapt aggressively initially but decay slowly enough to ensure convergence.
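
The stepsize rule above translates directly into code. The sketch below implements the formula as written for $k \ge 1$; the sequence of `betas` and the initial stepsize `alpha0` are made-up inputs, not values from the paper.

```python
import math

def c(k):
    """The growing sequence c_k = sqrt(k + 1)."""
    return math.sqrt(k + 1)

def next_stepsize(k, beta, alpha_prev, alpha0):
    """alpha_k = (1 / c_k) * min(max(beta, c_0 * alpha0 / 2), c_{k-1} * alpha_{k-1})."""
    floor = c(0) * alpha0 / 2.0           # lower guard on the Polyak-type stepsize
    cap = c(k - 1) * alpha_prev           # keeps c_k * alpha_k from increasing
    return min(max(beta, floor), cap) / c(k)

alpha0 = 1.0
betas = [3.0, 0.8, 0.9, 0.6, 0.1]         # hypothetical Polyak-type stepsizes
alphas = [alpha0]
for k, beta in enumerate(betas, start=1):
    alphas.append(next_stepsize(k, beta, alphas[-1], alpha0))

# The scaled sequence c_k * alpha_k is non-increasing and never drops below c_0 * alpha0 / 2.
scaled = [c(k) * a for k, a in enumerate(alphas)]
```

Read this way, the formula sandwiches $\alpha_{i,k}$ between $\frac{c_0 \alpha_0}{2 c_k}$ and $\frac{c_0 \alpha_0}{c_k}$: the adaptive term $\beta_{i,k}$ chooses where in that band the stepsize sits, while the $1/\sqrt{k+1}$ envelope supplies the decay.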

3. Key Contributions

Algorithmic Innovation: The first distributed Polyak stepsize algorithm that operates without prior knowledge of the global optimal value. It replaces the need for global information with a distributed, online level-value adjustment via linear feasibility checks.
Theoretical Guarantees:
- Consensus: Proved that agents reach consensus ($\lim_{k\to\infty} \|x_{i,k} - x_{j,k}\| = 0$).
- Convergence: Proved that the level-value estimates $\bar{f}_{i,k}$ asymptotically converge to $f_i(x^*)$, the local function values at the global optimum, rather than to each agent's own local minimum value.
- Rate: Established a sublinear convergence rate of $O\left(\frac{1}{\sqrt{nT}}\right)$ for the objective optimality gap, where $n$ is the number of agents and $T$ is the number of iterations.
Linear Speedup: The $O(1/\sqrt{nT})$ rate implies linear speedup with respect to the number of agents. As $n$ increases, the total communication rounds required to reach a specific accuracy decrease proportionally.
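
To see why this rate amounts to linear speedup, one can solve it for the iteration count. Suppose the optimality gap is bounded by $C / \sqrt{nT}$ for some constant $C$ that does not depend on $n$ or $T$ (an assumption made here for illustration; the summary does not state the constant). Requiring the gap to be at most a target accuracy $\epsilon > 0$ gives:
$\frac{C}{\sqrt{nT}} \leq \epsilon \iff nT \geq \frac{C^2}{\epsilon^2} \iff T \geq \frac{C^2}{n \epsilon^2}$
Under that assumption the required number of iterations scales as $1/n$, so doubling the number of agents halves the iterations needed for the same accuracy.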

4. Results

Numerical Experiments: The algorithm was tested on a distributed quadratic optimization problem with $N = 4$ agents.
- Convergence: DPS-LA demonstrated significantly faster convergence compared to standard DGD with diminishing stepsizes. It reached near-zero residual error within 50 iterations, whereas DGD was still slowly decreasing after 300 iterations.
- Level-Value Tracking: The estimated level values $\bar{f}_{i,k}$ rapidly converged to the true optimal values $f_i(x^*)$ .
- Scalability: Experiments varying the number of agents (3, 4, 5) confirmed that increasing the network size improved the convergence rate, validating the theoretical linear speedup.
Stability: The algorithm successfully avoided the divergence observed when naively applying Polyak stepsizes to DGD.

5. Significance

Bridging the Gap: This work successfully bridges the gap between the theoretical efficiency of Polyak stepsizes and the practical constraints of distributed systems where global information is unavailable.
Parameter-Free Adaptability: It eliminates the need for manual tuning of stepsize parameters or knowledge of Lipschitz constants, making the algorithm more robust and easier to deploy in real-world scenarios (e.g., smart grids, federated learning).
Computational Efficiency: The level-value adjustment requires solving only simple linear feasibility problems, making it computationally lightweight compared to other distributed optimization methods that might require complex consensus tracking or dual variables.
Theoretical Milestone: It provides the first theoretical guarantee for a distributed Polyak stepsize algorithm achieving linear speedup without prior knowledge of the optimal value.