A Note on the Gradient-Evaluation Sequence in Accelerated Gradient Methods

This paper resolves an open question in convex optimization by proving that the gradient-evaluation sequence in Nesterov's accelerated gradient descent method, including in projection-based and non-Euclidean settings, achieves the optimal $O(L/k^2)$ convergence rate for the objective function value, matching the performance of the standard solution sequence.

Yan Wu, Yipeng Zhang, Lu Liu, Yuyuan Ouyang

Published Tue, 10 Ma

Imagine you are trying to find the lowest point in a vast, foggy valley (this is your optimization problem). You can't see the bottom, but you have a magical compass that tells you which way is "down" (the gradient). Your goal is to get to the bottom as fast as possible.

For decades, the smartest hikers (algorithms) used a trick called Nesterov's Accelerated Gradient Descent (AGD). Think of this like a skier who doesn't just look at the slope right under their feet, but also leans forward, anticipating where they will be in a split second. This "look-ahead" momentum allows them to zoom down the hill much faster than a hiker who just takes one step at a time.

The Two Lines of Hikers

Here is the tricky part about how this skiing technique works. The algorithm actually keeps track of two different lines of hikers simultaneously:

  1. The "Look-Ahead" Line (Sequence $x_k$): These hikers are the ones the skier leans toward. They are used to check the compass (calculate the gradient). They are the "scouts" running ahead to see the terrain.
  2. The "Official Finish" Line (Sequence $\tilde{x}_k$): These are the hikers who actually stop and declare, "We are here! This is our best guess for the bottom."
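The two lines of hikers can be sketched in a few lines of Python. This is a minimal toy version on a simple quadratic of our own choosing (the paper's results cover far more general settings, including constraints and non-Euclidean geometries); the variable names mirror the article's notation, with `x` as the look-ahead sequence $x_k$ and `x_tilde` as the official sequence $\tilde{x}_k$:

```python
import numpy as np

# Toy smooth convex objective f(x) = 0.5 * x^T A x (illustrative choice).
A = np.diag([10.0, 1.0])   # a mildly ill-conditioned valley
L = 10.0                   # smoothness constant = largest eigenvalue of A

def f(x):
    return 0.5 * x @ A @ x

def grad(x):
    return A @ x

x_tilde = np.array([1.0, 1.0])   # "official finish" sequence (x~_k)
x = x_tilde.copy()               # "look-ahead" sequence (x_k): gradients are taken HERE
t = 1.0                          # standard momentum schedule parameter
for k in range(200):
    x_tilde_next = x - grad(x) / L                # gradient step at the look-ahead point
    t_next = (1 + np.sqrt(1 + 4 * t * t)) / 2     # t_{k+1} = (1 + sqrt(1 + 4 t_k^2)) / 2
    # lean forward: extrapolate past the new official point
    x = x_tilde_next + ((t - 1) / t_next) * (x_tilde_next - x_tilde)
    x_tilde, t = x_tilde_next, t_next

# Both sequences approach the minimizer at the origin; the paper's contribution
# is that f(x_k), not just f(x~_k), provably enjoys the accelerated rate.
print(f(x_tilde), f(x))
```

Note that every gradient call happens at `x`, the scout, while `x_tilde` is what the algorithm would report as its answer; that asymmetry is exactly what makes the question in this paper nontrivial.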

The Big Question:
For a long time, mathematicians knew that the Official Finish Line hikers were incredibly fast. They could reach the bottom with a speed that gets better and better as time goes on (specifically, the error shrinks like $1/k^2$).

However, nobody knew if the Look-Ahead Line (the scouts running ahead to check the compass) was also fast enough to be considered a "winner." Could the scouts themselves be the winners? Or were they just fast runners who got lost?

In the simplest case (a flat, open field with no fences), we knew the scouts were fast. But what if the valley has fences (constraints) or strange, warped terrain (non-Euclidean settings)? Could the scouts still win? This was a mystery.

The Detective Work: The "Computer-aided" Clue

The authors of this paper decided to solve this mystery using a high-tech detective tool called PEP (Performance Estimation Problem).

Imagine you are a detective trying to prove a suspect is guilty. Instead of just guessing, you build a super-computer simulation that tries to construct the worst possible valley imaginable to see if the hikers fail.

  • The computer ran thousands of simulations.
  • It tried to break the "Look-Ahead" hikers.
  • The Result: The computer couldn't break them! No matter how tricky the valley or the fences, the "Look-Ahead" hikers were always finding the bottom just as fast as the "Official Finish" hikers.

The computer gave them a strong hunch: "Yes, the scouts are winners too!"

The Proof: From Hunch to Law

A hunch isn't enough for mathematicians; they need a rigorous proof. The authors took the patterns the computer found and wrote a formal mathematical argument (a "human-readable proof") to confirm it.

They proved that:

  1. Yes, the scouts win: Even in valleys with fences (constrained problems) and weird terrain (non-Euclidean settings), the sequence of points used to check the compass ($x_k$) converges to the solution just as fast as the official solution sequence.
  2. The speed is the same: They both achieve the "Gold Medal" speed of $O(1/k^2)$. This means if you double your steps, you get four times closer to the solution, not just two times closer.
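The $1/k^2$ "Gold Medal" speed can be checked numerically on a toy problem of our own choosing (a simple unconstrained quadratic; the paper proves the guarantee in much harder constrained and non-Euclidean settings). The classical bound for the official sequence is $f(\tilde{x}_k) - f^* \le 2L\|x_0 - x^*\|^2/(k+1)^2$, and we can verify it holds at every step:

```python
import numpy as np

# Toy problem: f(x) = 0.5 * x^T A x, minimized at x* = 0 with f* = 0.
A = np.diag([10.0, 1.0])
L = 10.0                                   # smoothness constant (largest eigenvalue)

def f(x):
    return 0.5 * x @ A @ x

def grad(x):
    return A @ x

x0 = np.array([1.0, 1.0])
x_tilde, x, t = x0.copy(), x0.copy(), 1.0
R2 = x0 @ x0                               # ||x0 - x*||^2
bound_held = True
for k in range(1, 101):
    x_tilde_next = x - grad(x) / L
    t_next = (1 + np.sqrt(1 + 4 * t * t)) / 2
    x = x_tilde_next + ((t - 1) / t_next) * (x_tilde_next - x_tilde)
    x_tilde, t = x_tilde_next, t_next
    # Classical O(L/k^2) guarantee: error <= 2 L R^2 / (k+1)^2.
    # Doubling k shrinks the right-hand side by a factor of four.
    bound_held &= bool(f(x_tilde) <= 2 * L * R2 / (k + 1) ** 2)

print(bound_held)
```

The "four times closer" claim is visible in the bound itself: replacing $k$ with $2k$ divides $2LR^2/(k+1)^2$ by roughly four.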

Why Does This Matter?

Think of it like a relay race.

  • Before this paper: We knew the runner who crossed the finish line (the official solution) was a champion. We weren't sure if the runner who passed the baton (the gradient evaluation) was also a champion, especially if the track had obstacles.
  • After this paper: We now know that both runners are champions. The runner checking the path is just as skilled as the one crossing the finish line.

This is a big deal because it simplifies how we think about these algorithms. It tells us that the "scouts" aren't just a side effect; they are naturally excellent solutions in their own right. This deepens our understanding of why acceleration works and gives us more confidence in using these methods for complex real-world problems, like training AI models or optimizing supply chains, where "fences" (constraints) are everywhere.

In a nutshell: The authors used a computer to guess that the "scouts" in an accelerated algorithm are actually winners, and then they wrote a math proof to confirm it, showing that these scouts are just as fast as the official finishers, even in the most difficult, obstacle-filled landscapes.