Stochastic and incremental subgradient methods for convex optimization on Hadamard spaces

This paper introduces a novel subgradient definition based on Busemann functions for convex optimization on Hadamard spaces, enabling the generalization of stochastic and incremental subgradient methods with guaranteed complexity bounds to nonpositively curved metric spaces such as BHV tree space.

Ariel Goodwin, Adrian S. Lewis, Genaro López-Acedo, Adriana Nicolae

Published Wed, 11 Ma

Imagine you are trying to find the perfect meeting spot for a group of friends who are scattered across a strange, warped landscape. In our normal, flat world (like a city grid), finding the "middle" point is easy: you just average the coordinates. But what if the world isn't flat? What if it's a Hadamard space: a complete, nonpositively curved mathematical landscape that bends away from you like the surface of a saddle or a hyperbolic plane, or even a tree-like structure where paths branch out and never loop back?

This paper tackles the problem of optimization (finding the best spot) in these weird, curved worlds. Specifically, it introduces a new, simpler way to take steps toward the solution without getting lost in complex math.

Here is the breakdown using everyday analogies:

1. The Problem: The "Flat World" Tools Don't Work

In our normal, flat world (Euclidean space), if you want to find the best spot, you use a tool called a subgradient. Think of a subgradient as a "slope sign" on a hill. It tells you: "If you want to go downhill, walk in this direction."

But in these curved, tree-like worlds, there are no straight lines and no flat "slopes" in the traditional sense. The old tools rely on linear math (like vectors and angles) that simply don't exist here. Trying to use the old tools is like trying to navigate a forest using a compass that only works on a flat map; it just doesn't fit the terrain.

2. The New Tool: The "Busemann Subgradient"

The authors invent a new kind of "slope sign" called a Busemann subgradient.

  • The Old Way (The Horizon): Imagine standing on a hill. In the old method, you had to look at the horizon and imagine a flat plane touching the hill. This is hard to do in a curved world.
  • The New Way (The Infinite Ray): Instead, imagine a ray of light shooting out from your feet toward infinity. In this new method, the "slope" isn't a flat plane; it's a direction and a speed.
    • Direction: Which way should you walk? (The ray).
    • Speed: How fast should you walk? (The "speed" factor).

This new tool is "primal," meaning it works directly on the ground you are standing on, rather than trying to project the problem onto a flat, imaginary map. It's like having a GPS that tells you, "Walk North at 3 miles per hour," rather than trying to calculate the angle of the slope relative to a flat horizon.
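To make the "infinite ray" picture concrete, here is a small numerical sketch (ours, not the paper's). In ordinary flat space, the Busemann function of the unit-speed ray t ↦ tu is b(x) = lim as t → ∞ of (‖x − tu‖ − t), and in that flat case the limit collapses to the simple formula −⟨x, u⟩. The variable names below are our own:

```python
import numpy as np

# In flat Euclidean space, the Busemann function of the unit-speed ray
# t -> t*u (u a unit vector) is b(x) = lim_{t->inf} (||x - t*u|| - t),
# which works out to -<x, u>. A quick numerical check of that limit:

u = np.array([0.6, 0.8])           # unit direction of the ray
x = np.array([3.0, -1.0])          # an arbitrary point

t = 1e8                            # "toward infinity"
busemann_approx = np.linalg.norm(x - t * u) - t
busemann_exact = -(x @ u)          # closed form in the Euclidean case

print(busemann_approx, busemann_exact)
```

In curved spaces there is no such closed form, which is exactly why the limit definition (the ray) is the one that survives the trip from flat to curved geometry.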

3. The Strategy: The "Stochastic" and "Incremental" Walks

The paper proposes two ways to use this new tool to find the best spot (the median or mean of a group of points):

  • The Stochastic Method (The Random Picker): Imagine you have a list of friends (data points) and you want to find the spot that minimizes the total walking distance to all of them.

    • Old way: You calculate the distance to everyone at once, then take a step. This is slow and computationally heavy.
    • New way: You close your eyes, pick one friend at random, ask them, "Which way is downhill for me?" and take a small step in that direction. Then you pick another random friend and repeat.
    • Why it works: Even though you are only listening to one person at a time, over thousands of steps, you naturally drift toward the perfect center. It's like finding your way through a dark room by tapping a few random walls; eventually, you find the exit.
  • The Incremental Method (The Round-Robin): Instead of picking randomly, you go through your list of friends one by one in a circle. You listen to Friend A, take a step. Then Friend B, take a step. Then Friend C.

    • This is often faster because you systematically cover all the information without the "noise" of randomness.
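The two walks above can be sketched in a few lines. This is a hedged toy version in flat 2D space, where a "geodesic step toward a friend" is just straight-line interpolation; it is not the paper's implementation, and all names and step-size choices here are our own illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
friends = rng.normal(size=(50, 2))          # the data points a_i

def step_toward(x, a, alpha):
    # Subgradient step for f_i(x) = d(x, a_i): move distance alpha along
    # the geodesic from x toward a_i (a straight line in flat space).
    d = np.linalg.norm(x - a)
    if d == 0:
        return x
    return x + min(alpha, d) * (a - x) / d  # never overshoot a_i

def stochastic_median(points, iters=5000):
    x = points[0].copy()
    for k in range(1, iters + 1):
        a = points[rng.integers(len(points))]  # pick one friend at random
        x = step_toward(x, a, alpha=1.0 / k)   # diminishing step size
    return x

def incremental_median(points, passes=100):
    x = points[0].copy()
    k = 1
    for _ in range(passes):
        for a in points:                       # round-robin through friends
            x = step_toward(x, a, alpha=1.0 / k)
            k += 1
    return x
```

Both walks drift toward the geometric median (the point minimizing total distance to all friends); in a curved space the only change is that `step_toward` would follow the space's geodesics instead of straight lines.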

4. The Real-World Application: The "Tree Space"

The authors test this on BHV tree space, named for Billera, Holmes, and Vogtmann: a map of all possible evolutionary trees (phylogenetic trees).

  • Imagine you have 100 different family trees for a group of species.
  • You want to find the "average" family tree that represents the group best.
  • In this space, "distance" means how much you have to rearrange branches to turn one tree into another.
  • The authors used their new algorithm to find the "median tree" (the one that is, on average, closest to all the others). They showed that their method works just as well as the old, heavy-duty methods but is much simpler to implement and understand.
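BHV space itself takes real machinery to code, but its simplest relative does not: a three-legged "spider" (three half-lines glued at one center point) is also a tree-like Hadamard space, and the round-robin walk runs on it verbatim. The sketch below is our toy illustration under that stand-in geometry, not the paper's implementation, and every name in it is hypothetical:

```python
# Points on a 3-legged "spider" live as (leg, radius): radius 0 is the
# central vertex shared by all legs. This tripod is the simplest
# tree-like Hadamard space, a toy stand-in for BHV tree space.

def tree_dist(p, q):
    (lp, rp), (lq, rq) = p, q
    if lp == lq:
        return abs(rp - rq)            # same leg: walk along it
    return rp + rq                     # different legs: go via the center

def step_toward(x, a, alpha):
    """Move distance alpha along the unique geodesic from x toward a."""
    d = tree_dist(x, a)
    if d == 0:
        return x
    alpha = min(alpha, d)              # never overshoot the target
    (lx, rx), (la, ra) = x, a
    if lx == la:
        return (lx, rx + (alpha if ra > rx else -alpha))
    if alpha <= rx:
        return (lx, rx - alpha)        # still heading toward the center
    return (la, alpha - rx)            # passed the center onto a's leg

def incremental_median(points, passes=200):
    x = points[0]
    k = 1
    for _ in range(passes):
        for a in points:               # round-robin through the trees
            x = step_toward(x, a, 1.0 / k)
            k += 1
    return x

# Two trees deep on leg 0, one on leg 1: the median is at (leg 0, r = 1).
data = [(0, 2.0), (0, 1.0), (1, 1.0)]
leg, r = incremental_median(data)
print(leg, r)
```

The only geometry-specific ingredients are the distance and the geodesic step; swapping in BHV distances and geodesics gives the same walk over genuine phylogenetic trees.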

5. Why This Matters

  • Simplicity: The new method doesn't require complex "dual" math (which is like trying to solve a puzzle by looking at its shadow). It works directly with the geometry of the space.
  • Speed: It guarantees that you will find a good enough answer within a predictable number of steps (complexity bounds), just like the best methods in flat worlds.
  • Versatility: It works not just for trees, but for any curved space, including hyperbolic geometry (used in AI and network analysis) and spaces of positive-definite matrices (used in medical imaging).

The Takeaway

The authors essentially said: "Stop trying to force flat-world math onto curved landscapes. Instead, give the walker a compass (a ray) and a pace (a speed), and let them take small, smart steps one friend at a time. It's simpler, faster, and works everywhere."

They proved that this "Busemann subgradient" approach is the key to unlocking efficient optimization in the strange, beautiful, and curved geometries of the modern mathematical world.