Geodesic Gradient Descent: A Generic and Learning-rate-free Optimizer on Objective Function-induced Manifolds

This paper introduces Geodesic Gradient Descent (GGD), a generic, learning-rate-free optimization algorithm. GGD approximates local neighborhoods of objective-function-induced hypersurfaces with n-dimensional spheres so that update trajectories remain on the manifold, achieving significant performance improvements over Adam on both regression and classification tasks.

Liwei Hu, Guangyao Li, Wenyong Wang, Xiaoming Zhang, Yu Xiang

Published 2026-03-10

Imagine you are trying to find the lowest point in a vast, foggy, and incredibly complex mountain range. This mountain range isn't just a simple hill; it's a twisting, turning, multi-dimensional landscape where the ground itself curves, twists, and folds in ways that are hard to see. In the world of Artificial Intelligence (AI), this "mountain" is the Objective Function, and finding the lowest point (the bottom of the valley) means finding the perfect settings for a neural network to solve a problem.

Here is a simple breakdown of the paper's new method, Geodesic Gradient Descent (GGD), using everyday analogies.

The Problem: Walking on Flat Ground vs. Real Mountains

1. The Old Way (Euclidean Gradient Descent):
Imagine you are a hiker trying to find the bottom of a valley. The traditional method (like the popular "Adam" algorithm) acts like a hiker who ignores the curvature of the earth. They look at the slope right in front of them and take a straight step downhill.

  • The Flaw: Because the mountain is curved, a straight step might take you off a cliff or into a ravine that isn't actually on the path. You are walking in "straight lines" on a curved surface, which often pulls you off track or forces you to take tiny, cautious steps (small learning rates) to avoid falling.
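To make the "straight step downhill" concrete, here is a minimal sketch of plain Euclidean gradient descent on a toy bowl-shaped objective. The function, the learning rates, and the step counts are illustrative choices, not from the paper; the point is how sensitive the method is to the guessed step size.

```python
import numpy as np

def euclidean_gd(grad, w0, lr, steps):
    """Plain gradient descent: a straight step of fixed size lr each time."""
    w = np.asarray(w0, dtype=float)
    for _ in range(steps):
        w = w - lr * grad(w)
    return w

# Toy bowl-shaped objective f(w) = ||w||^2, whose gradient is 2w.
grad = lambda w: 2.0 * w
w_good = euclidean_gd(grad, [3.0, -2.0], lr=0.1, steps=100)  # shrinks toward 0
w_bad  = euclidean_gd(grad, [3.0, -2.0], lr=1.1, steps=100)  # overshoots and blows up
```

Even on this trivially simple "mountain," a learning rate of 0.1 converges while 1.1 diverges; on a real curved loss surface the safe range is far harder to guess.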

2. The "Manifold" Way (Riemannian Gradient Descent):
Smart hikers realized they need to stay on the surface of the mountain. They use a map of the specific shape of the mountain (a "manifold").

  • The Flaw: But this mountain is so weirdly shaped that it doesn't look like a sphere, a cylinder, or a flat plane. It's a unique, messy shape. Trying to describe the whole mountain with just one simple map (a single "classic manifold") is impossible. It's like trying to describe a crumpled piece of paper using only a perfect sphere.
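When the manifold *is* a classic shape, the "map" approach works nicely. Below is a standard Riemannian gradient descent sketch on the unit sphere (not the paper's method): project the ordinary gradient onto the tangent space, step, then retract back onto the sphere. The quadratic objective and learning rate are illustrative.

```python
import numpy as np

def riemannian_gd_step(x, egrad, lr):
    """One Riemannian gradient step on the unit sphere S^{n-1}."""
    g = egrad(x)
    # Project the Euclidean gradient onto the tangent space at x.
    rgrad = g - np.dot(g, x) * x
    # Take the step, then retract back onto the sphere by normalizing.
    y = x - lr * rgrad
    return y / np.linalg.norm(y)

# Minimize f(x) = x^T A x on the unit sphere; the minimizer is the
# eigenvector of A with the smallest eigenvalue (here, the z-axis).
A = np.diag([3.0, 2.0, 1.0])
egrad = lambda x: 2.0 * A @ x
x = np.array([1.0, 1.0, 1.0]) / np.sqrt(3.0)
for _ in range(200):
    x = riemannian_gd_step(x, egrad, lr=0.1)
```

This works because the sphere has a known, simple map. The paper's complaint is that a loss surface of a neural network is not a sphere or any other single classic manifold, so no one global map like this exists.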

The Solution: The "Bubble" Strategy (GGD)

The authors propose a new strategy called Geodesic Gradient Descent (GGD). Instead of trying to map the whole mountain, they use a clever trick: They zoom in and pretend the ground is a perfect ball.

Here is how it works, step-by-step:

1. The Local Bubble (The n-D Sphere)
At every single step of your hike, imagine you place a giant, invisible, transparent bubble (a sphere) under your feet. This bubble is tangent to the mountain, meaning it just touches the ground at your exact spot.

  • Why? Even if the mountain is twisted and ugly, if you zoom in close enough, a tiny patch of it looks like a smooth curve. A sphere is the perfect shape to approximate that tiny patch. This allows the algorithm to handle any shape of mountain, no matter how complex.
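The "bubble" idea is the classical osculating circle/sphere: the circle that matches a curve's value, slope, and curvature at a point. A 2D sketch on a toy curve (my choice of function, not the paper's construction) shows how much better the tangent circle hugs the curve than a straight tangent line does.

```python
import math

# Osculating circle of y = f(x) at x0: radius R = 1/kappa, where
# kappa = |f''| / (1 + f'^2)^{3/2} is the curvature.
f   = lambda x: x**2          # toy "mountain" cross-section
fp  = lambda x: 2.0 * x       # first derivative
fpp = lambda x: 2.0           # second derivative

x0 = 0.5
kappa = abs(fpp(x0)) / (1.0 + fp(x0)**2) ** 1.5
R = 1.0 / kappa

# The circle's center lies along the unit normal, at distance R.
nx, ny = -fp(x0), 1.0
norm = math.hypot(nx, ny)
cx = x0 + R * nx / norm
cy = f(x0) + R * ny / norm

# Compare the curve, the tangent line, and the circle a small step away.
h = 0.05
x1 = x0 + h
y_curve  = f(x1)
y_line   = f(x0) + fp(x0) * h                    # straight-line guess
y_circle = cy - math.sqrt(R**2 - (x1 - cx)**2)   # lower arc of the circle
```

Because the circle matches curvature as well as slope, its error near x0 is an order of magnitude smaller than the tangent line's, which is exactly why a locally fitted sphere is a good stand-in for an arbitrarily messy surface.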

2. The Great Circle Path (The Geodesic)
Once you are inside this bubble, you don't walk in a straight line. Instead, you walk along the Great Circle (the shortest path on a sphere, like the flight path of an airplane).

  • The Analogy: If you are on a globe, the shortest way from New York to London isn't a straight line through the earth; it's a curve along the surface. GGD forces the AI to walk this "Great Circle" path. This ensures the AI never steps off the mountain (the hypersurface) and always follows the true geometry of the terrain.
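Walking a great circle has a clean closed form: starting at a point x on the unit sphere and moving in a tangent direction v, the geodesic is cos(θ)·x + sin(θ)·v. A minimal sketch (standard sphere geometry, not lifted from the paper's code):

```python
import numpy as np

def geodesic_step(x, v, theta):
    """Move along the great circle through x, in tangent direction v, by angle theta.

    x: point on the unit sphere (||x|| = 1)
    v: tangent direction at x (v . x = 0)
    """
    v = v / np.linalg.norm(v)
    return np.cos(theta) * x + np.sin(theta) * v

x = np.array([1.0, 0.0, 0.0])
v = np.array([0.0, 1.0, 0.0])
y = geodesic_step(x, v, np.pi / 2)  # a quarter of the great circle
```

Note that the result always has norm 1: the update never leaves the sphere, which is the whole point of stepping along the geodesic instead of along a straight line.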

3. No More "Learning Rate" (The Automatic Step)
In traditional hiking, you have to guess how big your step should be. If your step is too big, you fall; too small, and you never arrive. This guess is called the "learning rate," and it's a headache to tune.

  • The GGD Magic: Because you are walking on a perfect sphere, there is a natural limit to how far you can go before you start going back up the other side. The authors realized that the maximum safe step is exactly one-quarter of the circle's arc.
  • The Result: You don't need to guess the step size anymore. The geometry of the sphere tells you exactly how far to walk. The algorithm is "learning-rate-free." It just takes the biggest possible step that keeps you on the path.
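The paper's full construction derives the step from the geometry of the fitted sphere; as a simplified sketch of the quarter-arc cap alone, one can bound the step angle at π/2, past which a geodesic on the sphere starts climbing back up the other side. The mapping from gradient magnitude to angle below is my illustrative assumption, not the paper's formula.

```python
import numpy as np

def capped_step_angle(grad_norm, radius):
    """Quarter-arc rule sketch: the step angle on the tangent sphere is
    capped at pi/2 (a quarter of the great circle), the largest move
    that cannot overshoot to the ascending side.

    theta = grad_norm / radius is an assumed heuristic, used here only
    to give the cap something to act on."""
    theta = grad_norm / radius
    return min(theta, np.pi / 2.0)
```

Small gradients pass through unchanged, while any step that would exceed a quarter circle is clipped to exactly π/2, so no hand-tuned learning rate is needed.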

Why Does This Matter? (The Results)

The paper tested this "Bubble Hiker" against the old "Straight-Line Hikers" (like Adam and SGD) on two types of tasks:

  1. Predicting Fluid Flow (Regression): Like predicting how a shockwave moves through a tube.
    • Result: The Bubble Hiker found the solution much faster and more accurately, reducing errors by up to 48% compared to the best existing methods.
  2. Recognizing Handwritten Digits (Classification): Like the famous MNIST dataset.
    • Result: The Bubble Hiker got higher accuracy and lower errors, proving it's better at navigating the complex "mountains" of deep learning.

The Takeaway

Think of Geodesic Gradient Descent as a hiker who stops trying to force the mountain to look like a flat map. Instead, they carry a portable, inflatable bubble. They step inside the bubble, walk the perfect curve along the bubble's surface, and take the maximum possible step allowed by the bubble's size.

This allows them to navigate the most twisted, complex, and confusing terrains in AI without getting lost, without falling off cliffs, and without needing to guess how big their steps should be. It's a smarter, more natural way to train AI.