Bilevel Optimization with Lower-Level Uniform Convexity: Theory and Algorithm

This paper introduces the concept of lower-level uniform convexity as a tractable class for bilevel optimization, establishing a novel implicit differentiation theorem and proposing the UniBiO algorithm with provable convergence guarantees and near-optimal oracle complexity for finding ε-stationary points.

Yuman Wu, Xiaochuan Gong, Jie Hao, Mingrui Liu

Published 2026-03-03

Imagine you are trying to bake the perfect cake. But there's a catch: you don't just pick the ingredients yourself. You have to hire a sous-chef (the Lower Level) to mix the batter, and their mixing skills depend on the recipe you give them (the Upper Level).

Your goal is to find the perfect recipe (Upper Level) that results in the best-tasting cake. However, you can't just guess the recipe; you have to wait for the sous-chef to finish mixing perfectly before you can taste the result and adjust your recipe.

This is Bilevel Optimization. It's a "game of games" used in AI to tune hyperparameters, clean data, and design neural networks.
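In symbols (standard notation from the bilevel literature; the paper's letters may differ), the recipe x and the mixing y form a nested problem:

```latex
\min_{x} \; F\bigl(x, \, y^*(x)\bigr)
\quad \text{subject to} \quad
y^*(x) \in \arg\min_{y} \; g(x, y)
```

Here F is the upper-level objective (how the cake tastes) and g is the lower-level objective (how well the batter is mixed for a given recipe).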

The Problem: The "Goldilocks" Gap

For a long time, researchers assumed the sous-chef was either:

  1. Super Predictable (Strongly Convex): No matter what, they mix the batter in a perfect, smooth bowl. If you change the recipe slightly, their mixing changes smoothly. This is easy to calculate.
  2. Totally Chaotic (General Convex): The mixing bowl is weird. Sometimes the batter sticks, sometimes it slides. If you change the recipe, the mixing might jump around wildly or stop making sense entirely. This is very hard to solve.

Recent research showed that if the sous-chef is in the "Totally Chaotic" category, finding the perfect recipe is practically impossible. But what if they are somewhere in between? What if they are Uniformly Convex?

Think of Uniform Convexity as a bowl that isn't perfectly round (like the predictable one) but isn't jagged either. It's a bowl that gets steeper and steeper as you move away from the center, but the "steepness" follows a specific, slightly curved rule (controlled by a number called p).

  • If p = 2, it's the perfect round bowl (Strongly Convex).
  • If p = 4, 6, or 8, the bowl gets flatter in the middle and steeper on the sides. It's harder to navigate, but not impossible.
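For readers who want the formal version: a function g is commonly called p-uniformly convex with modulus μ > 0 if it curves at least like the p-th power of the distance from any point — this is the standard textbook definition, not necessarily the paper's exact notation:

```latex
g(y') \;\ge\; g(y) + \langle \nabla g(y), \, y' - y \rangle + \frac{\mu}{p}\,\lVert y' - y \rVert^{p}
\quad \text{for all } y, y'
```

Setting p = 2 recovers the usual definition of strong convexity, which matches the "perfect round bowl" case above.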

The Breakthrough: A New Map and a New Strategy

The authors of this paper realized that this "in-between" bowl (Uniform Convexity) is actually solvable, but you can't use the old maps.

1. The New Map (Implicit Differentiation Theorem)
In the old days, to figure out how to change your recipe, you needed to know exactly how the mixing bowl curved at every single point. But in this "in-between" bowl, the curve can get weird (singular), making the old math break down.

The authors invented a new mathematical lens. Instead of looking at the bowl directly, they looked at the bowl through a special filter (raising the mixing variables to a power). This filter smoothed out the weird spots, allowing them to write down a clear formula for how to adjust the recipe, even when the bowl is tricky.
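Here is a tiny, self-contained illustration (our own toy example, not the paper's construction) of where the old math breaks. Take the lower-level objective g(x, y) = |y − x|^p / p with p > 2: its second derivative in y vanishes exactly at the minimizer, so the textbook implicit-differentiation formula dy*/dx = −(g_yy)⁻¹ g_yx divides by zero, even though the solution map y*(x) = x is perfectly smooth.

```python
def g_yy(x, y, p):
    """Second derivative of g(x, y) = |y - x|^p / p with respect to y."""
    return (p - 1) * abs(y - x) ** (p - 2)

def g_yx(x, y, p):
    """Mixed second derivative of g with respect to y and x."""
    return -(p - 1) * abs(y - x) ** (p - 2)

p = 4          # a "flatter bowl" than the strongly convex case p = 2
x = 0.7
y_star = x     # the unique minimizer of g(x, .) is y = x

# Classical implicit differentiation needs g_yy to be invertible at the
# minimizer, but for p > 2 it is exactly zero -- the "weird (singular)" spot.
print(g_yy(x, y_star, p))   # 0.0: the formula dy*/dx = -g_yy^{-1} g_yx fails

# Yet the solution map itself is as smooth as possible: y*(x) = x, dy*/dx = 1.
```

With p = 2 the same expression gives g_yy = 1 everywhere, which is exactly why the strongly convex case never runs into this problem.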

2. The New Strategy (The UniBiO Algorithm)
Once they had the map, they needed a strategy to walk the path.

  • The Old Way: "Check the mixing, adjust the recipe, check the mixing, adjust the recipe." This is slow and expensive because checking the mixing takes a long time.
  • The UniBiO Way: They realized the mixing bowl doesn't change instantly when you tweak the recipe. It moves slowly.
    • So, they told the AI: "Don't check the mixing every single second. Check it every few minutes (Periodic Updates)."
    • In between checks, they use a "momentum" technique (like a skateboarder) to keep moving forward based on the last known good direction, rather than stopping to re-calculate everything.
    • They also use a "shrinking ball" strategy for the mixing: start with a wide search area, and as you get closer to the perfect mix, shrink the area you're looking in to get more precise.
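The ideas above can be sketched in a few lines on a toy problem. This is our own simplified caricature, not the paper's UniBiO pseudocode: we use a strongly convex lower level (p = 2) so the hypergradient is easy to write down, and we keep only the periodic-update and momentum ideas (the shrinking-ball step is omitted).

```python
# Toy bilevel problem (illustrative only, not the paper's experiments):
#   lower level: g(x, y) = 0.5 * (y - x)^2   -> minimizer y*(x) = x
#   upper level: F(x, y) = (y - 1)^2         -> best "recipe" is x = 1

def hypergrad(x, y):
    """Implicit-differentiation hypergradient for this toy problem.
    Here g_yy = 1 and g_yx = -1, so dy*/dx = 1 and dF/dx = dF/dy * dy*/dx."""
    return 2 * (y - 1)

x, y, m = 5.0, 0.0, 0.0
lr, beta, K = 0.02, 0.9, 5      # K = period between lower-level refreshes

for t in range(2000):
    if t % K == 0:              # "check the mixing" only periodically
        for _ in range(10):     # a few gradient steps on the lower level
            y -= 0.5 * (y - x)  # gradient of g with respect to y is (y - x)
    h = hypergrad(x, y)
    m = beta * m + (1 - beta) * h   # momentum keeps moving between checks
    x -= lr * m

print(x)  # x has converged near the optimum x = 1
```

Even though the lower-level solution is only refreshed every K steps and is slightly stale in between, the momentum average keeps the upper-level iterate moving in a good direction, which is the intuition behind the periodic-update design.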

The Results: Speed and Accuracy

The paper proves that this new strategy works.

  • The Cost: The time it takes to find the perfect recipe depends on how "weird" the bowl is (the value of p).
    • If the bowl is perfect (p = 2), it's very fast.
    • If the bowl is weird (p = 8), it takes longer, but it's still guaranteed to finish in a reasonable amount of time (polynomial time), unlike the chaotic cases which might never finish.
  • The Proof: They tested this on fake math problems and a real-world task called Data Hypercleaning.
    • The Real-World Task: Imagine you have a messy dataset where some labels are wrong (like a photo of a cat labeled "dog"). You want to teach an AI to ignore the bad labels.
    • The Result: Their new algorithm (UniBiO) cleaned the data and trained the AI better and faster than all the previous methods, especially when the math was "weird" (high p).
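Data hypercleaning has a common bilevel formulation (this is the standard version from the bilevel literature; the paper's exact setup may differ): the upper level learns one weight w_i per training sample, judged on a clean validation set, while the lower level trains the model on the weighted training loss:

```latex
\min_{w} \; \sum_{j \in \text{val}} \ell\bigl(\theta^*(w); \, \text{val}_j\bigr)
\qquad \text{where} \qquad
\theta^*(w) \in \arg\min_{\theta} \; \sum_{i \in \text{train}} \sigma(w_i)\,\ell\bigl(\theta; \, \text{train}_i\bigr)
```

Here σ is a sigmoid squashing each weight into [0, 1]: a mislabeled sample ends up with σ(w_i) near zero, so the model effectively learns to ignore it.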

The Big Picture

Think of this paper as finding a new way to navigate a hilly landscape.

  • Old View: "If the hills are too flat or too jagged, we can't drive a car."
  • New View: "Actually, even if the hills are a bit weird (Uniformly Convex), we can still drive if we use a special suspension system (the new math) and drive in a smart pattern (periodic updates) instead of stopping at every bump."

This opens the door for AI to solve much harder, more realistic problems that were previously thought to be too difficult to optimize.
