Imagine you have a very sophisticated, high-tech coffee machine. You spent weeks programming it to brew the perfect cup of coffee for your specific taste: a little bit of sugar, a specific temperature, and a precise amount of milk. You can think of these settings as your "hyperparameters."
Now, imagine you move to a new city. Suddenly, the water quality is different, or your new roommate prefers their coffee black. Your old settings are no longer perfect.
The Old Way (Retraining): To fix this, you would have to take the machine apart, reprogram the entire circuit board, and run hundreds of test batches to find the new perfect settings. This takes days, costs a fortune, and is a huge hassle.
The New Way (This Paper's Solution): What if you could just turn a dial, and the machine instantly knew how to adjust the coffee without being rebuilt? You want it stronger? Turn the dial. You want it smoother? Turn it the other way. The machine "knows" the path between "weak" and "strong" coffee because it learned the trajectory of how the coffee changes, not just the start and end points.
This paper introduces a method called Hyperparameter Trajectory Inference (HTI). It's like teaching a neural network (a type of AI) to predict how its own behavior changes as you tweak its settings, so you can adjust it on the fly without expensive retraining.
The Core Problem: The "Black Box" Gap
Neural networks are like black boxes. You put data in, and you get an answer out. But the answer depends heavily on the "knobs" you turned before training (the hyperparameters).
- In Reinforcement Learning (like training an AI to play a video game or manage cancer treatment), one knob might decide how much the AI cares about "winning" vs. "being safe."
- In Regression (predicting numbers), a knob might decide if the AI should be very cautious (predicting a wide range of possibilities) or very confident (predicting a single number).
Usually, if you want to change that knob after the AI is built, you have to throw the whole thing away and start over. This paper says: "No, let's build a Surrogate Model." Think of this as a "GPS for AI behavior." Instead of just knowing where you are (the current settings) and where you want to go (the new settings), the GPS knows the entire road between them.
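The amortization trick behind a surrogate can be sketched in a few lines. Everything below is illustrative: `expensive_training_run` is a stand-in for an actual (slow) training run, and the linear fit is a stand-in for HTI's learned neural surrogate — the point is only that a handful of expensive runs can be distilled into a model you can query instantly at any new setting.

```python
import numpy as np

# Stand-in for a full training run at hyperparameter `lam`.
# Here it returns a toy [safety, reward] trade-off; in reality
# each call would cost hours or days of compute.
def expensive_training_run(lam):
    return np.array([1.0 - lam, lam])

# Pay the expensive price at a few settings only...
lams = np.linspace(0.0, 1.0, 5)
behaviors = np.stack([expensive_training_run(l) for l in lams])

# ...then fit a cheap surrogate (a linear fit per output dimension).
coeffs = np.polyfit(lams, behaviors, deg=1)  # shape (2, n_outputs)

def surrogate(lam):
    """Query any new setting instantly, with no retraining."""
    return coeffs[1] + coeffs[0] * lam  # intercept + slope * lam

print(surrogate(0.3))  # close to [0.7, 0.3]
```

The surrogate answers in microseconds what the "old way" would answer with another full training run.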
The Secret Sauce: "Optimal Transport" and "Least Action"
How does the AI know the road between two settings? The authors use a fancy mathematical concept called Optimal Transport.
The Analogy: Moving a Mountain of Sand
Imagine you have a pile of sand (representing the AI's current behavior) and you want to move it to a new shape (the AI's new behavior).
- Simple Way: Just grab a shovel and throw sand randomly until it looks right. This is messy and inefficient.
- Optimal Transport: You want to move the sand using the least amount of energy possible. You don't just move it; you move it along the smoothest, most efficient path.
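In one dimension, the "least energy" transport plan has a famously simple form: sort both sand piles and match them grain for grain. A minimal sketch (the function name and toy data are mine, not the paper's):

```python
import numpy as np

def w1_distance(a, b):
    """1-D Wasserstein-1 distance: in one dimension, the optimal
    transport plan is simply to match sorted samples, so the cost
    is the mean gap between the sorted piles."""
    return np.mean(np.abs(np.sort(a) - np.sort(b)))

rng = np.random.default_rng(0)
pile = rng.normal(0.0, 1.0, 1000)   # the current "sand pile"
target = pile + 3.0                 # the same shape, shifted 3 units over

cost = w1_distance(pile, target)
print(round(cost, 2))  # ~3.0: every grain moves 3 units, no wasted shoveling
```

The "shovel sand randomly" strategy would pay far more: it might carry a grain from the far left all the way to the far right while another grain makes the opposite trip.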
The paper adds a twist: Conditional Lagrangian Optimal Transport.
- Conditional: The path depends on the "context" (like the user's specific needs or the input data).
- Lagrangian (Least Action): This is a physics concept. It's like a ball rolling down a hill. The ball doesn't just fall straight down; it follows the path of least resistance, shaped by the terrain (the data).
The authors teach the AI to learn the "terrain" (the landscape of possible behaviors) and the "physics" (how the behavior naturally flows from one setting to another). They do this by learning two things:
- The Map (Metric): How "far" apart two behaviors are in the AI's mind.
- The Gravity (Potential Energy): Where the "dense" data is. The AI is biased to stay on the "highways" where data exists, rather than wandering off into the "desert" where it has no idea what to do.
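The two ingredients above — a cost for moving and a "gravity" that keeps the path on the data highways — can be combined into a toy least-action problem. The sketch below is purely illustrative (2-D points stand in for model behaviors, and a hand-picked unit circle stands in for the data manifold); it is not the paper's actual training objective, just the physics intuition in code:

```python
import numpy as np

# Toy "behavior space": data lives near the unit circle, so the
# potential V penalizes leaving it. The least-action path between
# two endpoints then arcs along the data "highway" instead of
# cutting straight through the empty "desert" in the middle.

def V(x):
    """Potential: squared distance from the unit circle."""
    r = np.linalg.norm(x, axis=-1)
    return (r - 1.0) ** 2

def action(path):
    """Kinetic energy (short, smooth steps) + potential along the path."""
    kinetic = np.sum(np.diff(path, axis=0) ** 2)
    return kinetic + 0.5 * np.sum(V(path))

# Discretized path with fixed endpoints, minimized by finite differences.
T = 21
path = np.linspace([-1.0, 0.0], [1.0, 0.0], T)  # straight-line start
path[1:-1, 1] += 0.1                            # small nudge off the axis

lr, eps = 0.05, 1e-5
for _ in range(500):
    grad = np.zeros_like(path)
    for t in range(1, T - 1):
        for d in range(2):
            p = path.copy()
            p[t, d] += eps
            grad[t, d] = (action(p) - action(path)) / eps
    path[1:-1] -= lr * grad[1:-1]

# The midpoint has been pushed out toward the circle (radius near 1),
# not through the data-free origin:
print(np.linalg.norm(path[T // 2]))
```

The straight line is shorter, but it crosses terrain where the potential is high; the optimizer trades a little extra distance for staying where the data lives — exactly the bias the authors build into the learned trajectory.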
Real-World Examples from the Paper
1. Personalized Cancer Treatment
- The Scenario: An AI helps doctors decide how much chemotherapy to give. One setting might prioritize killing the tumor, while another prioritizes keeping the patient's immune system strong.
- The Problem: Every patient is different. A young, healthy patient might need a "kill the tumor" setting, while an elderly patient needs a "protect the immune system" setting.
- The HTI Solution: Instead of training a new AI for every patient, the doctors use the HTI model. They just slide a "slider" to adjust the balance between tumor-killing and immune-protection. The AI instantly generates the perfect treatment plan for that specific patient, saving hours of computing time and potentially saving lives.
2. Predicting the Future (Quantile Regression)
- The Scenario: Predicting the weather or stock prices. You might want to know: "What value will the temperature stay below 90% of the time?" (a high bar) vs. "What value will it stay above 90% of the time?" (a low bar).
- The Problem: Usually, you have to train a separate AI for every single percentage point (1%, 2%, ... 99%). That's 99 different models!
- The HTI Solution: Train the AI on just the extremes (1% and 99%). The HTI model then "fills in the blanks," allowing you to ask for any percentage in between instantly. It's like having a single model that can predict the entire spectrum of uncertainty.
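To make "fill in the blanks" concrete, here is a hedged sketch: fit only two extreme quantiles with the standard pinball (quantile) loss, then blend linearly for any level in between. The linear blend is a crude stand-in for HTI's learned trajectory, and the constant predictors are toy models — but the economics are the point: two fits instead of ninety-nine.

```python
import numpy as np

rng = np.random.default_rng(1)
y = rng.normal(10.0, 2.0, 5000)  # observations (toy data)

def fit_quantile(y, tau, lr=0.05, steps=2000):
    """Fit a constant predictor q by gradient descent on the pinball
    loss; its minimizer is the tau-quantile of y."""
    q = np.mean(y)
    for _ in range(steps):
        # Gradient of mean pinball loss w.r.t. q:
        grad = np.mean(np.where(y > q, -tau, 1.0 - tau))
        q -= lr * grad
    return q

# Train only the extremes...
q_lo, q_hi = fit_quantile(y, 0.1), fit_quantile(y, 0.9)

def blended_quantile(tau):
    """...then interpolate for any level in between (a stand-in
    for the learned trajectory)."""
    w = (tau - 0.1) / (0.9 - 0.1)
    return (1 - w) * q_lo + w * q_hi

print(q_lo, q_hi, blended_quantile(0.5))  # blend at 0.5 lands near the median
```

For this symmetric toy distribution the midpoint of the two extremes happens to hit the median; on real, skewed data a straight line would miss, which is why learning the actual path between the extremes matters.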
3. Robotic Arms
- The Scenario: A robot arm needs to move a cup. Sometimes it needs to move fast (risky but quick), sometimes slow (safe but slow).
- The HTI Solution: The robot can instantly adjust its "caution level" based on whether it's holding a fragile wine glass or a sturdy brick, without needing to relearn how to move.
Why This Matters
This paper is a game-changer because it moves AI from being static (fixed once trained) to dynamic (adaptable on the fly).
- Speed: It turns days of retraining into seconds of calculation.
- Flexibility: It allows users to tweak AI behavior to fit changing real-world needs (like a sudden change in weather or a new patient's condition).
- Efficiency: It saves massive amounts of computer power and money.
In short, the authors have built a "universal remote control" for neural networks. Instead of buying a new TV (retraining the model) every time you want to change the channel (the hyperparameter), you just press a button, and the TV instantly switches to the perfect setting.