Information-Theoretic Bayesian Optimization for Bilevel Optimization Problems

Imagine you are trying to design the perfect new smartphone.

You have two goals that are tangled together:

The Upper Goal (The Boss): You want the phone to have the best possible battery life and camera quality.
The Lower Goal (The Worker): But, before you can judge the battery or camera, you first have to figure out the perfect internal wiring that makes those features work. If the wiring is bad, the battery and camera don't matter.

This is a Bilevel Optimization Problem. It's like a "Boss" trying to make a decision, but the Boss can only make a good decision if a "Worker" first solves a complex problem perfectly.

The Problem: It's Expensive and Slow

In the real world, testing these phones isn't free.

The Boss's test: Building a prototype costs $10,000.
The Worker's test: Simulating the wiring inside a supercomputer takes 10 hours and costs $5,000.

You can't just try a million different designs. You have a limited budget and time. This is where Bayesian Optimization (BO) comes in. It's like a smart, cautious explorer who uses a map to guess where the treasure is, so they don't waste steps.

However, most existing "smart explorers" only looked at the Boss's map. They assumed the Worker's job was easy and free. But in this paper, the authors say: "Wait, the Worker's job is also expensive and hard!"

The Solution: The "Information Detective" (BLJES)

The authors propose a new method called BLJES (Bilevel optimization via Lower-bound based Joint Entropy Search).

To understand how it works, let's use a Detective Analogy.

1. The Old Way (The "Guess and Check" Detective)

Old methods were like detectives who only cared about solving the final crime (the Boss's goal). They would ask the Worker, "What's the best wiring?" 100 times for every single clue they found.

Result: They wasted a lot of money on the Worker, or they made bad guesses because they didn't understand how the Worker's job influenced the final result.

2. The New Way (The "Information Detective")

The new method, BLJES, is a detective who asks a different question: "Which single test will teach me the most about both the final crime AND the wiring?"

Instead of just looking for the "best" phone, the detective looks for the "most informative" phone.

If testing a specific wiring design teaches us a lot about how to fix the battery and the camera, that's a high-value test.
If testing a design only tells us something we already know, we skip it.

How Do They Do It? (The Magic Tricks)

The paper uses some heavy math, but here are the two main "magic tricks" they use to make this work:

Trick #1: The "What If" Simulation (The Truncation)
Imagine you are guessing the winner of a race.

Normal thinking: "Who is the fastest runner?"
BLJES thinking: "If I knew the winner was exactly Runner A, what would that tell me about the other runners?"

The authors use a mathematical trick called Truncation. They pretend they already know the answer (the optimal solution) and ask, "If this were the answer, what would the data look like?" By comparing their current guess to this "perfect answer" scenario, they can estimate how much "information" they would gain by running a new test.

Trick #2: The "Shadow Puppet" (Random Fourier Features)
Calculating the "perfect answer" is computationally impossible because there are too many variables. It's like trying to simulate every single atom in a phone.

The Fix: They use Random Fourier Features. Think of this as creating a "shadow puppet" of the complex problem. Instead of simulating the real, heavy 3D object, they simulate a simplified 2D shadow that moves in almost the same way. This allows them to run thousands of "What If" simulations in their head (on the computer) very quickly to find the best move.

The Result

The authors tested this new detective (BLJES) against old methods on various problems, from designing chemical reactions to optimizing energy markets.

The Outcome: The new method matched or beat the older methods, typically reaching good solutions with fewer expensive tests. It was able to balance the needs of the "Boss" and the "Worker" simultaneously, rather than treating them as separate problems.

Summary

The Problem: Solving a two-layered puzzle where both layers are expensive to test.
The Mistake: Old methods ignored the cost of the inner layer.
The Fix: A new method (BLJES) that treats the whole puzzle as one big information game. It asks, "Where should I look next to learn the most about the whole system?"
The Analogy: Instead of just trying to win the race, the detective figures out which practice run will teach them the most about both the track conditions and the runner's shoes, ensuring every step counts.

1. Problem Definition

The paper addresses bilevel optimization in which both the upper-level and lower-level objectives are expensive black-box functions.

Formulation: The problem is formulated as:
$\max_{x \in \mathcal{X}} f(x, \theta^*(x)) \quad \text{s.t.} \quad \theta^*(x) = \arg\max_{\theta \in \Theta} g(x, \theta)$
where $f$ is the upper-level objective, $g$ is the lower-level objective, $x$ are upper-level variables, and $\theta$ are lower-level variables. The lower-level optimum $\theta^*(x)$ acts as a constraint for the upper level.
Challenges:
- Nested Structure: Evaluating the upper-level objective requires solving the lower-level optimization problem, creating a computationally expensive nested loop.
- Black-Box Nature: Both $f$ and $g$ are expensive to evaluate (e.g., via simulations like quantum-mechanical calculations or material design), and gradients are unavailable.
- Limitations of Existing Methods:
  - Standard BO approaches often assume the lower level is cheap or differentiable (requiring repeated lower-level queries or gradient approximations).
  - Existing bilevel BO methods (e.g., BILBO) rely on GP-UCB, which requires tuning exploration-exploitation parameters and lacks a unified information-theoretic criterion for both levels.
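
To make the nested structure concrete, here is a minimal Python sketch on a toy problem. The functions `f` and `g`, the grids, and the brute-force solver are all hypothetical choices made for illustration and are not taken from the paper. With expensive black-box simulators, the nested loop below is the part that becomes unaffordable.

```python
import numpy as np

def g(x, theta):
    # Hypothetical lower-level objective, maximized at theta = x**2
    return -(theta - x ** 2) ** 2

def f(x, theta):
    # Hypothetical upper-level objective
    return -(x - 0.5) ** 2 - (theta - 0.5) ** 2

def lower_level_solution(x, theta_grid):
    # theta*(x) = argmax_theta g(x, theta), solved by brute force on a grid
    return theta_grid[np.argmax(g(x, theta_grid))]

def solve_bilevel(x_grid, theta_grid):
    # Nested loop: one full lower-level solve for every upper-level candidate
    best_x, best_theta, best_value = None, None, -np.inf
    for x in x_grid:
        theta_star = lower_level_solution(x, theta_grid)
        value = f(x, theta_star)
        if value > best_value:
            best_x, best_theta, best_value = x, theta_star, value
    return best_x, best_theta, best_value

x_grid = np.linspace(0.0, 1.0, 101)
theta_grid = np.linspace(0.0, 1.0, 101)
x_best, theta_best, f_best = solve_bilevel(x_grid, theta_grid)

# Number of lower-level evaluations spent by the nested search
n_lower_evals = len(x_grid) * len(theta_grid)
```

Every upper-level candidate triggers a full sweep of the lower level, so even this toy run needs 101 × 101 evaluations of `g`.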

2. Methodology: BLJES

The authors propose BLJES (Bilevel optimization via Lower-bound based Joint Entropy Search). This is an information-theoretic approach that seeks to maximize the information gain regarding the optimal solutions and values of both levels simultaneously.

Core Concept: Bilevel Information Gain

The method defines the acquisition function based on the Mutual Information (MI) between the candidate observations $(y_f, y_g)$ and the set of optimal variables/values $o^* = \{x^*, \theta^*, f^*, g^*\}$:
$\text{MI}(y_f, y_g; o^* | \mathcal{D}_t)$
Direct calculation of this MI is intractable. Therefore, the authors derive a Variational Lower Bound (LB).

Key Technical Derivations

Variational Lower Bound:
Using the non-negativity of KL divergence, the MI is lower-bounded by:
$\text{LB}(x, \theta) = \mathbb{E}_{\Omega} \left[ \log \frac{q(y_f, y_g | o^*, \mathcal{D}_t)}{p(y_f, y_g | \mathcal{D}_t)} \right]$
where $\Omega = \{y_f, y_g, f^*, g^*, x^*, \theta^*\}$.
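The step from the MI to this bound can be written out as a standard decomposition. This is a sketch of the usual variational argument, not a reproduction of the paper's proof. As shorthand introduced here for brevity, write $y = (y_f, y_g)$ and leave the conditioning on $\mathcal{D}_t$ implicit:
$\text{MI}(y; o^*) = \mathbb{E}_{\Omega} \left[ \log \frac{p(y | o^*)}{p(y)} \right] = \mathbb{E}_{\Omega} \left[ \log \frac{q(y | o^*)}{p(y)} \right] + \mathbb{E}_{o^*} \left[ \mathrm{KL}\big( p(y | o^*) \,\|\, q(y | o^*) \big) \right] \geq \mathbb{E}_{\Omega} \left[ \log \frac{q(y | o^*)}{p(y)} \right]$
The inequality holds because the KL term is non-negative, and the bound is tight when $q(y | o^*) = p(y | o^*)$.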
Truncation-Based Approximation:
To make the variational distribution $q$ tractable, the authors extend the truncation-based approach (common in Max-value Entropy Search) to the bilevel setting. They approximate the conditioning on the global optimum by imposing constraints only on the current query point:
- Condition 1: $f(x, \theta^*(x)) \leq f^*$ (the upper-level value at the query point cannot exceed the optimal value $f^*$).
- Condition 2: $g(x^*, \theta) \leq g^*$ (the lower-level value at $x^*$ cannot exceed the optimal value $g^*$).
- Condition 3: The dataset is augmented with the "optimal" point $(x^*, \theta^*, f^*, g^*)$ as a noiseless observation.
Analytical Formulation:
The paper proves (Theorem 3.1) that under Gaussian Process (GP) priors, the conditional distributions required for the lower bound can be derived analytically using Truncated Normal Distributions. Specifically, the ratio of probabilities involves the standard normal PDF ($\phi$) and CDF ($\Phi$).
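
The truncated-normal building block can be illustrated in one dimension. The sketch below is an assumption-laden simplification, not the paper's Theorem 3.1: it takes a single noiseless Gaussian prediction with mean `mu` and standard deviation `sigma`, truncates it at a known upper bound `upper` (playing the role of $f^*$), and evaluates the resulting density and log-ratio using $\phi$ and $\Phi$. All function names are hypothetical.

```python
import math

def phi(z):
    """Standard normal PDF."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def Phi(z):
    """Standard normal CDF, written with the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def truncated_normal_pdf(y, mu, sigma, upper):
    """Density of N(mu, sigma^2) conditioned on the event y <= upper."""
    if y > upper:
        return 0.0
    return phi((y - mu) / sigma) / (sigma * Phi((upper - mu) / sigma))

def log_density_ratio(y, mu, sigma, upper):
    """log q(y) - log p(y), with q the truncated and p the untruncated density."""
    if y > upper:
        return -math.inf
    # The phi terms cancel, leaving only the normalizing constant of the truncation.
    return -math.log(Phi((upper - mu) / sigma))

mu, sigma, f_star = 0.0, 1.0, 1.0

# Midpoint-rule check that the truncated density integrates to one
n = 20000
lower = mu - 10.0 * sigma
h = (f_star - lower) / n
area = h * sum(
    truncated_normal_pdf(lower + (i + 0.5) * h, mu, sigma, f_star) for i in range(n)
)

ratio = log_density_ratio(0.3, mu, sigma, f_star)
```

In this simplified case the log-ratio does not depend on `y` below the bound, which is why only $\Phi$ survives. The bilevel expressions in the paper combine several such conditions and are more involved.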
Computation via Monte Carlo:
- Sampling $\Omega$: Since $x^*$ and $\theta^*$ are random variables defined by the GP posteriors, the authors use Random Fourier Features (RFF) to approximate the GPs as Bayesian linear models. This allows for efficient sampling of function paths $\tilde{f}$ and $\tilde{g}$.
- Solving Inner Loop: The inner bilevel optimization on the sampled paths (finding $\tilde{\theta}^*(x) = \arg\max_{\theta \in \Theta} \tilde{g}(x, \theta)$) is treated as a white-box problem. Since RFF approximations are differentiable, the Implicit Function Theorem is used to compute gradients $\partial \tilde{f} / \partial x$ and $\partial \tilde{\theta}^* / \partial x$, enabling gradient-based optimization for the acquisition function.
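
The Implicit Function Theorem step can be sketched on a toy pair of smooth functions. In the sketch below, the objectives `f` and `g` are hypothetical stand-ins for the sampled paths $\tilde{f}$ and $\tilde{g}$, with derivatives written by hand. It is meant only to show how $\partial \theta^* / \partial x$ feeds into the gradient of the upper-level objective, not how the authors implement it.

```python
import math

def g(x, theta):
    # Hypothetical smooth lower-level objective; its maximizer is theta = sin(x) / 2
    return -theta ** 2 + theta * math.sin(x)

def f(x, theta):
    # Hypothetical smooth upper-level objective
    return -(x - 1.0) ** 2 - (theta - 0.3) ** 2

def theta_star(x, n_steps=20):
    # Newton's method on the stationarity condition dg/dtheta = 0
    theta = 0.0
    for _ in range(n_steps):
        g_t = -2.0 * theta + math.sin(x)   # dg/dtheta
        g_tt = -2.0                        # d2g/dtheta2
        theta -= g_t / g_tt
    return theta

def dtheta_star_dx(x):
    # Implicit Function Theorem: dtheta*/dx = -(d2g/dtheta2)^(-1) * d2g/(dtheta dx)
    g_tt = -2.0
    g_tx = math.cos(x)
    return -g_tx / g_tt

def total_gradient(x):
    # Chain rule through the lower-level solution:
    # dF/dx = df/dx + df/dtheta * dtheta*/dx, evaluated at theta = theta*(x)
    th = theta_star(x)
    f_x = -2.0 * (x - 1.0)
    f_t = -2.0 * (th - 0.3)
    return f_x + f_t * dtheta_star_dx(x)

x0, h = 0.7, 1e-5
ift_grad = total_gradient(x0)

# Central finite difference through the re-solved lower level, as a sanity check
fd_grad = (f(x0 + h, theta_star(x0 + h)) - f(x0 - h, theta_star(x0 - h))) / (2.0 * h)
```

The finite-difference value re-solves the lower level twice per gradient, whereas the implicit gradient reuses a single solve, which is the practical appeal of this route when the sampled paths are differentiable.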

Extensions

Decoupled Setting: The framework is extended to scenarios where upper and lower observations can be obtained separately (not simultaneously). The acquisition function is adapted to select either $y_f$ or $y_g$ based on which provides higher information gain.
Constrained Problems: The method is extended to handle inequality constraints ($c_U \geq 0, c_L \geq 0$) at both levels by incorporating constraint satisfaction into the variational distribution definition.

3. Key Contributions

First Information-Theoretic Formulation for Bilevel BO: The paper introduces the first framework that applies information-theoretic criteria (Joint Entropy Search) to bilevel problems, defining a unified "bilevel information gain."
Lower-Bound Approximation: It derives a computationally tractable lower bound for bilevel mutual information by extending truncation-based approximations, avoiding the need for expensive nested optimization loops during the acquisition step.
Unified Criterion: Unlike previous methods that treat levels separately or rely on heuristic combinations, BLJES simultaneously evaluates the benefit of reducing uncertainty in both the optimal solution $(x^*, \theta^*)$ and optimal values $(f^*, g^*)$.
Practical Extensions: The framework naturally handles decoupled observations (separate simulators) and constrained bilevel problems.

4. Experimental Results

The authors evaluated BLJES against Random Search and BILBO (a state-of-the-art GP-UCB based bilevel BO method) across various settings:

GP Sample Paths: Tested on synthetic functions with varying length scales ($\ell$). BLJES consistently outperformed BILBO and Random, showing faster regret reduction.
Benchmark Problems: Evaluated on standard bilevel benchmarks (BG, SB, SMD series). BLJES achieved superior or comparable performance, particularly in complex landscapes where BILBO struggled with parameter tuning.
Real-World Applications:
- Energy Market: Simulator-based optimization.
- Chemical Engineering: Mass flow optimization.
- Materials Science: Optimization of crystal structures ($\mathrm{Fe}_x\mathrm{Ni}_y\mathrm{Cr}_z$) under stability constraints. BLJES successfully navigated the high-dimensional discrete/continuous search space.
Decoupled & Constrained Settings: The method demonstrated robustness in decoupled observation scenarios and effectively handled inequality constraints.
Ablation Studies:
- Truncation: Removing the truncation conditions significantly degraded performance, indicating that the proposed approximation is necessary.
- Sampling (K): Performance was stable across different numbers of Monte Carlo samples ($K=10$ to $K=50$), with $K=30$ being sufficient.
- RFF: The use of Random Fourier Features introduced negligible error compared to exact GP sampling.
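
As a rough illustration of the RFF construction referred to here and in the sampling step of Section 2, the sketch below builds random features, compares their inner products with an exact kernel, and draws one posterior sample path from the resulting Bayesian linear model. It assumes an RBF kernel and a standard-normal prior on the weights, neither of which is stated in this summary. The function names, toy data, and feature count are hypothetical, and the sketch does not reproduce the paper's ablation.

```python
import numpy as np

def make_rff(dim, n_features, lengthscale, rng):
    # Spectral samples for the RBF kernel k(z, z') = exp(-||z - z'||^2 / (2 l^2))
    W = rng.normal(0.0, 1.0 / lengthscale, size=(n_features, dim))
    b = rng.uniform(0.0, 2.0 * np.pi, size=n_features)

    def features(Z):
        # Z has shape (n, dim); the result has shape (n, n_features)
        return np.sqrt(2.0 / n_features) * np.cos(Z @ W.T + b)

    return features

def sample_posterior_path(features, Z_train, y_train, noise_var, rng):
    # Bayesian linear regression on the features with prior w ~ N(0, I)
    Phi = features(Z_train)
    A = Phi.T @ Phi / noise_var + np.eye(Phi.shape[1])    # posterior precision
    mean = np.linalg.solve(A, Phi.T @ y_train / noise_var)
    L = np.linalg.cholesky(A)                             # A = L @ L.T
    eps = rng.standard_normal(Phi.shape[1])
    w = mean + np.linalg.solve(L.T, eps)                  # covariance of w is inv(A)
    return lambda Z: features(Z) @ w

rng = np.random.default_rng(0)
features = make_rff(dim=2, n_features=2000, lengthscale=1.0, rng=rng)

# Joint inputs z = (x, theta) for a toy lower-level function
Z = rng.uniform(-1.0, 1.0, size=(20, 2))
sq_dists = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(axis=-1)
K_exact = np.exp(-0.5 * sq_dists)
K_rff = features(Z) @ features(Z).T
max_kernel_error = np.abs(K_exact - K_rff).max()

# One sampled path, conditioned on toy observations
y = np.sin(3.0 * Z[:, 0]) + Z[:, 1]
g_tilde = sample_posterior_path(features, Z, y, noise_var=1e-2, rng=rng)
values = g_tilde(Z)
```

The sampled `g_tilde` is an ordinary differentiable function of its input, which is what makes it possible to treat the sampled bilevel problem as white-box. The kernel error shrinks as `n_features` grows, at the cost of a larger linear system.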

5. Significance

This paper represents a significant advancement in the field of Bayesian Optimization for hierarchical problems.

Theoretical Impact: It bridges the gap between information-theoretic BO (which has been successful in single-level and multi-objective settings) and bilevel optimization, providing a rigorous theoretical foundation for handling nested black-box constraints.
Practical Impact: By eliminating the need for repeated lower-level queries or gradient assumptions, BLJES is applicable to real-world scientific and engineering problems where simulations are expensive and gradients are unavailable (e.g., computational materials design, chemical reaction optimization).
Future Direction: It opens new avenues for solving complex hierarchical decision-making problems where traditional gradient-based bilevel methods fail due to the black-box nature of the underlying simulators.

Limitations noted by authors: Theoretical regret bounds for information-theoretic bilevel BO remain an open problem, and high-dimensional settings ($>10$ dimensions) still pose challenges common to all BO methods.
