Online Learning for Multi-Layer Hierarchical Inference under Partial and Policy-Dependent Feedback

Imagine you run a massive, multi-level customer service center for a giant tech company. You have thousands of incoming requests (jobs) every minute, ranging from simple questions like "What's the weather?" to incredibly complex problems like "Analyze this 50-page legal contract and write a poem about it."

Your goal is to answer every question correctly while spending as little money and time as possible.

The Setup: The Hierarchy of Experts

Your center is built like a pyramid with many floors:

Floor 1 (The Edge): These are your entry-level interns. They are fast, cheap, and work locally. They can handle simple questions easily but often get stuck on hard ones.
Middle Floors: These are senior specialists. They are smarter but cost more to keep on staff.
The Top Floor (The Oracle): This is the "God-tier" expert (like a supercomputer in the cloud or a human genius). They can solve anything perfectly, but they are incredibly expensive and slow to reach.

The Challenge: When a request comes in, you have to decide immediately: Do I let the intern try to solve it, or do I pass it up to a senior specialist?

If the intern solves it, great! You saved money. If the intern cannot solve it, the request should be passed up. But here's the catch: you only learn whether the intern's answer was right if the request reaches the very top floor.

The Problem: The "Black Box" Feedback

In most learning systems, if an intern makes a mistake, you get an instant "Wrong!" signal and can fix their training.

In this paper's scenario, the feedback is delayed and rare.

If the intern answers a question locally, you never find out whether the answer was right, because only the top floor can check it.
If you send a hard question up to the top, you get a "Correct!" signal, but that signal has to travel all the way back down through every floor to reach the original intern.
The deeper the request goes, the harder it is to get feedback. If a request gets stuck in the middle, you might never know if the decision to send it there was good or bad.

This creates a "partial feedback" problem. The system is like a gambler playing a slot machine where the lights only turn on if you win the jackpot, and even then, the signal takes a long time to get back to the lever.

The Old Way: The "Naive" Approach

Previous methods tried to learn by saying: "If I sent a request up and got a 'Correct' signal, I'll give huge credit to the decision to send it up!"

They used a mathematical trick called Importance Weighting. Since getting a signal from the top floor is rare, they multiplied the reward by a huge number to make up for the rarity.

The Flaw: This is like trying to balance a house of cards in a hurricane. Because the signals are so rare, the multipliers have to be enormous. When a signal does arrive, the estimates swing wildly; when none arrives, nothing is learned. As the building gets taller (more layers), the signals get rarer, and the math becomes so unstable that the system learns little or nothing.

The Solution: VR-Ly-EXP4 (The Smart Manager)

The authors propose a new algorithm called VR-Ly-EXP4. Think of it as a brilliant, calm manager who uses two main tools to fix the chaos:

1. The "Variance Reduction" (The Baseline)

Instead of waiting for a signal from the top to judge every single decision, the manager keeps a running average of what usually happens.

Analogy: Imagine you are guessing the weather. Instead of waiting for a satellite report from space (which takes days), you look at the barometer on your wall (the baseline).
The algorithm says: "I expect this intern to get 80% of these questions right based on history. If they get one right, I don't give them a massive bonus; I just give them a tiny nudge because I already expected it."
This removes the "noise." The system stops swinging wildly and learns steadily, even when feedback is rare.

2. The "Lyapunov Optimization" (The Budget Keeper)

The system has a strict budget. You can't send every request to the top floor, or you'll go bankrupt.

Analogy: Imagine the manager has a "Debt Meter." Every time they send a request up, the meter goes up. If the meter gets too high, the manager is forced to keep requests on the lower floors, even if they might fail, to pay down the debt.
This ensures the system doesn't just send everything to the expensive top floor. It balances the cost of sending requests up against the benefit of getting them right.

How It Works in Practice

The Interns Learn: As requests come in, the system tries different strategies (e.g., "Send hard questions to Floor 2," "Keep easy questions on Floor 1").
The Feedback Loop: When a request finally reaches the top and gets a "Correct" or "Incorrect" verdict, that signal travels back down.
The Smart Update: The algorithm uses the "Baseline" to smooth out the signal. It doesn't overreact. It gently adjusts the interns' confidence.
The Budget Check: The "Debt Meter" ensures that the system doesn't overspend on sending requests up. If the meter is high, it forces the system to be more conservative.

The Results

The paper tested this on a massive dataset with thousands of different tasks (like writing code, summarizing news, or analyzing images).

Old methods (like the "Naive" approach) got confused and unstable as the system got deeper. They either sent too many requests to the top (wasting money) or got stuck on the bottom (getting answers wrong).
The New Method (VR-Ly-EXP4) stayed calm. It learned faster, made fewer mistakes, and stayed within the budget. It figured out exactly which requests to handle locally and which to pass up, even when it rarely got to see the final result.

The Takeaway

This paper solves a problem that happens whenever you have a deep, complex system where you can't easily see the results of your early decisions. By using a "baseline" to smooth out the noise and a "budget meter" to control costs, the system learns to make smart decisions even when the feedback is sparse and delayed.

It's the difference between a chaotic gambler who bets everything on a single lucky spin, and a disciplined investor who builds a portfolio that grows steadily over time, regardless of market volatility.

1. Problem Formulation

The paper addresses the challenge of optimizing routing policies in Multi-Layer Hierarchical Inference (HI) systems. These systems consist of a hierarchy of computing nodes (e.g., edge devices, intermediate servers, cloud/oracle) where tasks (jobs) are processed.

System Dynamics: A job arrives at an entry node (Layer 1). The node can either:
1. Terminate locally: Execute the inference using a loaded model and accept the result.
2. Offload: Send the job from its current layer $k$ to a node in the next layer ($k+1$) for further processing.
This continues until the job reaches the final Oracle Layer (Layer $K$), which provides ground-truth supervision (e.g., a human judge or a perfect cloud model).
The Core Challenge: Learning the optimal routing policy is difficult due to three specific factors:
1. Recursive Loss: The inference error of a task is defined recursively. The loss incurred at a node depends on whether it terminates locally or offloads, and if it offloads, on the subsequent decisions made by nodes in later layers.
2. Partial & Policy-Dependent Feedback: Feedback (the true error) is only revealed if the job reaches the Oracle Layer. Consequently, the probability of observing a loss for a specific node is policy-dependent (it depends on the routing decisions of every later node on the job's path) and depth-sensitive (it decays exponentially as the hierarchy deepens).
3. Long-Term Constraints: Routing decisions incur resource costs (communication and computation). The system must satisfy long-term average resource constraints at each node while minimizing the global inference error.
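
The depth sensitivity in point 2 can be illustrated with a minimal sketch. It assumes, purely for illustration, that each non-oracle layer on a job's path offloads independently with a fixed probability; the function name `reach_probability` and the value 0.2 are hypothetical and do not come from the paper.

```python
from math import prod

def reach_probability(offload_probs):
    """Probability that a job is offloaded at every layer on its path,
    and therefore reaches the oracle and produces feedback."""
    return prod(offload_probs)

# Hypothetical policy: every non-oracle layer offloads with probability 0.2.
# A K-layer hierarchy has K - 1 offloading decisions before the oracle.
for num_layers in (3, 4, 5):
    rho = reach_probability([0.2] * (num_layers - 1))
    print(f"K={num_layers}: probability of observing feedback = {rho:.4f}")
```

Because the probability is a product over layers, changing the routing policy at any later node changes how often an earlier node sees feedback, which is what makes the feedback policy-dependent.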

Standard importance-weighted estimators used in contextual bandits (like EXP4) fail here: in deep hierarchies the probability of observing feedback becomes vanishingly small, so the variance of the loss estimator explodes and learning becomes unstable.
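
As a worked illustration (a standard calculation, not taken from the paper), let $I \sim \mathrm{Bernoulli}(\rho)$ indicate that feedback arrives, where $\rho$ is the probability that the job reaches the oracle, and treat the loss $f \in [0, 1]$ as fixed. The naive importance-weighted estimator $\hat{F} = I f / \rho$ then satisfies:

$\mathbb{E}[\hat{F}] = \rho \cdot \frac{f}{\rho} = f$
$\mathrm{Var}[\hat{F}] = \rho \cdot \frac{f^2}{\rho^2} - f^2 = f^2 \, \frac{1 - \rho}{\rho}$

The estimator is unbiased, but its variance grows like $1/\rho$. Taking the feedback rates quoted in the experiments as illustrative values of $\rho$ and assuming $f = 1$, the variance is about 67 at $\rho = 0.0146$ and about 4999 at $\rho = 0.0002$.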

2. Methodology

The authors propose VR-Ly-EXP4, a distributed online learning framework that integrates Lyapunov Optimization with a Variance-Reduced EXP4 algorithm.

A. Lyapunov Optimization for Constraints

To handle long-term resource constraints without knowing future job arrivals, the authors use Lyapunov optimization:

Virtual Queues: A virtual queue $Q_n(t)$ is maintained for each node $n$ to track the accumulated excess of resource consumption over the allowed budget.
Drift-Plus-Penalty: The objective is transformed into minimizing a "drift-plus-penalty" term per time slot. This balances stabilizing the queues (satisfying constraints) with minimizing the expected inference error (performance).
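
The two ideas above can be sketched for a single node. This is a minimal illustration that assumes the textbook Lyapunov forms, a queue update $Q(t+1) = \max(Q(t) + c(t) - b, 0)$ and a per-slot score $V \cdot \text{loss} + Q \cdot \text{cost}$; the paper's exact expressions may differ, and the action costs, losses, budget, and trade-off parameter `v` are hypothetical values.

```python
def update_queue(queue, cost, budget):
    """Virtual queue update: the backlog grows when the slot's cost
    exceeds the per-slot budget and never drops below zero."""
    return max(queue + cost - budget, 0.0)

def drift_plus_penalty(queue, cost, expected_loss, tradeoff_v):
    """Per-slot score to minimize: V * loss (penalty) + Q * cost (drift)."""
    return tradeoff_v * expected_loss + queue * cost

# Hypothetical actions: (name, resource cost, expected inference loss).
ACTIONS = [("local", 0.1, 0.30), ("offload", 1.0, 0.05)]

def choose_action(queue, tradeoff_v):
    """Pick the action with the smallest drift-plus-penalty score."""
    return min(
        ACTIONS,
        key=lambda a: drift_plus_penalty(queue, a[1], a[2], tradeoff_v),
    )

queue, budget, v = 0.0, 0.4, 2.0
history = []
for _ in range(10):
    name, cost, _loss = choose_action(queue, v)
    queue = update_queue(queue, cost, budget)
    history.append(name)
print(history)
```

With an empty queue the accurate but expensive offload wins; as the backlog grows, the cost term dominates and the node falls back to local execution until the backlog drains.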

B. Variance-Reduced Loss Estimation

The core technical innovation is a new loss estimator designed to handle the sparse, policy-dependent feedback:

Naive Estimator Failure: A standard importance-weighted estimator scales the observed loss by $1/\rho$, where $\rho$ is the probability of reaching the oracle. In deep hierarchies, $\rho$ is very small, causing massive variance.
Variance-Reduced Estimator (VR): The authors introduce a task-conditioned baseline $\bar{f}$. The estimator is defined as:
$\hat{F}_{vr} = \mathbb{I}_{feedback} \frac{f - \bar{f}}{\rho} + \bar{f}$
- $\mathbb{I}_{feedback}$: Indicator that equals 1 if the job reached the oracle (feedback observed) and 0 otherwise.
- $f$: The true loss, available only when feedback is observed.
- $\bar{f}$: An estimate of the expected loss based on historical data for the specific task type.
Mechanism: By subtracting the baseline $\bar{f}$ before importance weighting, the factor $1/\rho$ multiplies only the residual $f - \bar{f}$, which is small when the baseline is accurate, so the variance of the estimator is significantly reduced. The baseline is then added back to ensure the estimator remains unbiased. This allows the algorithm to learn effectively even when feedback is extremely sparse.
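
The effect of the baseline can be checked numerically. The sketch below computes the exact mean and variance of the estimator above under two simplifying assumptions that are not from the paper: the loss $f$ is treated as a fixed number, and the feedback indicator is a Bernoulli draw with success probability $\rho$. Setting the baseline to 0 recovers the naive importance-weighted estimator. The function name and the numeric values are hypothetical.

```python
def estimator_moments(f, f_bar, rho):
    """Exact mean and variance of F_hat = I * (f - f_bar) / rho + f_bar,
    where I ~ Bernoulli(rho) and f is treated as fixed."""
    observed = (f - f_bar) / rho + f_bar  # value when feedback arrives
    unobserved = f_bar                    # value when it does not
    mean = rho * observed + (1 - rho) * unobserved
    var = (rho * (observed - mean) ** 2
           + (1 - rho) * (unobserved - mean) ** 2)
    return mean, var

f, rho = 0.8, 0.01  # hypothetical true loss and feedback probability

naive_mean, naive_var = estimator_moments(f, 0.0, rho)  # no baseline
vr_mean, vr_var = estimator_moments(f, 0.7, rho)        # baseline near f

print(f"naive: mean={naive_mean:.3f}, variance={naive_var:.2f}")
print(f"VR:    mean={vr_mean:.3f}, variance={vr_var:.2f}")
```

Both estimators have mean $f$, so the baseline does not introduce bias. Under these assumptions the variance works out to $(f - \bar{f})^2 (1 - \rho)/\rho$, so the reduction depends on the baseline being closer to the true loss than 0 is; a poor baseline would not help.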

C. Algorithm Structure

Routing (EXP4): Each node maintains a distribution over "experts." An expert is a pair $(h, n')$ consisting of a confidence threshold $h$ and a destination node $n'$. The algorithm updates expert weights based on the variance-reduced loss estimates.
Model Placement: Periodically (every $D$ slots), nodes update their loaded models using a greedy submodular maximization algorithm to adapt to workload shifts, maximizing local execution performance under memory constraints.
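
The routing step can be sketched as a plain exponential-weights update over threshold experts. This is a simplified illustration, not the authors' implementation: it assumes an expert $(h, n')$ advises local termination when the local model's confidence is at least $h$ and offloading to $n'$ otherwise, and it takes the per-action loss estimates as given inputs (in the full method they would come from the variance-reduced estimator). The expert set, node names, learning rate `eta`, and loss values are hypothetical.

```python
import math

# Hypothetical expert set: (confidence threshold h, destination node n').
EXPERTS = [(0.3, "B"), (0.3, "C"), (0.7, "B"), (0.7, "C")]

def advice(expert, confidence):
    """Expert (h, n') keeps the job local if confidence >= h,
    otherwise it offloads to n'."""
    threshold, destination = expert
    return "local" if confidence >= threshold else destination

def action_distribution(weights, confidence):
    """Mix expert advice into a distribution over actions,
    proportional to the expert weights."""
    total = sum(weights)
    probs = {}
    for w, expert in zip(weights, EXPERTS):
        action = advice(expert, confidence)
        probs[action] = probs.get(action, 0.0) + w / total
    return probs

def update_weights(weights, confidence, loss_estimates, eta):
    """Exponential-weights step: each expert is charged the estimated
    loss of the action it advised for this job."""
    return [
        w * math.exp(-eta * loss_estimates.get(advice(e, confidence), 0.0))
        for w, e in zip(weights, EXPERTS)
    ]

weights = [1.0] * len(EXPERTS)
confidence = 0.5  # local model's confidence on the current job
probs = action_distribution(weights, confidence)

# Hypothetical loss estimates for each action on this job.
loss_estimates = {"local": 0.9, "B": 0.2, "C": 0.4}
new_weights = update_weights(weights, confidence, loss_estimates, eta=0.5)
print(probs)
print(new_weights)
```

In this example the two low-threshold experts advise local execution and are penalized most, so probability mass shifts toward the experts that would have offloaded this job to node B.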

3. Key Contributions

Structured Formulation: The paper formally defines multi-layer HI as an online learning problem with recursively defined loss and policy-dependent, terminal-only feedback, a setting not covered by existing shallow or single-destination models.
Variance-Reduced Algorithm (VR-Ly-EXP4): The paper develops a distributed algorithm that combines Lyapunov optimization with a novel variance-reduced estimator. This estimator is proven to have lower variance than naive importance-weighted methods while maintaining unbiasedness.
Theoretical Guarantees:
- Sublinear Regret: The algorithm achieves sublinear regret, $O(\sqrt{\Gamma})$, relative to the best fixed routing policy in hindsight.
- Constraint Satisfaction: The virtual queues are proven to be mean-rate stable, ensuring long-term resource constraints are met.
- Near-Optimality: The system performance approaches the optimal continuous policy up to a discretization error.
Empirical Validation: Extensive experiments on large-scale, multi-modal workloads (text and vision) demonstrate superior stability and performance compared to standard baselines.

4. Experimental Results

The authors evaluated VR-Ly-EXP4 on a benchmark derived from RouterBench and VL-RouterBench, spanning 3 to 5 layers of hierarchy with up to 31 nodes.

Performance Metrics: The algorithm was measured by Inference Error Rate and Hit Rate (the ability to route difficult tasks to the Oracle).
Key Findings:
- Superiority over Baselines: VR-Ly-EXP4 consistently outperformed static heuristics (Random, Round-Robin, Pure Local) and the non-variance-reduced baseline (Ly-EXP4).
- Stability in Deep Hierarchies: As the network depth increased (3 to 5 layers), the feedback rate for standard methods dropped drastically (e.g., from 0.0146 to 0.0002), causing them to fail. VR-Ly-EXP4 maintained a high hit rate (>0.44) and low error rate across all depths.
- Importance of Upstream Loss: An ablation study (VR-Ly-EXP4-LocalLoss) showed that incorporating the expected upstream loss (the recursive component) is crucial. Without it, intermediate nodes cannot accurately evaluate the cost of offloading.
- Model Placement: The greedy model placement strategy further improved performance, particularly when combined with adaptive routing.

5. Significance

This work is significant for the deployment of Large Language Models (LLMs) and foundation models in resource-constrained, distributed environments.

Scalability: It provides a theoretical and practical framework for scaling inference systems from edge to cloud without incurring prohibitive communication costs or latency.
Robust Learning: It addresses the "sparse feedback" problem in deep hierarchical systems, enabling stable learning in settings where standard importance-weighted bandit estimators become unstable.
Resource Efficiency: By dynamically balancing accuracy and resource usage, the system enables the efficient utilization of expensive cloud resources only for tasks that truly require them, while leveraging cheaper edge models for simpler tasks.

In summary, the paper bridges the gap between theoretical online learning and practical distributed inference systems, offering a robust solution for dynamic, multi-layer AI architectures.