Imagine you are the manager of a busy multimedia call center. Your goal is to answer customer questions (which come in text, images, audio, or a mix of all three) as well as possible.
However, you have two major problems:
- You have a limited budget: You can only spend so much money on cloud servers and so much time waiting for answers.
- The workers are different: Some workers are fast but expensive (Cloud AI), while others are slow but free (your own laptop). Some are great at math, others are great at drawing, and some are just "okay" at everything.
Every time a customer calls, you have to instantly decide: Who should I send this job to? If you send a hard math problem to a cheap worker, the answer might be bad. If you send a simple question to an expensive worker, you waste money. If you guess wrong too many times, you run out of money before the day is over.
This paper introduces a smart system called M2-CMAB to solve this exact problem. Here is how it works, broken down into simple parts:
1. The "Smart Brain" (The Predictor)
The Problem: Usually, to know if a worker is good at a task, you have to ask them to do it first. But in a call center, you can't waste time asking every worker to try the job before hiring them.
The Solution: The system uses a "Frozen Brain" (a pre-trained AI model) that never changes its core knowledge. Instead of retraining the whole brain, it adds tiny, lightweight "sticky notes" (called Adapters) to it.
- Analogy: Imagine a master chef (the frozen brain) who knows how to cook everything. Instead of teaching the chef a new recipe from scratch every time, you just give them a small sticky note that says, "Today's customer likes spicy food, and we only have 5 minutes." The chef instantly knows how to adjust.
- Result: The system can instantly predict: "If we send this image question to the Cloud Worker, it will cost $0.50 and take 2 seconds. If we send it to the Local Worker, it will cost $0.00 but take 10 seconds and might be less accurate."
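To make the routing decision concrete, here is a minimal sketch of how per-worker predictions could drive the choice. The worker names, numbers, and function names are illustrative stand-ins, not details from the paper; the real system produces these predictions with its adapter-augmented frozen model rather than a lookup table.

```python
from dataclasses import dataclass

@dataclass
class Prediction:
    worker: str
    cost_usd: float   # predicted dollar cost of sending the task here
    latency_s: float  # predicted response time in seconds
    quality: float    # predicted answer quality in [0, 1]

def predict(task_type: str) -> list:
    """Stand-in for the adapter-augmented predictor (illustrative values)."""
    table = {
        "image": [
            Prediction("cloud", cost_usd=0.50, latency_s=2.0, quality=0.95),
            Prediction("local", cost_usd=0.00, latency_s=10.0, quality=0.80),
        ],
    }
    return table[task_type]

def best_under_budget(preds, max_cost):
    """Pick the highest-quality worker the remaining budget can afford."""
    affordable = [p for p in preds if p.cost_usd <= max_cost]
    return max(affordable, key=lambda p: p.quality)

print(best_under_budget(predict("image"), max_cost=0.10).worker)  # local
```

With only $0.10 left per task the cheap local worker wins; with a full budget the same function would route to the higher-quality cloud worker.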
2. The "Strict Accountant" (The Constrainer)
The Problem: If you just try to get the best answer every single time, you might spend your whole budget on the first 10 customers and have nothing left for the rest of the day.
The Solution: The system has a virtual "Accountant" that keeps a running tally of your budget.
- Analogy: Think of this like a Lagrange Multiplier (a fancy math term for a "budget penalty"). Imagine the Accountant is holding a leash. If you start spending too much money too fast, the Accountant tightens the leash, making expensive options look "less attractive" to the decision-maker. If you have plenty of budget, the leash loosens, and you can take risks to get better answers.
- Result: It balances the urge to get the best answer right now with the need to survive until the end of the day.
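The "leash" metaphor maps onto a standard Lagrangian penalty: subtract a multiplier times the cost from each option's score, and raise the multiplier whenever spending runs ahead of the per-task budget. The sketch below is a generic version of that idea, not the paper's exact update rule; all names and numbers are illustrative.

```python
def penalized_score(quality, cost, lam):
    """Expensive options look less attractive as the multiplier lam grows."""
    return quality - lam * cost

def update_multiplier(lam, spend_rate, target_rate, step=0.1):
    """Tighten the leash when spending exceeds the target; never below zero."""
    return max(0.0, lam + step * (spend_rate - target_rate))

# Overspending at $0.80/task against a $0.50/task target raises lam.
lam = update_multiplier(0.0, spend_rate=0.80, target_rate=0.50)

# Once lam is large, a cheap mediocre answer beats an expensive great one:
# cloud (quality 0.95, $0.50) vs. local (quality 0.80, free) at lam = 2.0.
assert penalized_score(0.95, 0.50, lam=2.0) < penalized_score(0.80, 0.00, lam=2.0)
```

Note how the penalty flips the decision only when the budget is under pressure; with plenty of budget lam stays near zero and quality dominates.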
3. The "Strategic Manager" (The Scheduler)
The Problem: You need to decide who gets the job right now, but you don't know what the next 1,000 calls will look like.
The Solution: The system uses a two-phase strategy built on the classic trade-off between Exploration and Exploitation.
- Phase 1 (The Training Camp): At the very start, the system tries every worker on a few different types of tasks just to learn the basics. It's like a coach letting players try every position to see who is good at what.
- Phase 2 (The Game): Once it has a good idea, it mostly picks the best worker for the job (Exploitation). But, it still occasionally tries a different worker (Exploration) just in case the "best" worker has a bad day or the task is tricky.
- Result: It learns on the fly, adapting to changing conditions without crashing your budget.
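The two phases above can be sketched with a generic epsilon-greedy loop. This is a common stand-in for bandit-style scheduling, not the paper's exact algorithm; `warmup` and `epsilon` are hypothetical knobs.

```python
import random

def choose_worker(round_idx, workers, avg_quality, warmup=10, epsilon=0.1):
    """Pick a worker for this round of the bandit loop."""
    if round_idx < warmup:
        # Phase 1 (training camp): cycle through everyone to learn the basics.
        return workers[round_idx % len(workers)]
    if random.random() < epsilon:
        # Phase 2, rare case: explore a random worker in case things changed.
        return random.choice(workers)
    # Phase 2, usual case: exploit the worker with the best track record.
    return max(workers, key=lambda w: avg_quality[w])
```

In practice `avg_quality` would be updated after every answer, so the "best" worker can change as conditions drift, which is exactly the on-the-fly adaptation the Result bullet describes.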
Why is this paper a big deal?
Most previous systems were like a blindfolded archer: they guessed which worker to pick based on simple rules (e.g., "always pick the cheapest").
This new system is like a sharpshooter with a radar:
- It understands the task deeply (is it a math problem? a drawing?).
- It predicts the cost and quality instantly without wasting time.
- It manages the budget so it doesn't run out before the day ends.
The Bottom Line:
The researchers tested this on a mix of real-world tasks (math, diagrams, conversations) and found that their system got 14% better results than the best existing methods, all while staying strictly within the budget. It's a smarter way to run AI services so you get the best answers without breaking the bank.