MASPOB: Bandit-Based Prompt Optimization for Multi-Agent Systems with Graph Neural Networks

MASPOB is a novel, sample-efficient framework that optimizes prompts for Multi-Agent Systems by integrating Upper Confidence Bound bandits for budget-constrained exploration, Graph Neural Networks to model topology-induced prompt coupling, and coordinate ascent to reduce search complexity, thereby achieving state-of-the-art performance across diverse benchmarks.

Zhi Hong, Qian Zhang, Jiahang Sun, Zhiwei Shang, Mingze Kong, Xiangyi Wang, Yao Shu, Zhongxiang Dai

Published 2026-03-04

Imagine you have a team of highly intelligent robots (Multi-Agent Systems) working together to solve a complex problem, like writing a piece of software, diagnosing a medical condition, or solving a difficult math puzzle. Each robot has a specific job: one is the "Researcher," one is the "Coder," and one is the "Editor."

To make them work well, you give each robot a specific set of instructions, called a Prompt. Think of these prompts as the "personality" or "mood" you set for each robot. If the Researcher is too aggressive, they might miss details. If the Editor is too lazy, they might miss typos.

The problem is that finding the perfect set of instructions for the whole team is incredibly hard. You can't just guess, because testing a new set of instructions requires running the whole team through a real task, which is slow, expensive, and uses up a limited "budget" of time and money.

This is where the paper's new method, MASPOB, comes in. It's like a super-smart coach trying to tune the team's instructions without wasting any practice time. Here is how it works, broken down into three simple concepts:

1. The "Bandit" Coach (Balancing Risk and Reward)

Imagine you are at a casino with a slot machine that has 100 levers, but you only have enough coins for 50 pulls. You want to find the lever that pays out the most, but you don't know which one it is.

  • The Old Way: You might just pull the lever that paid out well yesterday (Exploitation). But you might miss a better lever you haven't tried yet.
  • The MASPOB Way: It uses a strategy called a "Bandit." It balances trying new levers (Exploration) with sticking to the good ones (Exploitation). It asks: "Is this lever promising, or is it just unknown?" If a lever is unknown, the coach gives it a "bonus" score to encourage trying it. This ensures the coach finds the best instructions without wasting the limited budget on bad guesses.
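The "bonus for the unknown" idea is the classic Upper Confidence Bound (UCB1) rule. Here is a minimal, self-contained sketch of it as a toy slot-machine simulation; the payout probabilities, seed, and function names are illustrative, not from the paper:

```python
import math
import random

def ucb1_select(means, counts, t, c=1.414):
    # Pick the arm with the highest score: empirical mean payout plus an
    # exploration bonus that is large for rarely-tried arms.
    best, best_score = 0, -float("inf")
    for i, (m, n) in enumerate(zip(means, counts)):
        score = float("inf") if n == 0 else m + c * math.sqrt(math.log(t) / n)
        if score > best_score:
            best, best_score = i, score
    return best

def run_bandit(true_payouts, budget, seed=0):
    # Simulate `budget` pulls over arms with the given payout probabilities.
    rng = random.Random(seed)
    k = len(true_payouts)
    means, counts = [0.0] * k, [0] * k
    for t in range(1, budget + 1):
        arm = ucb1_select(means, counts, t)
        reward = 1.0 if rng.random() < true_payouts[arm] else 0.0
        counts[arm] += 1
        means[arm] += (reward - means[arm]) / counts[arm]  # running average
    return counts

counts = run_bandit([0.2, 0.5, 0.8], budget=200)
print(counts)  # the 0.8 arm should receive the most pulls
```

Note how the bonus term `sqrt(log(t) / n)` shrinks as an arm is pulled more often, so the budget naturally shifts from exploration to exploitation.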

2. The "Graph" Map (Understanding the Team's Connection)

In a team, the robots aren't isolated. The output of the "Researcher" becomes the input for the "Coder." If the Researcher changes their style, it messes up the Coder's work.

  • The Old Way: Most optimizers treat each robot like a separate person. They tune the Researcher, then the Coder, ignoring how they affect each other. It's like tuning a car engine without checking how it connects to the transmission.
  • The MASPOB Way: It uses a Graph Neural Network (GNN). Imagine a map where every robot is a dot, and the lines connecting them show who talks to whom. The coach looks at this map. It understands that if it changes the Researcher's prompt, it must adjust the Coder's prompt to match. It learns the "shape" of the team's workflow, so it doesn't break the chain of command.
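The core GNN idea, stripped of learned weights, is message passing over the team's "map": each agent's representation gets mixed with its upstream neighbors' before any score is predicted. This toy sketch uses mean aggregation and hand-picked two-dimensional embeddings; the agent names, graph, and vectors are illustrative assumptions, not the paper's actual architecture:

```python
# Hypothetical agent graph: each agent lists the upstream agents it reads from.
graph = {"Researcher": [], "Coder": ["Researcher"], "Editor": ["Coder"]}

# Toy prompt embeddings (a real system would get these from a text encoder).
emb = {"Researcher": [1.0, 0.0], "Coder": [0.0, 1.0], "Editor": [0.5, 0.5]}

def message_pass(graph, emb):
    # One round of mean-aggregation message passing: average each agent's
    # vector with its upstream neighbors' vectors, so a change to the
    # Researcher's prompt shows up in the Coder's representation.
    new = {}
    for node, parents in graph.items():
        vecs = [emb[node]] + [emb[p] for p in parents]
        dim = len(emb[node])
        new[node] = [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]
    return new

print(message_pass(graph, emb))
```

Stacking several such rounds lets information flow multiple hops down the chain (Researcher to Coder to Editor), which is exactly the "coupling" a per-agent optimizer ignores.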

3. The "One-At-A-Time" Strategy (Avoiding the Explosion)

If you have 5 robots and 20 possible instructions for each, there are 20 × 20 × 20 × 20 × 20 = 20⁵ (3.2 million) possible combinations. Checking them all is impossible.

  • The Old Way: Trying to check every combination is like trying to read every book in a library to find the best one. It takes forever.
  • The MASPOB Way: It uses Coordinate Ascent. Instead of changing all 5 robots at once, it changes one robot's instructions while keeping the other four fixed. It finds the best setting for Robot A, then the best for Robot B, and so on. It's like tuning a guitar: you tighten one string, listen, then move to the next. This turns a massive, impossible mountain of choices into a simple, walkable path.
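Coordinate ascent itself fits in a few lines. The sketch below uses a hypothetical `team_score` stand-in for the expensive real evaluation (one call = one full team run); the target tuple and function names are made up for illustration:

```python
def team_score(assignment):
    # Hypothetical black-box evaluator: rewards matching a hidden best
    # prompt index per agent. A real run would execute the whole team.
    target = (3, 1, 4, 1, 5)
    return sum(a == t for a, t in zip(assignment, target))

def coordinate_ascent(num_agents=5, num_prompts=20, sweeps=2):
    # Optimize one agent's prompt at a time, holding the other four fixed.
    # Cost: sweeps * num_agents * num_prompts evaluations (here 200),
    # versus num_prompts ** num_agents (3.2 million) for exhaustive search.
    current = [0] * num_agents
    for _ in range(sweeps):
        for agent in range(num_agents):
            current[agent] = max(
                range(num_prompts),
                key=lambda p: team_score(
                    current[:agent] + [p] + current[agent + 1:]
                ),
            )
    return current

print(coordinate_ascent())  # recovers [3, 1, 4, 1, 5] in 200 evaluations
```

The catch with plain coordinate ascent is that agents interact, which is exactly why MASPOB pairs it with the graph model above: the GNN predicts how a change to one agent propagates, so each one-at-a-time step stays informed about the rest of the team.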

The Result

By combining these three ideas, MASPOB acts like a master conductor. It knows the music (the task), understands how the instruments (the agents) interact, and knows exactly which notes (prompts) to tweak to get a perfect symphony—all while using very few practice runs.

In short:

  • The Problem: Tuning a team of AI agents is expensive and the instructions are too interconnected to guess randomly.
  • The Solution: A smart system that uses a "map" of the team's connections and a "gambling strategy" to find the best instructions quickly, changing one agent at a time to avoid getting lost in the complexity.
  • The Outcome: The team performs significantly better, solving harder problems with less wasted time and money.
