MASPOB: Bandit-Based Prompt Optimization for Multi-Agent Systems with Graph Neural Networks

MASPOB is a novel, sample-efficient framework that optimizes prompts for Multi-Agent Systems by integrating Upper Confidence Bound bandits for budget-constrained exploration, Graph Neural Networks to model topology-induced prompt coupling, and coordinate ascent to reduce search complexity, thereby achieving state-of-the-art performance across diverse benchmarks.

Zhi Hong, Qian Zhang, Jiahang Sun, Zhiwei Shang, Mingze Kong, Xiangyi Wang, Yao Shu, Zhongxiang Dai

Published 2026-03-04

Imagine you have a team of highly intelligent robots (Multi-Agent Systems) working together to solve a complex problem, like writing a piece of software, diagnosing a medical condition, or solving a difficult math puzzle. Each robot has a specific job: one is the "Researcher," one is the "Coder," and one is the "Editor."

To make them work well, you give each robot a specific set of instructions, called a Prompt. Think of these prompts as the "personality" or "mood" you set for each robot. If the Researcher is too aggressive, they might miss details. If the Editor is too lazy, they might miss typos.

The problem is that finding the perfect set of instructions for the whole team is incredibly hard. You can't just guess, because testing a new set of instructions requires running the whole team through a real task, which is slow, expensive, and uses up a limited "budget" of time and money.

This is where the paper's new method, MASPOB, comes in. It's like a super-smart coach trying to tune the team's instructions without wasting any practice time. Here is how it works, broken down into three simple concepts:

1. The "Bandit" Coach (Balancing Risk and Reward)

Imagine you are at a casino with a slot machine that has 100 levers, but you only have enough coins for 50 pulls. You want to find the lever that pays out the most, but you don't know which one it is.

  • The Old Way: You might just pull the lever that paid out well yesterday (Exploitation). But you might miss a better lever you haven't tried yet.
  • The MASPOB Way: It uses a strategy called a "Bandit." It balances trying new levers (Exploration) with sticking to the good ones (Exploitation). It asks: "Is this lever promising, or is it just unknown?" If a lever is unknown, the coach gives it a "bonus" score to encourage trying it. This ensures the coach finds the best instructions without wasting the limited budget on bad guesses.
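The "bonus for the unknown" idea is the classic Upper Confidence Bound (UCB1) rule. Here is a minimal, self-contained sketch of it as a toy slot-machine simulation; the payout probabilities, seed, and function names are illustrative, not from the paper:

```python
import math
import random

def ucb1_select(means, counts, t, c=1.414):
    # Pick the arm with the highest score: empirical mean payout plus an
    # exploration bonus that is large for rarely-tried arms.
    best, best_score = 0, -float("inf")
    for i, (m, n) in enumerate(zip(means, counts)):
        score = float("inf") if n == 0 else m + c * math.sqrt(math.log(t) / n)
        if score > best_score:
            best, best_score = i, score
    return best

def run_bandit(true_payouts, budget, seed=0):
    # Simulate `budget` pulls over arms with the given payout probabilities.
    rng = random.Random(seed)
    k = len(true_payouts)
    means, counts = [0.0] * k, [0] * k
    for t in range(1, budget + 1):
        arm = ucb1_select(means, counts, t)
        reward = 1.0 if rng.random() < true_payouts[arm] else 0.0
        counts[arm] += 1
        means[arm] += (reward - means[arm]) / counts[arm]  # running average
    return counts

counts = run_bandit([0.2, 0.5, 0.8], budget=200)
print(counts)  # the 0.8 arm should receive the most pulls
```

Note how the bonus term `sqrt(log(t) / n)` shrinks as an arm is pulled more often, so the budget naturally shifts from exploration to exploitation.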

2. The "Graph" Map (Understanding the Team's Connection)

In a team, the robots aren't isolated. The output of the "Researcher" becomes the input for the "Coder." If the Researcher changes their style, it messes up the Coder's work.

  • The Old Way: Most optimizers treat each robot like a separate person. They tune the Researcher, then the Coder, ignoring how they affect each other. It's like tuning a car engine without checking how it connects to the transmission.
  • The MASPOB Way: It uses a Graph Neural Network (GNN). Imagine a map where every robot is a dot, and the lines connecting them show who talks to whom. The coach looks at this map. It understands that if it changes the Researcher's prompt, it must adjust the Coder's prompt to match. It learns the "shape" of the team's workflow, so it doesn't break the chain of command.
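The core GNN idea, stripped of learned weights, is message passing over the team's "map": each agent's representation gets mixed with its upstream neighbors' before any score is predicted. This toy sketch uses mean aggregation and hand-picked two-dimensional embeddings; the agent names, graph, and vectors are illustrative assumptions, not the paper's actual architecture:

```python
# Hypothetical agent graph: each agent lists the upstream agents it reads from.
graph = {"Researcher": [], "Coder": ["Researcher"], "Editor": ["Coder"]}

# Toy prompt embeddings (a real system would get these from a text encoder).
emb = {"Researcher": [1.0, 0.0], "Coder": [0.0, 1.0], "Editor": [0.5, 0.5]}

def message_pass(graph, emb):
    # One round of mean-aggregation message passing: average each agent's
    # vector with its upstream neighbors' vectors, so a change to the
    # Researcher's prompt shows up in the Coder's representation.
    new = {}
    for node, parents in graph.items():
        vecs = [emb[node]] + [emb[p] for p in parents]
        dim = len(emb[node])
        new[node] = [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]
    return new

print(message_pass(graph, emb))
```

Stacking several such rounds lets information flow multiple hops down the chain (Researcher to Coder to Editor), which is exactly the "coupling" a per-agent optimizer ignores.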

3. The "One-At-A-Time" Strategy (Avoiding the Explosion)

If you have 5 robots and 20 possible instructions for each, there are 20 × 20 × 20 × 20 × 20 = 20⁵ (3.2 million) possible combinations. Checking them all is impossible.

  • The Old Way: Trying to check every combination is like trying to read every book in a library to find the best one. It takes forever.
  • The MASPOB Way: It uses Coordinate Ascent. Instead of changing all 5 robots at once, it changes one robot's instructions while keeping the other four fixed. It finds the best setting for Robot A, then the best for Robot B, and so on. It's like tuning a guitar: you tighten one string, listen, then move to the next. This turns a massive, impossible mountain of choices into a simple, walkable path.
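Coordinate ascent itself fits in a few lines. The sketch below uses a hypothetical `team_score` stand-in for the expensive real evaluation (one call = one full team run); the target tuple and function names are made up for illustration:

```python
def team_score(assignment):
    # Hypothetical black-box evaluator: rewards matching a hidden best
    # prompt index per agent. A real run would execute the whole team.
    target = (3, 1, 4, 1, 5)
    return sum(a == t for a, t in zip(assignment, target))

def coordinate_ascent(num_agents=5, num_prompts=20, sweeps=2):
    # Optimize one agent's prompt at a time, holding the other four fixed.
    # Cost: sweeps * num_agents * num_prompts evaluations (here 200),
    # versus num_prompts ** num_agents (3.2 million) for exhaustive search.
    current = [0] * num_agents
    for _ in range(sweeps):
        for agent in range(num_agents):
            current[agent] = max(
                range(num_prompts),
                key=lambda p: team_score(
                    current[:agent] + [p] + current[agent + 1:]
                ),
            )
    return current

print(coordinate_ascent())  # recovers [3, 1, 4, 1, 5] in 200 evaluations
```

The catch with plain coordinate ascent is that agents interact, which is exactly why MASPOB pairs it with the graph model above: the GNN predicts how a change to one agent propagates, so each one-at-a-time step stays informed about the rest of the team.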

The Result

By combining these three ideas, MASPOB acts like a master conductor. It knows the music (the task), understands how the instruments (the agents) interact, and knows exactly which notes (prompts) to tweak to get a perfect symphony—all while using very few practice runs.

In short:

  • The Problem: Tuning a team of AI agents is expensive and the instructions are too interconnected to guess randomly.
  • The Solution: A smart system that uses a "map" of the team's connections and a "gambling strategy" to find the best instructions quickly, changing one agent at a time to avoid getting lost in the complexity.
  • The Outcome: The team performs significantly better, solving harder problems with less wasted time and money.
