Original authors: Dan Qiao, Wenhao Li, Shanchao Yang, Hongyuan Zha, Baoxiang Wang

Published 2026-05-29. Author reviewed.

Original paper licensed under CC BY 4.0 (http://creativecommons.org/licenses/by/4.0/). This is an AI-generated explanation of the paper below. It is not written by the authors. For technical accuracy, refer to the original paper.

Imagine you are trying to teach a group of robots how to work together to pick up apples. You have a massive video library (a dataset) showing how different teams of robots did this job in the past. Some teams picked the red apple together, others picked the green one, and some just wandered around aimlessly.

The challenge is that you cannot let the robots practice in the real world anymore; you can only teach them by watching these old videos. This is called Offline Multi-Agent Reinforcement Learning.

The Problem: The "Confused Choir"

In the past, when researchers tried to teach robots from these mixed-up videos, they made a big mistake. They treated each robot as if it were learning alone, ignoring how the others were moving.

Imagine a choir whose songbook contains several different songs. If you tell the soprano to sing "Song A" and the bass to sing "Song B" based on their individual habits, the result is a terrible, chaotic noise. In the robot world, this leads to miscoordination. The robots might try to pick up two different apples at the same time, or they might try to grab an apple that no one in the video ever successfully grabbed. They end up doing things that look "okay" for one robot but are disastrous for the team.

The paper calls this the "Combinatorial Mode Shift." It's like trying to build a house by mixing blueprints from a castle, a tent, and a skyscraper. The result isn't a house; it's a pile of mismatched bricks.

The Solution: OMSD (The "Conductor's Baton")

The authors propose a new method called OMSD (Offline Multi-agent Reinforcement Learning via Sequential Score Decomposition).

Here is how it works, using a simple analogy:

1. The "Line-Up" Strategy (Sequential Decomposition)
Instead of asking every robot what it should do based on its own memory, OMSD asks them in a specific order, like a line of people waiting to enter a room.

Robot A goes first and decides, "I'm going to the red apple."
Robot B sees Robot A's decision and thinks, "Okay, since Robot A is going to the red apple, I should also go to the red apple to help."
Robot C sees both and follows suit.

By looking at what the previous robots decided, each robot learns the context of the team's plan. This prevents them from accidentally picking a different apple or wandering off.

2. The "Diffusion" Magic (The Score Function)
To make this work, the researchers use a special type of AI called a Diffusion Model. Think of this like a "noise-remover" or a "blur-clarifier."

Imagine the old videos are a bit blurry and full of static.
The Diffusion Model acts like a smart filter that knows exactly how to "denoise" the data. It doesn't just guess a random action; it calculates a "score" or a "direction" that points toward the actions the team actually took in the successful videos.
It tells the robot: "Don't go that way (that's a mistake); go this way (that's where the team succeeded)."

3. The "Central Coach" (Critic)
While the robots learn their specific moves in line, there is a "Central Coach" (a centralized critic) watching the whole team. This coach knows the total score the team gets. It tells the robots, "Hey, that red apple strategy gets a high score, keep doing that!"

Why It's Better

Previous methods tried to teach the robots by looking at their individual habits in isolation. This worked fine if everyone was doing the same thing, but failed miserably when the videos showed many different successful strategies (multimodal data).

OMSD fixes this by:

Respecting the Chain: It understands that Robot B's move depends on Robot A's move.
Staying in the Lane: It keeps the robots doing things that actually happened in the videos, preventing them from trying risky, made-up moves that don't exist in the data.
Finding the Best Path: It helps the team find the specific "mode" or strategy (like the red apple vs. the green apple) that yields the highest reward, without getting confused by the other strategies in the video library.

The Results

The authors tested this on various robot tasks, from simple games to complex physical simulations (like robots running or catching prey).

In simple tests: OMSD learned to coordinate perfectly, while other methods failed to agree on a plan.
In complex tests: OMSD consistently outperformed the best existing methods, especially when the training data was messy or showed many different ways to succeed.

In short, OMSD is like a smart conductor who doesn't just tell each musician to play their own part, but guides the whole orchestra to play in harmony by listening to the person before them and following the conductor's lead, ensuring the final performance is a hit rather than a disaster.

Technical Summary: Offline Multi-agent Reinforcement Learning via Sequential Score Decomposition

1. Problem Statement

Offline Multi-Agent Reinforcement Learning (MARL) faces a critical challenge distinct from single-agent offline RL: the distribution shift caused by the disparity between online and offline data collection. While online MARL typically converges to a single coordinated joint policy through interactive adaptation, offline datasets are often mixtures of diverse cooperative behaviors collected from various sources. This results in highly multimodal joint behavior distributions.

Existing offline MARL methods generally fall into two categories, both of which struggle with this multimodality:

Value-based methods: These rely on the Individual-Global-Max (IGM) principle and conservative value estimation. However, when agents use independent $\epsilon$-greedy policies, they can select out-of-distribution (OOD) joint actions that are low-quality and not covered by the dataset.
Policy-based methods: These often constrain policies via behavior regularization or centralized planners. A common pitfall is the assumption that the joint behavior policy can be factorized into independent marginals ($\mu(a|s) = \prod_{i=1}^n \mu_i(a_i|s)$). In multimodal settings, this independent factorization leads to "Combinatorial Mode Shift" (CMS). As agents are regularized toward their own marginal distributions, they lose alignment with the joint modes, resulting in joint policies that lie outside the high-density regions of the dataset. This misalignment causes severe distribution shifts and poor coordination.
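A tiny numerical illustration may make Combinatorial Mode Shift concrete. The sketch below assumes a hypothetical dataset with two agents and two actions each, in which only the coordinated joint actions (0, 0) and (1, 1) ever appear. The table and variable names are illustrative assumptions, not taken from the paper.

```python
import numpy as np

# Hypothetical joint behavior distribution mu(a_1, a_2) for a single state.
# Rows index agent 1's action, columns index agent 2's action.
# Only the coordinated pairs (0, 0) and (1, 1) occur in the data.
joint = np.array([[0.5, 0.0],
                  [0.0, 0.5]])

# Independent marginals, as assumed by the factorized regularizers.
marg_1 = joint.sum(axis=1)  # mu_1(a_1)
marg_2 = joint.sum(axis=0)  # mu_2(a_2)

# Product of marginals: the joint distribution the factorization implies.
independent = np.outer(marg_1, marg_2)

# Probability mass placed on joint actions that never occur in the data.
ood_mass = independent[joint == 0].sum()

print(independent)
print("mass on unseen joint actions:", ood_mass)
```

Each marginal looks perfectly in-distribution on its own, yet the product assigns half of its mass to the mismatched pairs (0, 1) and (1, 0), which is the kind of out-of-distribution joint behavior described above.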

2. Methodology: OMSD

The authors propose Offline MARL with Sequential Score Decomposition (OMSD) to address the multimodal coordination problem without requiring a full joint policy model or a centralized planner.

Core Concept: Sequential Decomposition

Instead of assuming conditional independence, OMSD factorizes the joint behavior policy using the chain rule, conditioning each agent's behavior on the actions of preceding agents:
$\mu(a|s) = \prod_{i=1}^n \mu_i(a_i | s, a_{<i})$
where $a_{<i}$ represents the joint actions of all agents preceding agent $i$. This sequential modeling captures inter-agent dependencies and provides an exact conditional reference for each agent's policy constraints.
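To see why the chain-rule factorization is exact, here is a minimal sketch on a small discrete joint distribution. The probability table and the helper name `sequential_factorization` are hypothetical and chosen only for illustration; the paper works with continuous actions and learned score models rather than explicit tables.

```python
import numpy as np


def sequential_factorization(joint):
    """Split a two-agent joint table into mu_1(a_1) and mu_2(a_2 | a_1).

    Assumes every row of `joint` has non-zero total mass.
    """
    mu_1 = joint.sum(axis=1)              # marginal of the first agent
    mu_2_given_1 = joint / mu_1[:, None]  # each row is a conditional over a_2
    return mu_1, mu_2_given_1


# Hypothetical joint behavior distribution for one state (rows: a_1, cols: a_2).
joint = np.array([[0.4, 0.0],
                  [0.1, 0.5]])

mu_1, mu_2_given_1 = sequential_factorization(joint)

# Chain rule: mu(a_1, a_2) = mu_1(a_1) * mu_2(a_2 | a_1).
reconstructed = mu_1[:, None] * mu_2_given_1

print(reconstructed)
```

Because the second agent's distribution is conditioned on the first agent's action, the reconstructed table matches the original joint, including the zero entry for the pair that never occurs in the data.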

Algorithmic Workflow

OMSD operates under the Centralized-Training-Decentralized-Execution (CTDE) framework and consists of three main stages:

Critic Pretraining: A centralized joint value function $Q_{tot}(s, a)$ is learned using offline Implicit Q-Learning (IQL) to provide reward guidance.
Score Pretraining: For each agent $i$, a conditional diffusion model is trained on the offline dataset to estimate the conditional score function $\nabla_{a_i} \log \mu_i(a_i | s, a_{<i})$.
- Crucially, these models are trained in parallel.
- The score function approximates the gradient of the log-probability of the behavior policy, serving as a behavior regularizer.
Policy Optimization: Agents update their policies using a gradient that combines the centralized critic signal and the sequential score regularization:
$\nabla_{\theta_i} L_i = \mathbb{E} \left[ \nabla_{a_i} Q_{tot}(s, a) + \frac{1}{\beta} \nabla_{a_i} \log \mu_i(a_i | s, a_{<i}) \right] \nabla_{\theta_i} \pi_{\theta_i}$
- Sequential Conditioning: During the update of agent $i$, the prefix actions $a_{<i}$ are sampled from the most recently updated policies of agents $1$ to $i-1$ within the same iteration.
- Execution: Despite the sequential update during training, execution remains fully decentralized. Each agent acts based on its local observation, as the sequential dependency is only used to guide the learning direction (score regularization) and not to generate actions at runtime.
- Efficiency: The method uses deterministic Dirac policies for prefix actions to avoid noise amplification and does not require iterative denoising sampling during execution, avoiding the high inference costs typical of diffusion-based actors.
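The policy-optimization stage can be sketched on a stateless two-agent toy problem. Everything below is an illustrative assumption rather than the authors' implementation: the behavior data is modeled as a two-mode Gaussian mixture whose conditional score is computed analytically (standing in for the pretrained diffusion score), the critic is a hand-written function $Q_{tot}(a) = a_1 a_2$ (standing in for the IQL-trained critic), and each deterministic policy is a single scalar parameter.

```python
import numpy as np

# Hypothetical behavior data: two joint modes, (+1, +1) and (-1, -1).
MODES = np.array([[1.0, 1.0], [-1.0, -1.0]])
WEIGHTS = np.array([0.5, 0.5])
SIGMA = 0.5  # per-dimension standard deviation of each mode (assumed)


def conditional_score(a, i):
    """Analytic d/da_i log mu_i(a_i | a_<i) for the isotropic Gaussian mixture.

    Stands in for the pretrained conditional diffusion score of agent i.
    """
    diff = a[: i + 1] - MODES[:, : i + 1]  # shape (num_modes, i + 1)
    logits = np.log(WEIGHTS) - (diff ** 2).sum(axis=1) / (2 * SIGMA ** 2)
    resp = np.exp(logits - logits.max())
    resp /= resp.sum()  # responsibility of each mode given a_1..a_i
    return float(resp @ (MODES[:, i] - a[i])) / SIGMA ** 2


def q_grad(a, i):
    """d/da_i of the toy critic Q_tot(a) = a_1 * a_2 (an assumption)."""
    return a[1 - i]


def train(theta0, beta=1.0, lr=0.05, steps=500):
    # Deterministic policy pi_i outputs theta[i], so d pi_i / d theta_i = 1.
    theta = np.array(theta0, dtype=float)
    for _ in range(steps):
        for i in range(len(theta)):
            # theta[:i] already holds this iteration's updated prefix actions.
            grad = q_grad(theta, i) + conditional_score(theta, i) / beta
            theta[i] += lr * grad  # gradient ascent on the regularized objective
    return theta


theta = train([0.3, -0.2])
print(theta)
```

Starting from parameters that initially disagree in sign, both agents settle in the same positive mode in this toy setting: agent 2's score is conditioned on agent 1's freshly updated action, so its regularizer pulls it toward the mode agent 1 is already approaching. At execution time each agent would simply output its own `theta[i]`, with no access to the other agent's action.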

3. Key Contributions

Identification of the Root Cause: The paper identifies the multimodal nature of offline joint behavior distributions and the failure of independent marginal factorization (leading to Combinatorial Mode Shift) as the primary cause of coordination failure in offline MARL.
OMSD Algorithm: The development of a novel framework that sequentially decomposes behavior policies and utilizes diffusion-based conditional scores as behavior regularizers. This approach promotes coordinated mode selection without modeling the full joint policy or relying on a centralized planner.
State-of-the-Art Performance: Extensive experiments demonstrate that OMSD consistently outperforms existing methods, particularly in challenging multimodal scenarios (e.g., medium-quality datasets).

4. Experimental Results

The authors evaluated OMSD on:

Toy Bandit Example: A 2-agent cooperative task with two optimal modes. OMSD achieved performance comparable to joint action learning (BRPO-JAL) and significantly outperformed independent learning (BRPO-IND) and naive CTDE methods, which failed to avoid OOD joint actions.
Multi-Agent Particle Environment (MPE): Tasks including Cooperative Navigation, Predator Prey, and World. OMSD achieved the best or second-best scores across Expert, Medium, and Random datasets. Notably, on "Medium" and "Random" datasets where multimodality is pronounced, OMSD showed significant gains (e.g., +70.6% on Predator Prey Random).
MaMuJoCo: High-dimensional continuous control tasks involving robot parts acting as agents (e.g., HalfCheetah, Ant). OMSD outperformed baselines like MA-CQL, CFCQL, MADiff, and DoF, especially on mixed-quality datasets (e.g., +73.9% average improvement over the strongest baseline on OMIGA datasets).

Ablation Studies:

Score Decomposition: OMSD consistently outperformed variants using independent factorization (BRPO-IND, BRPO-CTDE), confirming the necessity of sequential conditioning.
Order Sensitivity: The method was found to be robust to the order of agent updates, suggesting the sequential structure acts as a training-time coordination mechanism rather than a rigid inductive bias.
Density Estimators: Diffusion models outperformed simpler estimators (GMMs, Normalizing Flows) in capturing complex multimodal structures, particularly on expert and medium datasets.

5. Significance and Claims

The paper claims that modality-aware coordination is essential for robust offline MARL. By leveraging sequential score decomposition, OMSD successfully aligns policy updates with the true joint behavior distribution, avoiding the distribution shift caused by independent regularization.

The authors emphasize that their approach:

Avoids OOD Joint Actions: By conditioning on prefix actions, agents are guided toward high-value, in-distribution regions.
Maintains Decentralized Execution: Unlike methods requiring centralized planning or sequential execution at runtime, OMSD agents act independently during deployment.
Scalability: The pretraining of conditional score models is fully parallelizable across agents, making the method suitable for larger teams.

The work is presented as a significant step forward in handling the complexity of offline multi-agent data, specifically addressing the "Combinatorial Mode Shift" that has hindered previous policy-based approaches. The authors acknowledge limitations, such as the current focus on continuous action spaces and the dependency on the quality of the pretrained centralized critic.

Offline Multi-agent Reinforcement Learning via Sequential Score Decomposition