VideoChat-M1: Collaborative Policy Planning for Video Understanding via Multi-Agent Reinforcement Learning

Imagine you are trying to solve the mystery in a very long, complicated movie. You have a question: "Why did the woman get sucked into the vacuum cleaner?"

The Old Way (Existing AI):
Most current AI models act like a single detective who is very smart but has a rigid rulebook. They decide, "I will watch the whole movie from start to finish," or "I will only look at the middle." They stick to this plan no matter what. If they miss a tiny clue in the first 5 minutes, they might never find the answer, even if they are super-intelligent. They are like a detective who refuses to ask for help or change their mind.

The New Way (VideoChat-M1):
The researchers behind VideoChat-M1 realized that solving complex video mysteries requires a team of detectives, not just one. They built a system where multiple AI agents work together, talk to each other, and constantly change their game plan.

Here is how it works, using a simple analogy:

1. The Team of Detectives (Multi-Agent System)

Instead of one AI, imagine a squad of four different detectives:

Detective A is good at spotting big picture clues.
Detective B is great at finding specific timestamps.
Detective C is an expert in spatial relationships (where things are).
Detective D is a master of reading between the lines.

2. The "Collaborative Policy Planning" (The Game Plan)

In the old days, the team leader would just say, "Go watch the whole movie," and that was it.
In VideoChat-M1, the process is dynamic:

Step 1: Make a Plan: Each detective writes down their own strategy. Detective A says, "I'll scan the whole video." Detective B says, "I'll look for the vacuum cleaner specifically."
Step 2: Execute & Talk: They start looking. After a few minutes, they stop and talk to each other.
- Detective B says: "Hey, I found the vacuum, but I missed the part where the elf pushed the button!"
- Detective A hears this and thinks: "Oh! I need to change my plan. I should go back and look for the elf, not just the vacuum."
Step 3: Adapt: They update their strategies in real time, based on what their teammates find, instead of following a fixed script. This is called Collaborative Policy Planning.

3. The "Coach" (Multi-Agent Reinforcement Learning)

How do they get better at this teamwork? They have a Coach (the Reinforcement Learning part).

If the team solves the mystery correctly, the Coach gives them a high-five (a Reward).
If they argue uselessly or miss the point, the Coach gives them a gentle correction.
Crucially, the Coach doesn't just reward the final answer. The Coach also rewards how well they worked together. If Detective A helped Detective B find a clue, they both get points. This teaches them to be better teammates over time.

Why is this a big deal?

It's Smarter: By having different agents look at different parts of the video and share their findings, they catch clues a single AI would miss.
It's Faster: They don't waste time watching the whole movie if they find the answer in the first 10 seconds. They know when to stop and when to dig deeper.
It's Efficient: The paper shows that this team of smaller AIs (totaling about 37 billion parameters, or "brain cells") beats massive, single AIs (like GPT-4o or Gemini) that are believed to be many times larger. It's like a well-coordinated soccer team beating a giant, slow robot.

The Result

When tested on hard video questions (like long movies, complex reasoning, or finding exactly when something happened), VideoChat-M1 scored higher than the best closed-source models in the world.

In short:
VideoChat-M1 is like replacing a lone genius who refuses to listen with a highly trained, talking, adaptable team that learns from its mistakes and works together to solve the puzzle. They don't just "watch" the video; they collaborate to understand it.

Here is a detailed technical summary of the paper "VideoChat-M1: Collaborative Policy Planning for Video Understanding via Multi-Agent Reinforcement Learning."

1. Problem Statement

Current video understanding frameworks, particularly those based on Multimodal Large Language Models (MLLMs), face significant limitations when processing long-duration videos or videos with complex spatial-temporal structures.

Static Policies: Existing agent-based frameworks typically rely on single, fixed, or non-learnable tool invocation policies. They adhere to pre-defined rules for selecting tools (e.g., retrieval, browsing) without adaptive learning.
Suboptimal Reasoning: These rigid policies fail to dynamically identify, track, and summarize diverse clues across different temporal scales, leading to poor performance in complex reasoning tasks like long-video QA, temporal grounding, and spatial intelligence.
Lack of Collaboration: Most systems do not effectively leverage multi-agent collaboration to refine strategies during the inference process.

2. Methodology: VideoChat-M1

The authors propose VideoChat-M1, a novel multi-agent system that replaces fixed policies with a Collaborative Policy Planning (CPP) paradigm, optimized via Multi-Agent Reinforcement Learning (MARL).

A. Collaborative Policy Planning (CPP) Pipeline

Instead of a single agent executing a static plan, VideoChat-M1 employs a group of agents ( $G = \{G_i\}$ ) that interact through a shared memory buffer ( $M$ ). The process consists of three iterative stages:

Policy Generation: Each agent autonomously generates an initial tool-invocation policy ( $P_i$ ) tailored to the user's query ( $Q$ ) and the video ( $V$ ). This decomposes the task into a sequence of tool calls (e.g., Global Sampling $\to$ Video Retrieval $\to$ Rough Browser).
Policy Execution: Agents sequentially execute their policies using a toolkit ( $T$ ) containing specialized tools (e.g., Global Sampling, Video Retrieval, Time Stamp Retrieval, Rough/Fine Browser, Spatial Tool, Grounding Tool).
Policy Communication: After each execution step, agents share intermediate results and reasoning states via the shared memory. Agents then evaluate whether to modify their current policy based on peers' insights. This allows for dynamic refinement (e.g., adding a "Video Retrieval" tool if the initial "Global Sampling" was insufficient).
Final Decision: For multiple-choice questions, the final answer is determined by majority voting. For open-ended or grounding tasks, the best-performing agent in the group synthesizes the final answer.
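The generate, execute, communicate loop and the majority vote can be sketched in a few lines of Python. This is a toy illustration, not the authors' implementation: the `Agent` class, the stub tools, the `useful` flag, and the revision rule (adopt a tool that helped a peer) are all assumptions made for the example, and real agents would be MLLMs that decide for themselves whether to revise.

```python
from collections import Counter
from dataclasses import dataclass


# Stub tools (hypothetical): each returns a note, a usefulness flag,
# and the candidate answer the evidence points to.
def global_sampling(query):
    return {"note": "sparse frames, no decisive clue", "useful": False, "answer": "A"}


def video_retrieval(query):
    return {"note": "retrieved clip shows the key event", "useful": True, "answer": "B"}


TOOLS = {"global_sampling": global_sampling, "video_retrieval": video_retrieval}


@dataclass
class Agent:
    name: str
    policy: list      # ordered tool names still to execute
    answer: str = ""  # current candidate answer

    def execute_next(self, query, memory):
        """Policy Execution: run the next planned tool and post the result to shared memory."""
        if not self.policy:
            return
        tool = self.policy.pop(0)
        result = TOOLS[tool](query)
        self.answer = result["answer"]
        memory.append({"agent": self.name, "tool": tool, **result})

    def maybe_revise(self, memory):
        """Policy Communication: adopt a tool that was useful for a peer (assumed rule)."""
        used = {m["tool"] for m in memory if m["agent"] == self.name}
        for m in memory:
            is_peer = m["agent"] != self.name
            is_new = m["tool"] not in used and m["tool"] not in self.policy
            if is_peer and m["useful"] and is_new:
                self.policy.append(m["tool"])


def run_cpp(agents, query, max_rounds=4):
    """Iterate execution and communication, then take a majority vote."""
    memory = []  # shared memory buffer M
    for _ in range(max_rounds):
        if not any(a.policy for a in agents):
            break
        for a in agents:
            a.execute_next(query, memory)
        for a in agents:
            a.maybe_revise(memory)
    votes = Counter(a.answer for a in agents if a.answer)
    return votes.most_common(1)[0][0], memory


if __name__ == "__main__":
    team = [
        Agent("a1", ["global_sampling"]),
        Agent("a2", ["video_retrieval"]),
        Agent("a3", ["global_sampling"]),
    ]
    final_answer, log = run_cpp(team, "Why did the woman get sucked into the vacuum cleaner?")
    print(final_answer, len(log))
```

In this toy run, the two agents that started with global sampling add video retrieval after seeing that it helped their peer, which mirrors the refinement example given above.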

B. Multi-Agent Reinforcement Learning (MARL)

To ensure robustness and optimize the collaborative process, the authors introduce a MARL framework using Group Relative Policy Optimization (GRPO).

Supervised Fine-Tuning (SFT): Agents are first fine-tuned on high-quality policy plans (generated by a strong team like GPT-4o + DeepSeek-R1) to learn basic planning capabilities.
Reward Design: The MARL phase optimizes agents using a hybrid reward system ( $R = R_{res} + R_{format} + R_{col}$ ):
- Result Reward ( $R_{res}$ ): Positive for correct final answers, negative for incorrect ones.
- Format Reward ( $R_{format}$ ): Penalizes syntactically invalid tool calls or unparseable outputs.
- Collaboration Reward ( $R_{col}$ ): Uses an LLM (GPT-4o) as a reward model to evaluate the intermediate collaboration process. It rewards coherent, efficient trajectories and penalizes redundant or incoherent planning.
Optimization: GRPO is used to maximize the advantage of an agent's output relative to the group's average performance, encouraging agents to learn diverse yet cooperative strategies.
Regularization: Agent Dropout is applied during training (randomly sampling communication topologies) to prevent over-specialization and co-adaptation among agents.
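To make the training signal concrete, here is a minimal sketch of the hybrid reward, a group-relative advantage, and agent dropout. The reward magnitudes, the `collab_score` range, the `keep_prob` value, and all function names are assumptions for illustration; the summary above does not give the paper's actual values. The advantage uses the mean-and-standard-deviation normalization commonly associated with GRPO, and the dropout function simplifies "sampling communication topologies" to sampling which agents take part in a rollout.

```python
import random
from statistics import mean, pstdev


def hybrid_reward(correct, well_formed, collab_score):
    """R = R_res + R_format + R_col, with illustrative (assumed) magnitudes."""
    r_res = 1.0 if correct else -1.0         # result reward
    r_format = 0.0 if well_formed else -0.5  # penalty for unparseable output
    r_col = collab_score                     # judge score, assumed to lie in [0, 1]
    return r_res + r_format + r_col


def group_relative_advantages(rewards, eps=1e-8):
    """Score each output relative to the group's mean, scaled by the group's spread."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def sample_active_agents(agent_names, keep_prob=0.75, rng=None):
    """Agent dropout (simplified): randomly keep a non-empty subset of agents."""
    rng = rng or random.Random()
    kept = [name for name in agent_names if rng.random() < keep_prob]
    return kept or [rng.choice(agent_names)]


if __name__ == "__main__":
    rewards = [
        hybrid_reward(correct=True, well_formed=True, collab_score=0.8),
        hybrid_reward(correct=False, well_formed=True, collab_score=0.2),
        hybrid_reward(correct=True, well_formed=False, collab_score=0.5),
    ]
    print(rewards)
    print(group_relative_advantages(rewards))
    print(sample_active_agents(["a1", "a2", "a3", "a4"], rng=random.Random(0)))
```

Because each advantage is measured against the group mean, an output is reinforced only when it beats its peers on the combined reward, not merely when it is correct.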

3. Key Contributions

First Multi-Agent Policy Learning Framework for Video: VideoChat-M1 is the first system to replace static tool policies with a dynamic, learnable Collaborative Policy Planning (CPP) paradigm, enabling agents to adapt strategies in real-time.
Pioneering MARL for Video Agents: It introduces a novel MARL approach specifically for video understanding, utilizing a hybrid reward system that evaluates both final accuracy and the quality of intermediate multi-agent collaboration.
State-of-the-Art Performance with Efficiency: The framework achieves SOTA results across eight benchmarks while maintaining high parameter efficiency (e.g., a 37B parameter group outperforming much larger models).

4. Experimental Results

The model was evaluated on 8 benchmarks covering Long Video QA, Video Reasoning, Spatial Intelligence, and Temporal Grounding.

LongVideoBench: VideoChat-M1 achieves 82.3%, outperforming Gemini 2.5 Pro (78.7%) by 3.6 percentage points and GPT-4o (66.7%) by 15.6 percentage points.
Video-MME: Achieves 83.2%, surpassing GPT-4o (71.9%) and Gemini 1.5 Pro (75.0%).
Video-Holmes & MMR-V: Outperforms the best baselines by 14.8% and 14.3% respectively.
Spatial Intelligence (VSIBench): Exceeds Gemini 1.5 Pro by 26.5%.
Temporal Grounding (Charades-STA): Achieves a 3.0% improvement over Seed 1.5VL.
Efficiency: VideoChat-M1 uses an average of only 69.9 frames per video (12–18% of what other models use) and has an inference latency of 19.8s (8–21% of the baselines' latency), demonstrating a superior efficiency-performance trade-off.
Parameter Efficiency: The 37B agent group delivers performance comparable to the Qwen3-VL-235B (using only ~15% of the parameters) and Gemini 2.5 Pro.
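The margins and the parameter ratio quoted above can be checked with simple arithmetic. The short script below uses only the figures stated in this summary; the helper name `margin_points` is just for illustration.

```python
def margin_points(ours, baseline):
    """Absolute gap between two accuracy percentages, in percentage points."""
    return round(ours - baseline, 1)


# LongVideoBench accuracies as quoted above
long_video_bench = {"VideoChat-M1": 82.3, "Gemini 2.5 Pro": 78.7, "GPT-4o": 66.7}
# Video-MME accuracies as quoted above
video_mme = {"VideoChat-M1": 83.2, "GPT-4o": 71.9, "Gemini 1.5 Pro": 75.0}

lvb_vs_gemini = margin_points(long_video_bench["VideoChat-M1"], long_video_bench["Gemini 2.5 Pro"])
lvb_vs_gpt4o = margin_points(long_video_bench["VideoChat-M1"], long_video_bench["GPT-4o"])
mme_vs_gpt4o = margin_points(video_mme["VideoChat-M1"], video_mme["GPT-4o"])
mme_vs_gemini = margin_points(video_mme["VideoChat-M1"], video_mme["Gemini 1.5 Pro"])

# Share of Qwen3-VL-235B's parameters used by the 37B agent group, in percent
param_share = round(100 * 37 / 235, 1)

if __name__ == "__main__":
    print(lvb_vs_gemini, lvb_vs_gpt4o)   # LongVideoBench margins
    print(mme_vs_gpt4o, mme_vs_gemini)   # Video-MME margins
    print(param_share)                   # roughly 15% of the parameters
```

The LongVideoBench gaps are absolute differences in accuracy (percentage points), and 37B out of 235B is consistent with the "~15%" figure.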

5. Significance

Paradigm Shift: The paper moves video understanding from "static tool invocation" to "dynamic, collaborative policy planning," proving that agents can learn to adapt their reasoning strategies based on peer feedback.
Scalability: It demonstrates that a relatively small, heterogeneous group of agents (37B total parameters) can outperform massive, closed-source models (200B+ parameters) through effective collaboration and reinforcement learning.
Generalizability: The CPP and MARL framework is applicable to diverse video tasks (reasoning, grounding, spatial analysis), suggesting a robust path forward for complex multi-modal AI systems.
Training Stability: The integration of SFT warm-starts, GRPO, and agent dropout provides a principled solution to the instability often found in multi-agent reinforcement learning.

In conclusion, VideoChat-M1 sets a new state of the art for video understanding by combining multi-agent collaboration with reinforcement learning to dynamically solve complex, long-form video reasoning tasks.
