ShareVerse: Multi-Agent Consistent Video Generation for Shared World Modeling

ShareVerse is a multi-agent video generation framework that enables consistent shared world modeling by leveraging a large-scale CARLA dataset, a spatial concatenation strategy for multi-view coherence, and cross-agent attention mechanisms to ensure geometric and interactive consistency across agents.

Jiayi Zhu, Jianing Zhang, Yiying Yang, Wei Cheng, Xiaoyun Yuan

Published 2026-03-04

Imagine you and your friend are playing a video game together. You are both driving cars in the same virtual city. Usually, in current AI video generators, if you ask the AI to show what you see, it might show a slightly different version of the city than what it shows your friend. Maybe your friend sees a red car where you see a blue one, or the buildings are in different places. It's like two people describing the same dream, but their stories don't match up.

ShareVerse is a new AI system designed to fix this. It creates a "Shared World" where two different agents (like your car and your friend's car) can explore the same environment, and the AI guarantees that what you see matches exactly what your friend sees, even if you are looking in different directions.

Here is how it works, broken down into simple concepts:

1. The Training Ground: A Digital Sandbox

To teach the AI how to do this, the researchers couldn't just film real cars (that's too expensive and hard to sync perfectly). Instead, they built a giant digital sandbox using a simulator called CARLA.

  • The Analogy: Imagine a toy train set where you can control two trains. The researchers programmed these trains to drive around, meet at intersections, and pass each other.
  • The Data: They equipped each "train" (or agent) with four cameras: one looking forward, one backward, one left, and one right. They recorded thousands of hours of these two agents driving together in different weather (rain, sun, night) and different cities. This created a massive library of "paired" videos where the AI could learn: "When Agent A sees a tree on the left, Agent B must see that same tree on their right."
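The paired-recording idea above can be sketched as a simple data structure: one sample holds two agents, each with four frame-synced camera streams. The field names, tensor shapes, and 10 FPS rate below are illustrative assumptions (and the frames are random noise standing in for simulator renders), not the dataset's actual schema:

```python
import numpy as np

CAMERAS = ["front", "back", "left", "right"]

def make_paired_sample(num_frames=8, height=64, width=64, seed=0):
    """Build one synthetic 'paired' sample: two agents, four synchronized
    camera streams each, recorded in the same simulated scene.
    Real data would come from CARLA renders, not random noise."""
    rng = np.random.default_rng(seed)
    sample = {}
    for agent in ("agent_a", "agent_b"):
        sample[agent] = {
            cam: rng.random((num_frames, height, width, 3), dtype=np.float32)
            for cam in CAMERAS
        }
    # A single shared timestamp track keeps the two agents frame-synced.
    sample["timestamps"] = np.arange(num_frames) / 10.0  # assumed 10 FPS
    return sample

sample = make_paired_sample()
```

The key property the real dataset provides, and this sketch mimics, is that every frame index refers to the same instant in the same world for both agents, which is what lets the model learn "A's tree on the left is B's tree on the right."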

2. The "360-Degree" Trick

Most video AIs only look at one camera feed (like a single phone camera). But to understand a shared world, you need to know what's happening all around you.

  • The Analogy: Imagine trying to understand a conversation in a room by only listening to one person's voice. It's confusing. ShareVerse takes the four camera feeds (Front, Back, Left, Right) of one agent and stitches them together into one giant, panoramic video strip.
  • Why it helps: This forces the AI to understand the geometry of the whole car's surroundings at once. It ensures that if the car turns left, the "Back" camera view changes logically to match the "Front" view. It keeps the internal world consistent.
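The stitching step itself is conceptually just a concatenation along the width axis. A minimal sketch, assuming video tensors shaped (frames, height, width, channels); the particular camera ordering used here is an assumption, not the paper's stated layout:

```python
import numpy as np

def spatial_concat(views):
    """Stitch four per-camera video tensors (T, H, W, C) side by side
    along the width axis into one panoramic strip (T, H, 4*W, C).
    The left-front-right-back ordering is an illustrative choice."""
    order = ["left", "front", "right", "back"]
    return np.concatenate([views[cam] for cam in order], axis=2)

# Four dummy 8-frame, 64x96 RGB streams standing in for camera feeds.
views = {cam: np.zeros((8, 64, 96, 3), dtype=np.float32)
         for cam in ["front", "back", "left", "right"]}
panorama = spatial_concat(views)
# panorama.shape == (8, 64, 384, 3): one wide strip per frame
```

Because the model then sees all four views as a single image per timestep, ordinary attention within each frame is enough to tie the views together; no extra cross-view machinery is needed for one agent's own cameras.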

3. The "Telepathic" Connection

This is the most magical part. How does Agent A know what Agent B is doing so they don't crash into each other in the simulation?

  • The Analogy: Imagine two people wearing blindfolds, but they are holding hands. If one person moves their hand, the other feels it immediately. ShareVerse adds a special "Cross-Agent Attention" module. This acts like a telepathic link between the two agents.
  • How it works: When the AI generates a video for Agent A, it doesn't just look at Agent A's camera. It "whispers" the information from Agent B's camera into the process. If Agent B is driving a red truck into the intersection, Agent A's video will automatically generate that red truck appearing in front of them, even if Agent A hasn't turned their camera to look at it yet. They are building the same reality together.
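The "telepathic link" can be sketched as an attention layer in which one agent's video tokens query the other's. The layer sizes and the plain `nn.MultiheadAttention` wrapper below are illustrative assumptions, not the paper's exact module:

```python
import torch
import torch.nn as nn

class CrossAgentAttention(nn.Module):
    """Minimal sketch of cross-agent attention: tokens from agent A
    attend to tokens from agent B, so information about B's scene
    flows into A's generation stream (and vice versa when the roles
    are swapped). Dimensions here are illustrative."""
    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, tokens_a, tokens_b):
        # Query: agent A's tokens; Key/Value: agent B's tokens.
        mixed, _ = self.attn(tokens_a, tokens_b, tokens_b)
        # Residual connection keeps A's own view as the backbone.
        return self.norm(tokens_a + mixed)

a = torch.randn(2, 16, 64)  # (batch, tokens, dim) for agent A
b = torch.randn(2, 16, 64)  # agent B's tokens from the same scene
out = CrossAgentAttention()(a, b)
```

The design point is the residual: agent A's video stays grounded in A's own cameras, while the attention term injects whatever B's cameras reveal (the red truck entering the intersection) before A's frames are finalized.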

4. The Result: A Consistent Reality

When you run ShareVerse, you get a 49-second video (quite long by current AI-video standards) in which:

  • You see the world from your car's perspective.
  • Your friend sees the world from their car's perspective.
  • Crucially: if you both look at the same building, it looks the same. If you both see a pedestrian crossing the street, they are in the exact same spot in both videos.

Why Does This Matter?

Right now, AI video generators are great at making cool, single-person movies. But for the future of robots, self-driving cars, or multiplayer games, we need systems where multiple entities can exist in the same space without the world "glitching" or changing shape depending on who is looking.

ShareVerse is the first step toward an AI that can simulate a shared reality, where multiple intelligent agents can drive, fly, or walk together in a world that feels real and consistent for everyone involved. It's like moving from a world where everyone is dreaming alone, to a world where everyone is awake in the same dream.