ShareVerse: Multi-Agent Consistent Video Generation for Shared World Modeling

ShareVerse is a multi-agent video generation framework that enables consistent shared world modeling by leveraging a large-scale CARLA dataset, a spatial concatenation strategy for multi-view coherence, and cross-agent attention mechanisms to ensure geometric and interactive consistency across agents.

Jiayi Zhu, Jianing Zhang, Yiying Yang, Wei Cheng, Xiaoyun Yuan

Published 2026-03-04

Imagine you and your friend are playing a video game together. You are both driving cars in the same virtual city. Usually, in current AI video generators, if you ask the AI to show what you see, it might show a slightly different version of the city than what it shows your friend. Maybe your friend sees a red car where you see a blue one, or the buildings are in different places. It's like two people describing the same dream, but their stories don't match up.

ShareVerse is a new AI system designed to fix this. It creates a "Shared World" where two different agents (like your car and your friend's car) can explore the same environment, and the AI guarantees that what you see matches exactly what your friend sees, even if you are looking in different directions.

Here is how it works, broken down into simple concepts:

1. The Training Ground: A Digital Sandbox

To teach the AI how to do this, the researchers couldn't just film real cars (that's too expensive and hard to sync perfectly). Instead, they built a giant digital sandbox using a simulator called CARLA.

  • The Analogy: Imagine a toy train set where you can control two trains. The researchers programmed these trains to drive around, meet at intersections, and pass each other.
  • The Data: They equipped each "train" (or agent) with four cameras: one looking forward, one backward, one left, and one right. They recorded thousands of hours of these two agents driving together in different weather (rain, sun, night) and different cities. This created a massive library of "paired" videos where the AI could learn: "When Agent A sees a tree on the left, Agent B must see that same tree on their right."
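The paired-recording idea above can be sketched as a simple data structure: one sample holds two agents, each with four frame-synced camera streams. The field names, tensor shapes, and 10 FPS rate below are illustrative assumptions (and the frames are random noise standing in for simulator renders), not the dataset's actual schema:

```python
import numpy as np

CAMERAS = ["front", "back", "left", "right"]

def make_paired_sample(num_frames=8, height=64, width=64, seed=0):
    """Build one synthetic 'paired' sample: two agents, four synchronized
    camera streams each, recorded in the same simulated scene.
    Real data would come from CARLA renders, not random noise."""
    rng = np.random.default_rng(seed)
    sample = {}
    for agent in ("agent_a", "agent_b"):
        sample[agent] = {
            cam: rng.random((num_frames, height, width, 3), dtype=np.float32)
            for cam in CAMERAS
        }
    # A single shared timestamp track keeps the two agents frame-synced.
    sample["timestamps"] = np.arange(num_frames) / 10.0  # assumed 10 FPS
    return sample

sample = make_paired_sample()
```

The key property the real dataset provides, and this sketch mimics, is that every frame index refers to the same instant in the same world for both agents, which is what lets the model learn "A's tree on the left is B's tree on the right."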

2. The "360-Degree" Trick

Most video AIs only look at one camera feed (like a single phone camera). But to understand a shared world, you need to know what's happening all around you.

  • The Analogy: Imagine trying to understand a conversation in a room by only listening to one person's voice. It's confusing. ShareVerse takes the four camera feeds (Front, Back, Left, Right) of one agent and stitches them together into one giant, panoramic video strip.
  • Why it helps: This forces the AI to understand the geometry of the whole car's surroundings at once. It ensures that if the car turns left, the "Back" camera view changes logically to match the "Front" view. It keeps the internal world consistent.
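The stitching step itself is conceptually just a concatenation along the width axis. A minimal sketch, assuming video tensors shaped (frames, height, width, channels); the particular camera ordering used here is an assumption, not the paper's stated layout:

```python
import numpy as np

def spatial_concat(views):
    """Stitch four per-camera video tensors (T, H, W, C) side by side
    along the width axis into one panoramic strip (T, H, 4*W, C).
    The left-front-right-back ordering is an illustrative choice."""
    order = ["left", "front", "right", "back"]
    return np.concatenate([views[cam] for cam in order], axis=2)

# Four dummy 8-frame, 64x96 RGB streams standing in for camera feeds.
views = {cam: np.zeros((8, 64, 96, 3), dtype=np.float32)
         for cam in ["front", "back", "left", "right"]}
panorama = spatial_concat(views)
# panorama.shape == (8, 64, 384, 3): one wide strip per frame
```

Because the model then sees all four views as a single image per timestep, ordinary attention within each frame is enough to tie the views together; no extra cross-view machinery is needed for one agent's own cameras.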

3. The "Telepathic" Connection

This is the most magical part. How does Agent A know what Agent B is doing so they don't crash into each other in the simulation?

  • The Analogy: Imagine two people wearing blindfolds, but they are holding hands. If one person moves their hand, the other feels it immediately. ShareVerse adds a special "Cross-Agent Attention" module. This acts like a telepathic link between the two agents.
  • How it works: When the AI generates a video for Agent A, it doesn't just look at Agent A's camera. It "whispers" the information from Agent B's camera into the process. If Agent B is driving a red truck into the intersection, Agent A's video will automatically generate that red truck appearing in front of them, even if Agent A hasn't turned their camera to look at it yet. They are building the same reality together.
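The "telepathic link" can be sketched as an attention layer in which one agent's video tokens query the other's. The layer sizes and the plain `nn.MultiheadAttention` wrapper below are illustrative assumptions, not the paper's exact module:

```python
import torch
import torch.nn as nn

class CrossAgentAttention(nn.Module):
    """Minimal sketch of cross-agent attention: tokens from agent A
    attend to tokens from agent B, so information about B's scene
    flows into A's generation stream (and vice versa when the roles
    are swapped). Dimensions here are illustrative."""
    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, tokens_a, tokens_b):
        # Query: agent A's tokens; Key/Value: agent B's tokens.
        mixed, _ = self.attn(tokens_a, tokens_b, tokens_b)
        # Residual connection keeps A's own view as the backbone.
        return self.norm(tokens_a + mixed)

a = torch.randn(2, 16, 64)  # (batch, tokens, dim) for agent A
b = torch.randn(2, 16, 64)  # agent B's tokens from the same scene
out = CrossAgentAttention()(a, b)
```

The design point is the residual: agent A's video stays grounded in A's own cameras, while the attention term injects whatever B's cameras reveal (the red truck entering the intersection) before A's frames are finalized.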

4. The Result: A Consistent Reality

When you run ShareVerse, you get a 49-second video (quite long by current AI-video standards) in which:

  • You see the world from your car's perspective.
  • Your friend sees the world from their car's perspective.
  • Crucially: if you both look at the same building, it looks the same. If you both see a pedestrian crossing the street, they are in the exact same spot in both videos.

Why Does This Matter?

Right now, AI video generators are great at making cool, single-person movies. But for the future of robots, self-driving cars, or multiplayer games, we need systems where multiple entities can exist in the same space without the world "glitching" or changing shape depending on who is looking.

ShareVerse is the first step toward an AI that can simulate a shared reality, where multiple intelligent agents can drive, fly, or walk together in a world that feels real and consistent for everyone involved. It's like moving from a world where everyone is dreaming alone, to a world where everyone is awake in the same dream.