Solaris: Building a Multiplayer Video World Model in Minecraft

The paper introduces Solaris, a multiplayer video world model for Minecraft. It combines an automated data collection system with a staged training pipeline so that, unlike single-agent models, it can simulate consistent multi-view observations and multi-agent interactions.

Georgy Savva, Oscar Michel, Daohan Lu, Suppakit Waiwitlikhit, Timothy Meehan, Dhairya Mishra, Srivats Poddar, Jack Lu, Saining Xie

Published 2026-02-27

Imagine you are watching a movie, but instead of just seeing the main character, you are suddenly seeing the world through the eyes of two different people at the same time.

If Player A jumps over a fence, Player B (who is standing nearby) must see that jump happen from their own angle. If Player A builds a wall, Player B must see that wall appear, even if they are looking at it from the side.

This is the challenge that the paper "Solaris" tackles. It's about teaching an AI to understand and simulate a video game world (specifically Minecraft) not just for one person, but for a whole group of friends playing together.

Here is the story of how the authors did it, broken down into simple concepts:

1. The Problem: The "Lonely" AI

Most current AI video generators are like solitary actors. They can imagine what happens next if one person moves, but they get confused when two people interact.

  • The Old Way: If you asked an AI to show two people playing, it might make Player A build a house, but then show Player B looking at an empty field because the AI forgot they were in the same world.
  • The Goal: Solaris aims to act like a director who tracks exactly what each player sees and does, keeping the story consistent for all of them.

2. The Solution: Building a "Robot Playground" (SolarisEngine)

To teach the AI, you need data. But you can't just ask humans to play together until they have produced millions of frames of footage; that takes too long.

  • The Analogy: Imagine you want to teach a child how to play soccer. Instead of waiting for kids to show up, you build a robot soccer league. You program robots to pass the ball, run, and score goals automatically.
  • What they did: The team built a system called SolarisEngine. It's a digital factory that runs thousands of "bot" (robot) players in Minecraft. These bots play together, mining, building, and fighting. The system records everything: what the bots see, what they do, and how the world changes.
  • The Result: They collected 12.64 million frames of video. That's like watching 1,500 hours of continuous, coordinated multiplayer gameplay.
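The recording idea described above can be pictured with a small toy sketch. Everything here is hypothetical and for illustration only: the names `PlayerStep`, `Episode`, `render`, and `collect_episode`, the action list, and the dictionary "frames" are assumptions, not SolarisEngine's actual interface. The point it shows is that all bots act on one shared world each tick, and every bot's view and action are logged together for that tick.

```python
import random
from dataclasses import dataclass, field

# Hypothetical action vocabulary; the real system's action space is not given here.
ACTIONS = ["forward", "jump", "mine", "place_block", "attack"]


@dataclass
class PlayerStep:
    """What one bot saw and did on a single tick (all fields are illustrative)."""
    player_id: str
    frame: dict   # stand-in for a rendered image
    action: str


@dataclass
class Episode:
    """One list of PlayerStep records per tick, one record per bot."""
    ticks: list = field(default_factory=list)


def render(world, player_id, tick):
    """Toy 'camera': every bot observes the same shared world state."""
    return {"tick": tick, "viewer": player_id, "blocks_placed": world["blocks_placed"]}


def collect_episode(player_ids, num_ticks, seed=0):
    rng = random.Random(seed)
    world = {"blocks_placed": 0}   # single world shared by all bots
    episode = Episode()
    for tick in range(num_ticks):
        # 1. Every bot picks an action for this tick.
        actions = {pid: rng.choice(ACTIONS) for pid in player_ids}
        # 2. All actions are applied to the one shared world.
        for action in actions.values():
            if action == "place_block":
                world["blocks_placed"] += 1
        # 3. Each bot's view is rendered from the updated world and logged
        #    next to the action that bot took.
        episode.ticks.append(
            [PlayerStep(pid, render(world, pid, tick), actions[pid]) for pid in player_ids]
        )
    return episode


if __name__ == "__main__":
    ep = collect_episode(["bot_a", "bot_b"], num_ticks=5)
    for tick_records in ep.ticks:
        print([(s.player_id, s.action, s.frame["blocks_placed"]) for s in tick_records])
```

Because both records for a tick come from the same world state, the two views can never disagree about what has been built, which is the property the multiplayer training data needs.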

3. The Brain: The "Solaris" Model

Now they had the data, but they needed a brain to learn from it.

  • The Analogy: Think of the AI model as a student.
    • Step 1 (The Solo Student): First, they taught the student how to play Minecraft alone. This gave the student a basic understanding of gravity, blocks, and how to move.
    • Step 2 (The Team Student): Next, they showed the student the multiplayer footage. Now the student had to learn: "If I move left, my friend sees me moving right."
    • The Secret Sauce (Checkpointed Self-Forcing): This is the paper's biggest technical breakthrough.
      • The Problem: When an AI tries to predict a long movie (say, 5 minutes long) frame by frame, it gets "forgetful" and eventually runs out of computer memory, like a student trying to hold an entire book in their head at once.
      • The Fix: They invented a technique called Checkpointed Self-Forcing. Imagine the student is writing a long essay. Instead of trying to hold the entire essay in their head while writing the next sentence, they write a sentence, save a quick note (checkpoint) of what they just wrote, and then erase it from their short-term memory to make room for the next one. Later, they use those notes to check their work. This allows the AI to generate long, consistent videos without crashing.
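The "save a note, free the memory, redo the work later" idea in the essay analogy can be illustrated with a generic checkpoint-and-recompute sketch. This is not the paper's training procedure: the toy `step` function, the integer "states", and the helper names are all assumptions made for illustration. It only shows the general trade of keeping a few saved states and recomputing the ones in between on demand.

```python
def step(state, action):
    """Toy deterministic update; stands in for one frame prediction."""
    return (state * 31 + action) % 1_000_003


def rollout_full(initial, actions):
    """Baseline: keep every intermediate state in memory."""
    states = [initial]
    for action in actions:
        states.append(step(states[-1], action))
    return states


def rollout_checkpointed(initial, actions, every):
    """Keep only every `every`-th state (plus the start); discard the rest."""
    checkpoints = {0: initial}
    state = initial
    for t, action in enumerate(actions, start=1):
        state = step(state, action)
        if t % every == 0:
            checkpoints[t] = state
    return checkpoints


def state_at(t, checkpoints, actions, every):
    """Rebuild the state at time t by replaying from the nearest earlier checkpoint."""
    base = (t // every) * every
    state = checkpoints[base]
    for action in actions[base:t]:
        state = step(state, action)
    return state


if __name__ == "__main__":
    actions = list(range(100))
    full = rollout_full(7, actions)
    saved = rollout_checkpointed(7, actions, every=10)
    print("states kept:", len(full), "vs", len(saved))
    print("recomputed state matches:", state_at(57, saved, actions, 10) == full[57])
```

The cost of this design is extra computation, since discarded states are recomputed when needed, in exchange for holding far fewer of them in memory at once.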

4. The Test: Can the AI Pass the "Friendship Exam"?

To see if Solaris actually works, they created a series of tests that act like a friendship exam:

  • The "Where's Waldo?" Test (Memory): One player turns around and looks away. Does the other player still "know" where the first player is? If the first player turns back, does the second player see them?
  • The "Construction Site" Test (Building): One player builds a tower. Does the other player see the tower grow, even if they are looking at it from a different angle?
  • The "Weather" Test (Consistency): If it starts raining for one player, does it start raining for the other player at the exact same time?
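A check in the spirit of the "Weather" test could look roughly like the sketch below. It assumes some detector has already labeled each generated frame with whether the event (for example, rain) is visible; that detector, the function names, and the `tolerance` parameter are hypothetical and are not the paper's evaluation metric.

```python
def first_onset(event_flags):
    """Index of the first frame where the event is visible, or None if it never is."""
    for index, flag in enumerate(event_flags):
        if flag:
            return index
    return None


def onsets_agree(view_a, view_b, tolerance=0):
    """True if both players first see the event within `tolerance` frames of each other."""
    a, b = first_onset(view_a), first_onset(view_b)
    if a is None or b is None:
        # Consistent only if neither view ever shows the event.
        return a is None and b is None
    return abs(a - b) <= tolerance


if __name__ == "__main__":
    player_a = [False, False, True, True, True]   # rain starts at frame 2
    player_b = [False, False, True, True, True]   # same onset: consistent
    player_c = [False, False, False, False, True] # rain starts two frames late
    print(onsets_agree(player_a, player_b))
    print(onsets_agree(player_a, player_c))
    print(onsets_agree(player_a, player_c, tolerance=2))
```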

The Result: Solaris passed these tests better than any previous model. It could generate videos where two players interacted realistically, with consistent lighting, physics, and shared memories of the world.

Why Does This Matter?

Think of this as the foundation for the future of digital worlds.

  • Right now, AI video is like a solitary dream.
  • Solaris is the first step toward a shared dream.

This technology isn't just about making cool Minecraft videos. It's a stepping stone toward:

  • Better Video Games: NPCs (non-player characters) that actually understand you and your friends.
  • Robotics: Teaching robots how to work together in a real factory or kitchen, not just as individuals.
  • Virtual Reality: Creating worlds where you and your friends can hang out, and the world reacts perfectly to everyone's actions simultaneously.

In short, Solaris taught an AI to stop being a lone wolf and start thinking like a team player, opening the door to a new generation of shared, interactive digital worlds.
