World Simulation with Video Foundation Models for Physical AI

The paper introduces Cosmos-Predict2.5 and Cosmos-Transfer2.5, a unified family of world foundation models that leverage flow-based architectures and reinforcement learning to deliver high-fidelity, instruction-aligned video generation and world translation for advancing physical AI, robotics, and embodied intelligence.

NVIDIA: Arslan Ali, Junjie Bai, Maciej Bala, Yogesh Balaji, Aaron Blakeman, Tiffany Cai, Jiaxin Cao, Tianshi Cao, Elizabeth Cha, Yu-Wei Chao, Prithvijit Chattopadhyay, Mike Chen, Yongxin Chen, Yu Chen, Shuai Cheng, Yin Cui, Jenna Diamond, Yifan Ding, Jiaojiao Fan, Linxi Fan, Liang Feng, Francesco Ferroni, Sanja Fidler, Xiao Fu, Ruiyuan Gao, Yunhao Ge, Jinwei Gu, Aryaman Gupta, Siddharth Gururani, Imad El Hanafi, Ali Hassani, Zekun Hao, Jacob Huffman, Joel Jang, Pooya Jannaty, Jan Kautz, Grace Lam, Xuan Li, Zhaoshuo Li, Maosheng Liao, Chen-Hsuan Lin, Tsung-Yi Lin, Yen-Chen Lin, Huan Ling, Ming-Yu Liu, Xian Liu, Yifan Lu, Alice Luo, Qianli Ma, Hanzi Mao, Kaichun Mo, Seungjun Nah, Yashraj Narang, Abhijeet Panaskar, Lindsey Pavao, Trung Pham, Morteza Ramezanali, Fitsum Reda, Scott Reed, Xuanchi Ren, Haonan Shao, Yue Shen, Stella Shi, Shuran Song, Bartosz Stefaniak, Shangkun Sun, Shitao Tang, Sameena Tasmeen, Lyne Tchapmi, Wei-Cheng Tseng, Jibin Varghese, Andrew Z. Wang, Hao Wang, Haoxiang Wang, Heng Wang, Ting-Chun Wang, Fangyin Wei, Jiashu Xu, Dinghao Yang, Xiaodong Yang, Haotian Ye, Seonghyeon Ye, Xiaohui Zeng, Jing Zhang, Qinsheng Zhang, Kaiwen Zheng, Andrew Zhu, Yuke Zhu

Published 2026-02-26

Imagine you are trying to teach a robot how to make a sandwich, drive a car, or walk through a busy city. If you let the robot learn by actually doing these things in the real world, it might break the toaster, crash the car, or trip over its own feet. It's expensive, dangerous, and painfully slow.

This paper introduces NVIDIA's Cosmos-Predict2.5, a new kind of "digital twin" for the physical world. Think of it not just as a video generator, but as a hyper-realistic video game engine that runs on pure imagination.

Here is the breakdown of how it works and why it matters, using simple analogies:

1. The Core Idea: The "Dream Machine"

Previous AI models were like a student who had only read a textbook about driving but never sat in a car. They could talk about traffic lights but couldn't predict what would happen if a ball rolled into the street.

Cosmos-Predict2.5 is like a student who has watched 200 million hours of real-world video. It has seen every type of weather, every kind of robot arm movement, and every car crash. It has learned the "physics" of how the world works. Now, you can ask it: "Show me what happens if a robot tries to pick up a slippery apple in a rainy kitchen," and it will generate a video of that exact scenario, complete with realistic reflections, splashes, and gravity.

2. The Secret Sauce: How It Learned

The team didn't just dump random videos into the computer. They built a giant, automated filter factory:

  • The Filter: Imagine a sieve that catches only the best, most interesting clips. They threw out blurry videos, cartoons, and things that looked fake. They kept only the "good stuff" (like real robots moving real objects).
  • The Teacher (Cosmos-Reason1): They added a smart "teacher" AI that reads the text prompts. If you say "The robot drops the cup," the teacher makes sure the AI understands exactly what "dropping" looks like, rather than just making the cup disappear.
  • The Practice (Reinforcement Learning): After the AI generated a video, they had it "grade" itself against human preferences. If the video looked weird or the physics were wrong, the AI got a "bad grade" and had to try again. This is like a student taking practice tests until they ace the final exam.
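The three stages above — filter, caption, grade-and-retry — can be sketched as a toy pipeline. Every function name and number here (`quality_filter`, `preference_reward`, the 0.7 threshold) is an illustrative assumption, not NVIDIA's actual code or API:

```python
# Toy stand-ins for the three curation/training stages described above.
# Everything here is illustrative; none of it comes from the paper's code.

def quality_filter(clips, threshold=0.7):
    """Stage 1 (the Filter): keep only clips whose quality score clears the bar."""
    return [c for c in clips if c["quality"] >= threshold]

def caption(clip):
    """Stage 2 (the Teacher): attach a precise text description so the
    model learns what the described action actually looks like."""
    return {**clip, "caption": f"robot arm performs action #{clip['id']}"}

def preference_reward(video):
    """Stage 3 (the Practice): a reward model grades a generated video
    against human preferences (mocked here as a fixed score per clip)."""
    return video["quality"]  # stand-in: pretend quality predicts preference

def rl_finetune_step(skill, video, lr=0.1):
    """Nudge a scalar 'model skill' toward videos that grade well,
    mimicking reward-weighted fine-tuning."""
    return skill + lr * (preference_reward(video) - skill)

# Run the toy pipeline end to end on six mock clips.
raw = [{"id": i, "quality": q} for i, q in
       enumerate([0.95, 0.30, 0.80, 0.55, 0.72, 0.10])]
curated = [caption(c) for c in quality_filter(raw)]
skill = 0.0
for clip in curated:
    skill = rl_finetune_step(skill, clip)

print(f"{len(raw)} raw clips -> {len(curated)} curated; skill={skill:.2f}")
```

The point of the sketch is the shape of the loop: curation shrinks the data, captioning grounds it in language, and the reward signal only ever moves the model toward videos that grade well.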

3. The Two Main Tools

The paper introduces two main tools in this new family:

A. Cosmos-Predict2.5 (The Simulator)

  • What it does: It creates new worlds from scratch based on your instructions.
  • The Analogy: It's like a director who can film any movie you can imagine, but the movie is a simulation of reality. You can say, "Film a robot walking on Mars," and it generates the video.
  • Why it's better: It's faster, sharper, and follows instructions much better than the previous version. It comes in versions with 2 billion or 14 billion "brain cells" (parameters), making it incredibly smart.

B. Cosmos-Transfer2.5 (The Translator)

  • What it does: It takes a rough sketch or a simple video and turns it into a photorealistic masterpiece.
  • The Analogy: Imagine you have a crayon drawing of a street. This tool acts like a magical artist who takes that crayon drawing and paints a hyper-realistic, 4K video of that street, adding cars, people, and shadows, all while keeping the exact layout you drew.
  • The Magic: It's 3.5 times smaller than the old version but produces higher quality. It can even make long videos without the picture getting blurry or the scene drifting off course over time (a problem called "error accumulation").
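The structure-preserving idea behind the "translator" can be illustrated with a toy NumPy renderer: a control input (the "crayon drawing", e.g. an edge or layout map) pins down where things are, while generated detail fills in appearance. This is a minimal sketch of the general control-conditioned generation idea, not Cosmos-Transfer's actual architecture; `toy_translate` and `control_weight` are made-up names.

```python
import numpy as np

def toy_translate(layout, style_seed=0, control_weight=0.8):
    """Toy structure-preserving translation: `layout` (values in [0, 1])
    fixes the geometry; pseudo-random 'texture' stands in for the
    photorealistic detail a real model would synthesize."""
    rng = np.random.default_rng(style_seed)
    texture = rng.random(layout.shape)  # stand-in for generated appearance
    # Blend: a high control_weight keeps the drawn layout dominant.
    return control_weight * layout + (1 - control_weight) * texture

# A crude 'crayon drawing': a bright square on a dark background.
layout = np.zeros((8, 8))
layout[2:6, 2:6] = 1.0

frame = toy_translate(layout)
# The square's location survives translation: inside stays brighter than outside.
inside = frame[2:6, 2:6].mean()
outside = frame[layout == 0].mean()
print(f"inside={inside:.2f} outside={outside:.2f}")
```

The design point the toy captures: because the control signal is blended in (rather than merely hinted at in text), the output cannot wander away from the layout you drew, no matter what texture the generator invents.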

4. Why Does This Matter? (The Real-World Impact)

This isn't just about making cool videos; it's about training robots safely.

  • For Robots: Instead of breaking 1,000 real robot arms to learn how to fold laundry, engineers can train the robot in this "Cosmos" simulator. The robot makes all its mistakes in the digital world, learns the perfect moves, and then steps into the real world ready to work.
  • For Self-Driving Cars: You can't easily test a self-driving car in a blizzard in a desert. With Cosmos, you can generate a blizzard in a desert instantly. You can test the car's AI against millions of "what-if" scenarios (e.g., "What if a deer jumps out?") in seconds.
  • For Data: It creates "fake" data that looks so real it's indistinguishable from the real thing. This helps train other AIs without needing to film millions of hours of real footage.

5. The Big Picture

NVIDIA is releasing the source code and the models for free (open source).

Think of this as NVIDIA handing out the blueprints and the raw materials for a super-powerful simulation engine to the whole world. They want researchers, students, and companies to build their own "digital worlds" to solve real-world problems.

In a nutshell:
Cosmos-Predict2.5 is a time machine and a parallel universe generator combined. It lets us fast-forward the training of robots and self-driving cars by letting them live a thousand lifetimes in a computer before they ever touch the real world. It's the ultimate safety net and the ultimate practice ground for the future of physical AI.
