Spatial Causal Prediction in Video

This paper introduces Spatial Causal Prediction (SCP), a new task paradigm, together with SCP-Bench, a benchmark designed to evaluate and improve video models' ability to infer unseen spatial states and causal outcomes beyond what is directly visible. The results reveal a significant gap between current AI and human intelligence in this domain.

Yanguang Zhao, Jie Yang, Shengqiong Wu, Shutong Hu, Hongbo Qiu, Yu Wang, Guijia Zhang, Tan Kai Ze, Hao Fei, Chia-Wen Lin, Mong-Li Lee, Wynne Hsu

Published 2026-03-05

Imagine you are watching a cooking show. The chef is mixing ingredients in a bowl. Suddenly, the video cuts to black right before the chef pours the mixture onto a plate.

The Question: "Based on what you just saw, where will the food land on the plate?"

Most current AI models are like a student who has memorized the recipe but has never actually cooked. If you ask them, "What is in the bowl?" they can tell you. But if you ask, "Where will the food go after the cut?" they often guess wrong because they don't truly understand physics or cause and effect. They just see the picture, not the story.

This paper introduces a new way to test AI called SCP (Spatial Causal Prediction). Here is the breakdown in simple terms:

1. The Problem: The "Static" vs. "Dynamic" Gap

Think of existing AI benchmarks as a photo album. They ask questions about things you can see right now: "How many bowls are there?" or "Is the knife on the left?"

  • The Limitation: Real life isn't a photo album; it's a movie. Things move, collide, and change.
  • The New Challenge: SCP asks the AI to be a fortune teller or a time traveler. It forces the AI to watch a video, stop it at a specific moment, and predict what happens next (or what happened before) based on the laws of physics and logic, not just by looking at pixels.

2. The Solution: SCP-Bench (The "Gym" for AI)

The researchers built a massive gym called SCP-Bench to train and test these AI models.

  • The Equipment: They collected 1,181 video clips from sports, driving, factories, and kitchens.
  • The Workout: They created 2,500 questions. Some ask the AI to predict the future (e.g., "Will the ball go left or right?"), and some ask it to reconstruct the past (e.g., "Which object did the person touch first?").
  • The Twist: The AI is only allowed to see part of the video. It has to fill in the missing pieces using logic.
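To make the setup concrete, here is a minimal sketch of what an SCP-style evaluation loop could look like. Everything in it (the question format, field names, and the trivial baseline) is a hypothetical illustration, not the paper's actual harness:

```python
# Hypothetical sketch of an SCP-style evaluation loop.
# The question schema and model interface are illustrative
# assumptions, not the benchmark's real code.

from dataclasses import dataclass, field
from typing import Callable, List, Tuple

@dataclass
class SCPQuestion:
    clip_id: str                      # which video clip the question refers to
    visible_span: Tuple[float, float] # (start_sec, end_sec) the model may see
    question: str                     # e.g. "Will the ball go left or right?"
    choices: List[str] = field(default_factory=list)
    answer: str = ""                  # ground-truth choice

def evaluate(model: Callable[[SCPQuestion], str],
             questions: List[SCPQuestion]) -> float:
    """Accuracy over questions where the model only sees the visible
    span of each clip and must infer states outside it."""
    correct = sum(model(q) == q.answer for q in questions)
    return correct / len(questions)

# Toy run with an "always pick the first choice" baseline.
questions = [
    SCPQuestion("kitchen_001", (0.0, 4.5),
                "Where will the food land on the plate?",
                ["center", "left edge"], "center"),
    SCPQuestion("sports_017", (2.0, 6.0),
                "Will the ball go left or right?",
                ["left", "right"], "right"),
]
baseline = lambda q: q.choices[0]
print(f"baseline accuracy: {evaluate(baseline, questions):.2f}")  # → 0.50
```

The point of the "twist" is visible in the signature: the model never receives the full clip, only the `visible_span`, so anything outside it must come from reasoning rather than perception.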

3. The Results: The "Reality Check"

The researchers tested 23 of the smartest AI models available (including big names like GPT-5, Gemini, and Qwen). Here is what they found:

  • The Gap is Huge: Humans scored about 90% on these tests. The best AI model only scored about 66%. That's a massive gap. It's like a human chess grandmaster playing against a very good, but still amateur, computer.
  • Size Matters (But isn't everything): Bigger models generally did better, like a student with a bigger library of books. However, even the biggest models struggled with the "physics" part.
  • The "Thinking" Trap: The researchers tried forcing the AI to "think step-by-step" (like a human solving a math problem). Surprisingly, this didn't help much. It's like telling a confused student to "write down their thoughts"—if they don't understand the concept, writing it down doesn't fix the error.
  • Perception vs. Reasoning: The study found that the AI isn't failing because it can't see the video (perception). It fails because it can't reason about what it sees. It sees a ball moving up, but it doesn't "know" gravity will pull it down.
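The gravity failure described in the last bullet can be made concrete with a toy projectile calculation (a pure-physics sketch for illustration, not anything from the paper). A model that only extrapolates pixels continues the upward motion linearly; a model with a physical prior knows the ball decelerates and falls:

```python
# Toy illustration of the reasoning the paper says models lack:
# given a ball's last observed height and velocity, predict where
# it will be after the video cuts away.

G = 9.81  # gravitational acceleration, m/s^2

def predict_height(h0: float, v0: float, t: float) -> float:
    """Height (m) of a ball t seconds after the last visible frame,
    given initial height h0 (m) and upward velocity v0 (m/s)."""
    return h0 + v0 * t - 0.5 * G * t * t

# A ball last seen moving UP at 3 m/s from a height of 1 m.
t = 0.5
naive = 1.0 + 3.0 * t              # linear pixel extrapolation: keeps rising
physical = predict_height(1.0, 3.0, t)  # physics: it has already slowed down
print(f"naive: {naive:.2f} m, physical: {physical:.2f} m")
# → naive: 2.50 m, physical: 1.27 m
```

The two predictions diverge even half a second past the cut, which is exactly the kind of gap that separates "seeing the ball" from "knowing gravity."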

4. How to Fix It? (The "Training Wheels")

The researchers tried a few tricks to help the AI:

  • Giving Hints: When they gave the AI a text description of the future (e.g., "The ball will fall down"), the AI got much better. This suggests the AI is smart enough to use the answer if it's given the right clues, but it can't generate those clues on its own yet.
  • Scaling Up: Simply making the models bigger helped, but it wasn't a magic bullet.

The Big Takeaway

This paper is a wake-up call. We have built AI that is amazing at describing what it sees (like a tour guide), but it is still terrible at understanding how the world actually works (like a physicist).

The Analogy:

  • Current AI: A very observant tourist who can describe the scenery perfectly but doesn't understand why the river flows downstream.
  • Human Intelligence: Someone who understands the river flows downstream because of gravity, and can predict where a leaf will end up even if they can't see it yet.

The authors are saying: "We need to stop just teaching AI to see and start teaching it to understand cause and effect." Until we do that, our self-driving cars and robots might still trip over their own feet when things get complicated.