A Very Big Video Reasoning Suite

To address the lack of large-scale data for studying video reasoning, this paper introduces the Very Big Video Reasoning (VBVR) suite: a dataset of over one million video clips spanning 200 tasks, paired with a verifiable benchmark. Together they enable the first large-scale scaling study of video reasoning models, which reveals early signs of emergent generalization.

Maijunxian Wang, Ruisi Wang, Juyi Lin, Ran Ji, Thaddäus Wiedemer, Qingying Gao, Dezhi Luo, Yaoyao Qian, Lianyu Huang, Zelong Hong, Jiahui Ge, Qianli Ma, Hang He, Yifan Zhou, Lingzi Guo, Lantao Mei, Jiachen Li, Hanwen Xing, Tianqi Zhao, Fengyuan Yu, Weihang Xiao, Yizheng Jiao, Jianheng Hou, Danyang Zhang, Pengcheng Xu, Boyang Zhong, Zehong Zhao, Gaoyun Fang, John Kitaoka, Yile Xu, Hua Xu, Kenton Blacutt, Tin Nguyen, Siyuan Song, Haoran Sun, Shaoyue Wen, Linyang He, Runming Wang, Yanzhi Wang, Mengyue Yang, Ziqiao Ma, Raphaël Millière, Freda Shi, Nuno Vasconcelos, Daniel Khashabi, Alan Yuille, Yilun Du, Ziming Liu, Bo Li, Dahua Lin, Ziwei Liu, Vikash Kumar, Yijiang Li, Lei Yang, Zhongang Cai, Hokin Deng

Published 2026-02-25

Imagine you've taught a robot to draw beautiful pictures of cats, sunsets, and cars. It's an artistic genius. But then you ask it a simple question: "If I push this red block, will it knock over the blue one, and where will it land?"

Suddenly, the artist becomes confused. It might draw the block floating in the air or disappearing into the wall. It has visual beauty, but it lacks common sense.

This paper introduces a massive new project called VBVR (Very Big Video Reasoning) designed to fix exactly that problem. Think of it as a "Gym for Robot Brains" that doesn't just teach them to look pretty, but to think logically about how the world moves.

Here is the breakdown in simple terms:

1. The Problem: Robots Can't "Think" in Video Yet

Current AI video models are like actors who memorize lines but don't understand the plot. They can make a video of a ball rolling down a hill, but if you ask them to change the ball's color halfway through or predict where it stops, they often fail. They are great at generating (making things up) but bad at reasoning (figuring out cause and effect).

The main reason? They haven't had enough practice. Existing training data is like a small notebook with a few doodles. These models need a library with millions of books to learn the rules of physics, logic, and space.

2. The Solution: A Massive "Logic Library" (VBVR-Dataset)

The authors built the VBVR-Dataset, which is the biggest collection of video reasoning puzzles ever created.

  • The Scale: It contains over 1 million video clips. To put that in perspective, it's about 1,000 times larger than any previous dataset. It's like upgrading from a small local library to the entire Library of Congress.
  • The Structure: Instead of just random videos, they organized these puzzles into 5 "Mental Muscles" based on how human brains work:
    1. Perception: Can you spot the red ball in a crowd? (Seeing)
    2. Spatiality: Can you navigate a maze without hitting the walls? (Moving)
    3. Transformation: Can you imagine a cube spinning in your head? (Rotating)
    4. Abstraction: Can you figure out the pattern in a sequence? (Pattern matching)
    5. Knowledge: Do you know that water flows down, not up? (Physics/Logic)
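To make the taxonomy concrete, here is a minimal sketch of the five categories as a simple lookup table. The example tasks are illustrative paraphrases of the descriptions above, not the paper's actual task list.

```python
# Sketch only: the five VBVR reasoning categories mapped to example tasks.
# The task descriptions are hypothetical illustrations, not the real taxonomy.
REASONING_CATEGORIES = {
    "perception":     "spot the red ball among distractors",
    "spatiality":     "navigate a maze without touching the walls",
    "transformation": "predict how a cube looks after rotating",
    "abstraction":    "continue a visual pattern sequence",
    "knowledge":      "predict where spilled water flows",
}

def example_task(category: str) -> str:
    """Look up an example task; raises KeyError for unknown categories."""
    return REASONING_CATEGORIES[category]
```

A taxonomy like this lets every clip in the dataset be tagged with exactly one "mental muscle," so scaling effects can be measured per category.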

3. The Test: A Strict "Math Teacher" (VBVR-Bench)

Usually, when we test AI, we ask another AI (or a human) to say, "Is this video good?" This is subjective.

  • The VBVR-Bench is different. It uses rule-based scoring, like a math teacher grading a test.
  • If the task was "Move the blue block to the red door," the computer checks: Did it hit the door? Did it hit the wall? Did it take the shortest path?
  • There is no "maybe." It's either right or wrong. This ensures the results are honest and reproducible.
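The block-to-door example above can be sketched as a tiny rule-based grader. This is an illustration of the idea, not the benchmark's actual code: the episode format, the three checks, and the shortest-path criterion (a Manhattan-distance lower bound that ignores walls) are all assumptions made for this sketch.

```python
# Minimal sketch of rule-based scoring in the spirit of VBVR-Bench.
# Data format and checks are assumptions for illustration only.
from dataclasses import dataclass

Point = tuple[int, int]

@dataclass
class Episode:
    path: list[Point]   # positions of the controlled block, frame by frame
    goal: Point         # the red door's grid cell
    walls: set[Point]   # wall cells in the grid

def manhattan(a: Point, b: Point) -> int:
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def score(ep: Episode) -> dict:
    """Binary, reproducible checks: no judge model, no 'maybe'."""
    reached = ep.path[-1] == ep.goal
    hit_wall = any(p in ep.walls for p in ep.path)
    # Shortest possible step count, used as a lower bound (ignores walls).
    optimal = manhattan(ep.path[0], ep.goal)
    efficient = reached and (len(ep.path) - 1) <= optimal
    return {
        "reached_goal": reached,
        "no_collision": not hit_wall,
        "shortest_path": efficient,
    }

ep = Episode(path=[(0, 0), (0, 1), (0, 2), (1, 2)],
             goal=(1, 2), walls={(1, 0), (1, 1)})
print(score(ep))  # {'reached_goal': True, 'no_collision': True, 'shortest_path': True}
```

Because each check is a deterministic predicate over the episode, two labs running the same model get byte-identical scores, which is exactly what makes the benchmark "honest and reproducible."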

4. The Experiment: Training the "Student"

The researchers took a popular open-source video model (called Wan2.2) and trained it on this massive new dataset.

  • The Result: The model got significantly smarter. It went from being a clumsy artist to a logical thinker.
  • The "Aha!" Moment: As they fed it more data, the model started showing emergent generalization. This means it didn't just memorize the answers; it started understanding the rules. If it learned to solve a maze with 5 walls, it could solve a maze with 10 walls it had never seen before.

5. The Catch: We Still Have a Long Way to Go

Even with this massive training, the AI still isn't as smart as a human child.

  • The Gap: Humans score about 97% on these logic tests. The best AI model (after training) scored around 68%.
  • The Bottleneck: The AI struggles with long-term consistency. It might get the first step right, but by the 10th step, it forgets what it was doing. It's like a student who understands the first sentence of a story but loses the plot by the end.
  • Key Insight: The paper identifies controllability as the key prerequisite. Before an AI can reason, it must be able to control the scene precisely: if it accidentally changes the background or an object's shape while trying to move it, the reasoning breaks down.

The Big Picture

This paper is a foundational step. It says: "We can't just make prettier videos; we need to teach AI how the world works."

By providing a massive, structured gym (the dataset) and a strict coach (the benchmark), they have given the AI community the tools to build robots that don't just see the world, but understand it. It's the difference between a camera that records a car crash and a detective that figures out why the crash happened.
