A Very Big Video Reasoning Suite

To address the lack of large-scale data for studying video reasoning, this paper introduces the Very Big Video Reasoning (VBVR) suite: a dataset of over one million video clips spanning 200 tasks, paired with a verifiable benchmark. Together they enable the first large-scale scaling study of video reasoning models, which reveals early signs of emergent generalization.

Maijunxian Wang, Ruisi Wang, Juyi Lin, Ran Ji, Thaddäus Wiedemer, Qingying Gao, Dezhi Luo, Yaoyao Qian, Lianyu Huang, Zelong Hong, Jiahui Ge, Qianli Ma, Hang He, Yifan Zhou, Lingzi Guo, Lantao Mei, Jiachen Li, Hanwen Xing, Tianqi Zhao, Fengyuan Yu, Weihang Xiao, Yizheng Jiao, Jianheng Hou, Danyang Zhang, Pengcheng Xu, Boyang Zhong, Zehong Zhao, Gaoyun Fang, John Kitaoka, Yile Xu, Hua Xu, Kenton Blacutt, Tin Nguyen, Siyuan Song, Haoran Sun, Shaoyue Wen, Linyang He, Runming Wang, Yanzhi Wang, Mengyue Yang, Ziqiao Ma, Raphaël Millière, Freda Shi, Nuno Vasconcelos, Daniel Khashabi, Alan Yuille, Yilun Du, Ziming Liu, Bo Li, Dahua Lin, Ziwei Liu, Vikash Kumar, Yijiang Li, Lei Yang, Zhongang Cai, Hokin Deng

Published 2026-02-25

Imagine you've taught a robot to draw beautiful pictures of cats, sunsets, and cars. It's an artistic genius. But then you ask it a simple question: "If I push this red block, will it knock over the blue one, and where will it land?"

Suddenly, the artist becomes confused. It might draw the block floating in the air or disappearing into the wall. It has visual beauty, but it lacks common sense.

This paper introduces a massive new project called VBVR (Very Big Video Reasoning) designed to fix exactly that problem. Think of it as a "Gym for Robot Brains" that doesn't just teach them to look pretty, but to think logically about how the world moves.

Here is the breakdown in simple terms:

1. The Problem: Robots Can't "Think" in Video Yet

Current AI video models are like actors who memorize lines but don't understand the plot. They can make a video of a ball rolling down a hill, but if you ask them to change the ball's color halfway through or predict where it stops, they often fail. They are great at generating (making things up) but bad at reasoning (figuring out cause and effect).

The main reason? They haven't had enough practice. Existing training data is like a small notebook with a few doodles. These models need a library with millions of books to learn the rules of physics, logic, and space.

2. The Solution: A Massive "Logic Library" (VBVR-Dataset)

The authors built the VBVR-Dataset, which is the biggest collection of video reasoning puzzles ever created.

  • The Scale: It contains over 1 million video clips. To put that in perspective, it's about 1,000 times larger than any previous dataset. It's like upgrading from a small local library to the entire Library of Congress.
  • The Structure: Instead of just random videos, they organized these puzzles into 5 "Mental Muscles" based on how human brains work:
    1. Perception: Can you spot the red ball in a crowd? (Seeing)
    2. Spatiality: Can you navigate a maze without hitting the walls? (Moving)
    3. Transformation: Can you imagine a cube spinning in your head? (Rotating)
    4. Abstraction: Can you figure out the pattern in a sequence? (Pattern matching)
    5. Knowledge: Do you know that water flows down, not up? (Physics/Logic)
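To make the taxonomy concrete, here is a minimal sketch of the five categories as a simple lookup table. The example tasks are illustrative paraphrases of the descriptions above, not the paper's actual task list.

```python
# Sketch only: the five VBVR reasoning categories mapped to example tasks.
# The task descriptions are hypothetical illustrations, not the real taxonomy.
REASONING_CATEGORIES = {
    "perception":     "spot the red ball among distractors",
    "spatiality":     "navigate a maze without touching the walls",
    "transformation": "predict how a cube looks after rotating",
    "abstraction":    "continue a visual pattern sequence",
    "knowledge":      "predict where spilled water flows",
}

def example_task(category: str) -> str:
    """Look up an example task; raises KeyError for unknown categories."""
    return REASONING_CATEGORIES[category]
```

A taxonomy like this lets every clip in the dataset be tagged with exactly one "mental muscle," so scaling effects can be measured per category.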

3. The Test: A Strict "Math Teacher" (VBVR-Bench)

Usually, when we test AI, we ask another AI (or a human) to say, "Is this video good?" This is subjective.

  • The VBVR-Bench is different. It uses rule-based scoring, like a math teacher grading a test.
  • If the task was "Move the blue block to the red door," the computer checks: Did it hit the door? Did it hit the wall? Did it take the shortest path?
  • There is no "maybe." It's either right or wrong. This ensures the results are honest and reproducible.
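The block-to-door example above can be sketched as a tiny rule-based grader. This is an illustration of the idea, not the benchmark's actual code: the episode format, the three checks, and the shortest-path criterion (a Manhattan-distance lower bound that ignores walls) are all assumptions made for this sketch.

```python
# Minimal sketch of rule-based scoring in the spirit of VBVR-Bench.
# Data format and checks are assumptions for illustration only.
from dataclasses import dataclass

Point = tuple[int, int]

@dataclass
class Episode:
    path: list[Point]   # positions of the controlled block, frame by frame
    goal: Point         # the red door's grid cell
    walls: set[Point]   # wall cells in the grid

def manhattan(a: Point, b: Point) -> int:
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def score(ep: Episode) -> dict:
    """Binary, reproducible checks: no judge model, no 'maybe'."""
    reached = ep.path[-1] == ep.goal
    hit_wall = any(p in ep.walls for p in ep.path)
    # Shortest possible step count, used as a lower bound (ignores walls).
    optimal = manhattan(ep.path[0], ep.goal)
    efficient = reached and (len(ep.path) - 1) <= optimal
    return {
        "reached_goal": reached,
        "no_collision": not hit_wall,
        "shortest_path": efficient,
    }

ep = Episode(path=[(0, 0), (0, 1), (0, 2), (1, 2)],
             goal=(1, 2), walls={(1, 0), (1, 1)})
print(score(ep))  # {'reached_goal': True, 'no_collision': True, 'shortest_path': True}
```

Because each check is a deterministic predicate over the episode, two labs running the same model get byte-identical scores, which is exactly what makes the benchmark "honest and reproducible."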

4. The Experiment: Training the "Student"

The researchers took a popular open-source video model (called Wan2.2) and trained it on this massive new dataset.

  • The Result: The model got significantly smarter. It went from being a clumsy artist to a logical thinker.
  • The "Aha!" Moment: As they fed it more data, the model started showing emergent generalization. This means it didn't just memorize the answers; it started understanding the rules. If it learned to solve a maze with 5 walls, it could solve a maze with 10 walls it had never seen before.

5. The Catch: We Still Have a Long Way to Go

Even with this massive training, the AI still isn't as smart as a human child.

  • The Gap: Humans score about 97% on these logic tests. The best AI model (after training) scored around 68%.
  • The Bottleneck: The AI struggles with long-term consistency. It might get the first step right, but by the 10th step, it forgets what it was doing. It's like a student who understands the first sentence of a story but loses the plot by the end.
  • Key Insight: The paper identifies controllability as the key prerequisite. Before an AI can reason, it must be able to control the scene precisely: if it accidentally changes the background or an object's shape while trying to move it, the reasoning breaks down.

The Big Picture

This paper is a foundational step. It says: "We can't just make prettier videos; we need to teach AI how the world works."

By providing a massive, structured gym (the dataset) and a strict coach (the benchmark), they have given the AI community the tools to build robots that don't just see the world, but understand it. It's the difference between a camera that records a car crash and a detective that figures out why the crash happened.
