Thinking with Spatial Code for Physical-World Video Reasoning

This paper introduces "Thinking with Spatial Code," a framework that converts RGB videos into explicit, temporally coherent 3D representations using a specialized spatial encoder and reinforcement learning, enabling large language models to achieve state-of-the-art performance in physical-world visual reasoning on VSI-Bench.

Jieneng Chen, Wenxin Ma, Ruisheng Yuan, Yunzhi Zhang, Jiajun Wu, Alan Yuille

Published 2026-03-09

Imagine you are watching a video of a messy living room. You see a cat jump off a sofa, run past a coffee table, and hide under a chair.

Current AI models (like the smartest chatbots today) watch this video like a human who only has eyes but no brain for geometry. They see a "brown blob" moving, then a "white blob" moving. They can describe what they see ("The cat ran"), but they often get confused about where things are in 3D space. If you ask, "Is the cat to the left or right of the table from the table's perspective?", these models might get it wrong because they are just guessing based on how the picture looks on the screen, not understanding the actual 3D room.

This paper introduces a new way of thinking called "Thinking with Spatial Code."

Here is the simple breakdown of how it works, using a few analogies:

1. The Problem: The "Blurry Photo" vs. The "Blueprint"

Imagine trying to solve a puzzle using a blurry, 2D photo of the pieces. You can guess the shapes, but you don't know exactly how deep they are or how they fit together in 3D.

Most AI tries to solve video questions by looking at the "blurry photo" (the raw video pixels). It's good at recognizing faces or colors, but terrible at understanding distance, orientation, and 3D layout.

2. The Solution: The "Architect's Blueprint"

The authors' new framework doesn't just look at the video; it translates the video into a 3D Blueprint (which they call "Spatial Code").

Think of it like this:

  • The Old Way: The AI looks at a video of a kitchen and says, "I see a stove and a fridge."
  • The New Way: The AI acts like a super-fast architect. It watches the video and instantly draws a 3D blueprint. It writes down:
    • Stove: Located at coordinates (X, Y, Z), facing North, size 2x2 meters.
    • Fridge: Located at coordinates (X+3, Y, Z), facing North, size 1x1 meters.

This "Blueprint" is a list of facts, not a picture. It turns the messy video into clean, mathematical data.
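The paper does not publish an exact schema for the Spatial Code, but the idea can be sketched in a few lines of Python. The `SpatialEntry` class and the field names below are illustrative assumptions, not the authors' actual format:

```python
from dataclasses import dataclass

@dataclass
class SpatialEntry:
    name: str
    position: tuple      # (x, y, z) in meters, in a shared world frame
    facing_deg: float    # heading in degrees; 0 = "north" in this sketch
    size_m: tuple        # (width, depth) footprint in meters

# A tiny "blueprint" for the kitchen example above
spatial_code = [
    SpatialEntry("stove",  (0.0, 0.0, 0.0), 0.0, (2.0, 2.0)),
    SpatialEntry("fridge", (3.0, 0.0, 0.0), 0.0, (1.0, 1.0)),
]

# Because the scene is now data, spatial questions become arithmetic:
stove, fridge = spatial_code
gap = fridge.position[0] - stove.position[0]
print(gap)  # distance along x between stove and fridge: 3.0
```

The point of the structure is exactly what the text says: once the scene is a list of facts, "how far apart are they?" is a subtraction, not a pixel-level guess.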

3. The Process: Two Steps to Genius

The system works in two distinct stages, like a construction crew and an interior designer working together.

Step A: The Construction Crew (The Spatial Encoder)
This is the part that watches the video. It uses off-the-shelf vision tools (like "Segment Anything" and "Depth Anything") to:

  1. Identify objects: "That's a sofa."
  2. Track them: "The sofa stayed in the same spot while the camera moved."
  3. Measure them: "The sofa is 2 meters long and facing the TV."
  4. Output: It spits out the "Spatial Code" (the blueprint).
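The four steps above can be sketched as a pipeline. Every function body here is a dummy stand-in (the real system would call segmentation, tracking, and depth models); only the shape of the data flow is meant to match the description:

```python
# Illustrative sketch of the encoder pipeline. The function names and
# return values are stand-ins, not the paper's actual code or API.

def identify_objects(frames):
    # Step 1 — in practice, a segmenter such as Segment Anything proposes masks.
    return [{"name": "sofa"}]

def track_objects(objects, frames):
    # Step 2 — in practice, masks are associated across frames as the camera moves.
    for obj in objects:
        obj["track"] = [(0.0, 0.0, 0.0)] * len(frames)  # static object, camera moves
    return objects

def measure_objects(objects):
    # Step 3 — in practice, depth (e.g. from Depth Anything) lifts masks into metric 3D.
    for obj in objects:
        obj["size_m"] = (2.0, 0.9)
        obj["facing"] = "tv"
    return objects

def encode(frames):
    # Step 4 — emit the "Spatial Code": one line of facts per object.
    objects = measure_objects(track_objects(identify_objects(frames), frames))
    return [f"{o['name']}: size={o['size_m']}, facing={o['facing']}" for o in objects]

print(encode(frames=[None, None]))  # ["sofa: size=(2.0, 0.9), facing=tv"]
```

The output is plain text, which is the key design choice: a language model can read it directly, with no vision stack required at reasoning time.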

Step B: The Interior Designer (The Language Model)
Now, instead of giving the Language Model (the "brain") the raw video, we give it the Blueprint.

  • Question: "If I'm standing at the dishwasher facing the table, is the washer to my left or right?"
  • Old AI: Looks at the video, gets confused by the camera angle, and guesses.
  • New AI: Reads the blueprint. It sees the exact coordinates of the dishwasher, the table, and the washer. It does a quick math calculation (like a GPS) and says, "Ah, the washer is at coordinate X, which is definitely to the front-left."
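The "quick math calculation" in the last bullet is a standard geometric test. A minimal sketch, assuming 2D floor-plan coordinates (the point names and numbers are made up for illustration): the sign of the 2D cross product between the direction you are facing and the direction to the object tells you which side it is on.

```python
def side_of(observer, facing_target, query):
    """Which side of the observer's line of sight is `query` on?

    All points are (x, y) floor-plan coordinates. The observer stands at
    `observer` and faces `facing_target`. The sign of the 2D cross product
    of (facing vector) x (observer-to-query vector) gives the side.
    """
    fx, fy = facing_target[0] - observer[0], facing_target[1] - observer[1]
    qx, qy = query[0] - observer[0], query[1] - observer[1]
    cross = fx * qy - fy * qx
    if cross > 0:
        return "left"
    if cross < 0:
        return "right"
    return "straight ahead"

# Standing at the dishwasher (0, 0), facing the table at (0, 3),
# with the washer at (-2, 1):
print(side_of((0, 0), (0, 3), (-2, 1)))  # left
```

Note what the function does *not* use: camera angles or pixels. This is why the blueprint approach is robust to viewpoint changes that confuse raw-video models.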

4. The Secret Sauce: "The Rubric" (The Strict Teacher)

The authors found that even with the blueprint, the AI sometimes makes "logic errors." It might calculate the numbers right but write the wrong answer, or it might get the direction wrong because it forgot to imagine standing at the dishwasher.

To fix this, they used Reinforcement Learning with a special "Spatial Rubric."

  • Imagine a teacher grading a math test.
  • Old Grading: "Did you get the right answer? Yes? +10 points."
  • New Grading (The Rubric): "Did you get the right answer? Yes. But did you show your work? Did you set up the coordinate system correctly? Did you check the orientation? If you guessed the right answer without doing the math, you get a penalty!"

This forces the AI to learn how to think spatially, not just memorize answers.
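A rubric-style reward like the one described can be sketched as a scoring function. The criteria names and point values below are invented for illustration; the paper's actual Spatial Rubric will differ in detail, but the structure (credit for verifiable reasoning steps, a penalty for a lucky guess) matches the teacher analogy:

```python
# Hypothetical rubric reward for RL fine-tuning. Criteria and weights
# are illustrative, not the paper's actual values.

def rubric_reward(answer_correct, showed_coordinates, checked_orientation):
    reward = 0.0
    if answer_correct:
        reward += 1.0                     # "Did you get the right answer?"
    if showed_coordinates:
        reward += 0.5                     # "Did you set up the coordinate system?"
    if checked_orientation:
        reward += 0.5                     # "Did you check the orientation?"
    # Lucky guess: right answer with no work shown gets a penalty.
    if answer_correct and not (showed_coordinates or checked_orientation):
        reward -= 0.75
    return reward

print(rubric_reward(True, True, True))    # full marks: 2.0
print(rubric_reward(True, False, False))  # lucky guess penalized: 0.25
```

The key property is that the second call scores far below the first even though both answers are "correct," so the policy is pushed toward showing spatial work rather than pattern-matching answers.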

Why Does This Matter?

The paper shows that making the AI smarter isn't about making the brain bigger; it's about giving it better tools.

  • The Result: Their model, which is actually smaller than some giant commercial models (like GPT-5 or Gemini), beats them all at spatial-reasoning tasks on VSI-Bench.
  • The Lesson: It's not about how many "neurons" the AI has; it's about whether it understands the 3D world. By translating video into a "Spatial Code" (a blueprint), they unlocked a level of understanding that raw video processing couldn't achieve.

In a nutshell: They taught the AI to stop staring at the picture and start reading the map.