Thinking with Spatial Code for Physical-World Video Reasoning

This paper introduces "Thinking with Spatial Code," a framework that converts RGB videos into explicit, temporally coherent 3D representations using a specialized spatial encoder and reinforcement learning, enabling large language models to achieve state-of-the-art performance in physical-world visual reasoning on VSI-Bench.

Jieneng Chen, Wenxin Ma, Ruisheng Yuan, Yunzhi Zhang, Jiajun Wu, Alan Yuille

Published 2026-03-09

Imagine you are watching a video of a messy living room. You see a cat jump off a sofa, run past a coffee table, and hide under a chair.

Current AI models (like the smartest chatbots today) watch this video like a human who only has eyes but no brain for geometry. They see a "brown blob" moving, then a "white blob" moving. They can describe what they see ("The cat ran"), but they often get confused about where things are in 3D space. If you ask, "Is the cat to the left or right of the table from the table's perspective?", these models might get it wrong because they are just guessing based on how the picture looks on the screen, not understanding the actual 3D room.

This paper introduces a new way of thinking called "Thinking with Spatial Code."

Here is the simple breakdown of how it works, using a few analogies:

1. The Problem: The "Blurry Photo" vs. The "Blueprint"

Imagine trying to solve a puzzle using a blurry, 2D photo of the pieces. You can guess the shapes, but you don't know exactly how deep they are or how they fit together in 3D.

Most AI tries to solve video questions by looking at the "blurry photo" (the raw video pixels). It's good at recognizing faces or colors, but terrible at understanding distance, orientation, and 3D layout.

2. The Solution: The "Architect's Blueprint"

The authors' new framework doesn't just look at the video; it translates the video into a 3D Blueprint (which they call "Spatial Code").

Think of it like this:

  • The Old Way: The AI looks at a video of a kitchen and says, "I see a stove and a fridge."
  • The New Way: The AI acts like a super-fast architect. It watches the video and instantly draws a 3D blueprint. It writes down:
    • Stove: Located at coordinates (X, Y, Z), facing North, size 2x2 meters.
    • Fridge: Located at coordinates (X+3, Y, Z), facing North, size 1x1 meters.

This "Blueprint" is a list of facts, not a picture. It turns the messy video into clean, mathematical data.
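The paper does not publish an exact schema for the Spatial Code, but the idea can be sketched in a few lines of Python. The `SpatialEntry` class and the field names below are illustrative assumptions, not the authors' actual format:

```python
from dataclasses import dataclass

@dataclass
class SpatialEntry:
    name: str
    position: tuple      # (x, y, z) in meters, in a shared world frame
    facing_deg: float    # heading in degrees; 0 = "north" in this sketch
    size_m: tuple        # (width, depth) footprint in meters

# A tiny "blueprint" for the kitchen example above
spatial_code = [
    SpatialEntry("stove",  (0.0, 0.0, 0.0), 0.0, (2.0, 2.0)),
    SpatialEntry("fridge", (3.0, 0.0, 0.0), 0.0, (1.0, 1.0)),
]

# Because the scene is now data, spatial questions become arithmetic:
stove, fridge = spatial_code
gap = fridge.position[0] - stove.position[0]
print(gap)  # distance along x between stove and fridge: 3.0
```

The point of the structure is exactly what the text says: once the scene is a list of facts, "how far apart are they?" is a subtraction, not a pixel-level guess.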

3. The Process: Two Steps to Genius

The system works in two distinct stages, like a construction crew and an interior designer working together.

Step A: The Construction Crew (The Spatial Encoder)
This is the part that watches the video. It uses off-the-shelf vision tools (like "Segment Anything" and "Depth Anything") to:

  1. Identify objects: "That's a sofa."
  2. Track them: "The sofa stayed in the same spot while the camera moved."
  3. Measure them: "The sofa is 2 meters long and facing the TV."
  4. Output: It spits out the "Spatial Code" (the blueprint).
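The four steps above can be sketched as a pipeline. Every function body here is a dummy stand-in (the real system would call segmentation, tracking, and depth models); only the shape of the data flow is meant to match the description:

```python
# Illustrative sketch of the encoder pipeline. The function names and
# return values are stand-ins, not the paper's actual code or API.

def identify_objects(frames):
    # Step 1 — in practice, a segmenter such as Segment Anything proposes masks.
    return [{"name": "sofa"}]

def track_objects(objects, frames):
    # Step 2 — in practice, masks are associated across frames as the camera moves.
    for obj in objects:
        obj["track"] = [(0.0, 0.0, 0.0)] * len(frames)  # static object, camera moves
    return objects

def measure_objects(objects):
    # Step 3 — in practice, depth (e.g. from Depth Anything) lifts masks into metric 3D.
    for obj in objects:
        obj["size_m"] = (2.0, 0.9)
        obj["facing"] = "tv"
    return objects

def encode(frames):
    # Step 4 — emit the "Spatial Code": one line of facts per object.
    objects = measure_objects(track_objects(identify_objects(frames), frames))
    return [f"{o['name']}: size={o['size_m']}, facing={o['facing']}" for o in objects]

print(encode(frames=[None, None]))  # ["sofa: size=(2.0, 0.9), facing=tv"]
```

The output is plain text, which is the key design choice: a language model can read it directly, with no vision stack required at reasoning time.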

Step B: The Interior Designer (The Language Model)
Now, instead of giving the Language Model (the "brain") the raw video, we give it the Blueprint.

  • Question: "If I'm standing at the dishwasher facing the table, is the washer to my left or right?"
  • Old AI: Looks at the video, gets confused by the camera angle, and guesses.
  • New AI: Reads the blueprint. It sees the exact coordinates of the dishwasher, the table, and the washer. It does a quick math calculation (like a GPS) and says, "Ah, the washer is at coordinate X, which is definitely to the front-left."
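The "quick math calculation" in the last bullet is a standard geometric test. A minimal sketch, assuming 2D floor-plan coordinates (the point names and numbers are made up for illustration): the sign of the 2D cross product between the direction you are facing and the direction to the object tells you which side it is on.

```python
def side_of(observer, facing_target, query):
    """Which side of the observer's line of sight is `query` on?

    All points are (x, y) floor-plan coordinates. The observer stands at
    `observer` and faces `facing_target`. The sign of the 2D cross product
    of (facing vector) x (observer-to-query vector) gives the side.
    """
    fx, fy = facing_target[0] - observer[0], facing_target[1] - observer[1]
    qx, qy = query[0] - observer[0], query[1] - observer[1]
    cross = fx * qy - fy * qx
    if cross > 0:
        return "left"
    if cross < 0:
        return "right"
    return "straight ahead"

# Standing at the dishwasher (0, 0), facing the table at (0, 3),
# with the washer at (-2, 1):
print(side_of((0, 0), (0, 3), (-2, 1)))  # left
```

Note what the function does *not* use: camera angles or pixels. This is why the blueprint approach is robust to viewpoint changes that confuse raw-video models.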

4. The Secret Sauce: "The Rubric" (The Strict Teacher)

The authors found that even with the blueprint, the AI sometimes makes "logic errors." It might calculate the numbers right but write the wrong answer, or it might get the direction wrong because it forgot to imagine standing at the dishwasher.

To fix this, they used Reinforcement Learning with a special "Spatial Rubric."

  • Imagine a teacher grading a math test.
  • Old Grading: "Did you get the right answer? Yes? +10 points."
  • New Grading (The Rubric): "Did you get the right answer? Yes. But did you show your work? Did you set up the coordinate system correctly? Did you check the orientation? If you guessed the right answer without doing the math, you get a penalty!"

This forces the AI to learn how to think spatially, not just memorize answers.
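A rubric-style reward like the one described can be sketched as a scoring function. The criteria names and point values below are invented for illustration; the paper's actual Spatial Rubric will differ in detail, but the structure (credit for verifiable reasoning steps, a penalty for a lucky guess) matches the teacher analogy:

```python
# Hypothetical rubric reward for RL fine-tuning. Criteria and weights
# are illustrative, not the paper's actual values.

def rubric_reward(answer_correct, showed_coordinates, checked_orientation):
    reward = 0.0
    if answer_correct:
        reward += 1.0                     # "Did you get the right answer?"
    if showed_coordinates:
        reward += 0.5                     # "Did you set up the coordinate system?"
    if checked_orientation:
        reward += 0.5                     # "Did you check the orientation?"
    # Lucky guess: right answer with no work shown gets a penalty.
    if answer_correct and not (showed_coordinates or checked_orientation):
        reward -= 0.75
    return reward

print(rubric_reward(True, True, True))    # full marks: 2.0
print(rubric_reward(True, False, False))  # lucky guess penalized: 0.25
```

The key property is that the second call scores far below the first even though both answers are "correct," so the policy is pushed toward showing spatial work rather than pattern-matching answers.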

Why Does This Matter?

The paper shows that making the AI smarter isn't about making the brain bigger; it's about giving it better tools.

  • The Result: Their model, which is actually smaller than some giant commercial models (like GPT-5 or Gemini), beats them all at spatial-reasoning tasks on VSI-Bench.
  • The Lesson: It's not about how many "neurons" the AI has; it's about whether it understands the 3D world. By translating video into a "Spatial Code" (a blueprint), they unlocked a level of understanding that raw video processing couldn't achieve.

In a nutshell: They taught the AI to stop staring at the picture and start reading the map.