The Big Problem: The "Two-Eyed" Blind Spot
Imagine you have a very smart robot assistant (a Vision-Language Model) that can look at a photo or a video and describe what it sees. It's great at saying, "That's a red car," or "The dog is chasing the ball."
But ask it a tricky question like, "If I walk around the car to the left, will I see the driver?" or "How far is that tree from the building?"
Suddenly, the robot gets confused. It tries to guess based on how things look (2D colors and shapes) rather than how they actually exist in 3D space. It's like trying to judge the depth of a swimming pool just by looking at a flat painting of it.
The Failed Fix: The "Useless Map"
Researchers tried to fix this by giving the robot a 3D map (geometry tokens) alongside the picture. They thought, "If we give the robot a blueprint, it will finally understand space!"
But here's the surprising discovery the paper makes: The robot ignored the map.
Even with the 3D map in its hand, the robot kept relying on the flat picture because it was easier. It was like giving a GPS to a driver who refuses to look at the screen and just keeps driving by looking out the window. In fact, sometimes adding the map made the robot worse because it got distracted by the extra information it didn't know how to use.
The Solution: GeoSR (The "Smart Coach")
The authors created a new framework called GeoSR to force the robot to actually use the 3D map. They did this with two clever tricks:
1. The "Blindfold Training" (Geometry-Unleashing Masking)
The Analogy: Imagine you are teaching a student to navigate a maze. If you let them see the whole maze, they might just memorize the colors of the walls. But if you block their view of the maze and give them only the map, they have to learn to read the map to find their way.
How it works:
During training, GeoSR covers up (masks) parts of the 2D picture the robot is looking at.
- Static scenes: It hides randomly chosen parts of the image.
- Videos: It uses the model's own attention to find which parts of the image it relies on most, then hides exactly those parts.
- The Result: The robot is forced to panic and say, "Wait, I can't see the picture! I have to look at the 3D map to answer this!" This breaks its bad habit of ignoring the geometry.
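The two masking strategies above can be sketched in a few lines of plain Python. Everything here (the function name, the mask ratio, using a zero value as the mask token) is an illustrative assumption for the explanation, not the paper's actual implementation:

```python
import random

def mask_patches(patch_tokens, mask_ratio=0.5, attention=None, mask_token=0.0):
    """Hide a fraction of 2D image-patch tokens.

    Static case (attention=None): hide random patches.
    Video case (attention given): hide the patches the model
    attends to most, i.e. the ones it leans on hardest.
    """
    n = len(patch_tokens)
    k = int(n * mask_ratio)
    if attention is not None:
        # Rank patches by attention score and hide the top-k.
        order = sorted(range(n), key=lambda i: attention[i], reverse=True)
        hidden = set(order[:k])
    else:
        # Hide k patches chosen uniformly at random.
        hidden = set(random.sample(range(n), k))
    return [mask_token if i in hidden else t
            for i, t in enumerate(patch_tokens)]
```

With the most informative 2D patches zeroed out, the only way to answer a spatial question is to fall back on the geometry tokens, which is exactly the habit the training is meant to build.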
2. The "Smart Traffic Light" (Geometry-Guided Fusion)
The Analogy: Imagine a traffic light at a busy intersection. A dumb light runs on a fixed timer no matter what the traffic is doing. A smart light watches the intersection and adapts: it favors whichever direction needs it most right now. Likewise, a dumb model would mix the 2D picture and the 3D map in fixed proportions; a smart one adjusts the mix to the situation.
How it works:
Instead of just mixing the 2D picture and the 3D map together equally, GeoSR uses a "gatekeeper" (a learned gate mechanism).
- If the robot is looking at a clear, simple object, the gate says, "Trust the picture."
- If the robot is looking at a tricky angle or a moving object where the picture is misleading, the gate says, "Stop! Trust the 3D map right here."
- The Result: The robot learns to switch between the picture and the map intelligently, using the right tool for the right job.
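A minimal sketch of such a learned gate, reduced to scalar tokens for readability. The function names and weights below are illustrative stand-ins, not GeoSR's actual parameters or API:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gated_fusion(vis_tok, geo_tok, w_vis, w_geo, bias):
    """Blend a 2D visual token with a 3D geometry token via a gate.

    The gate g is computed from both inputs:
      g near 1 -> "trust the picture"
      g near 0 -> "trust the 3D map"
    """
    g = sigmoid(w_vis * vis_tok + w_geo * geo_tok + bias)
    fused = g * vis_tok + (1.0 - g) * geo_tok
    return fused, g
```

In the real model the gate would operate on token vectors and its weights would be learned end to end; the point is that the mixing ratio g is computed per token from the inputs themselves, rather than being a fixed 50/50 blend.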
The Outcome: A Master Navigator
When the researchers tested this new system:
- Static Scenes: The robot got much better at answering questions about distance and direction in still photos.
- Dynamic Scenes (Videos): The improvement was huge. The robot could now track moving objects and understand how space changes over time, something previous models struggled with.
The Takeaway
The paper teaches us that just giving a model more data (like 3D geometry) isn't enough. You have to teach it how to use that data.
By temporarily hiding the easy answers (the 2D picture) and building a smart system to decide when to use the 3D map, the authors turned a confused robot into a spatial reasoning expert. They didn't just add a tool; they taught the robot how to hold it.