Imagine you are trying to solve a mystery in a house you've never visited, but you only have a few blurry photos taken from different corners. You need to answer a tricky question like, "If I'm standing in the kitchen looking at the fridge, what's hidden behind the sofa in the living room?"
Most current AI models are like amazing guessers. They look at your photos and try to "imagine" the rest of the house in their mind. Sometimes they get lucky, but often they get confused, mix up left and right, or hallucinate furniture that isn't there. It's like trying to build a 3D model of a house in your head just by looking at a flat drawing; it's hard to get the depth and angles right.
Enter pySpatial, the new framework introduced in this paper. Think of pySpatial not as a guesser, but as a smart architect with a magic toolkit.
Here is how it works, broken down into simple steps:
1. The "Magic Blueprint" (3D Reconstruction)
Instead of just staring at the photos, pySpatial takes those flat 2D images and instantly builds a virtual 3D model of the room.
- Analogy: Imagine taking a stack of 2D blueprints and a laser scanner to instantly print a full-scale, walkable cardboard model of the house. Now, the AI isn't guessing; it has a physical (digital) object to inspect.
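The reconstruction step above can be sketched in a few lines. This is a generic illustration of back-projecting a depth image into a 3D point cloud with a pinhole camera model, not the paper's actual pipeline; the function name, focal lengths, and toy depth values are all made up for illustration.

```python
import numpy as np

def depth_to_point_cloud(depth, fx, fy, cx, cy):
    """Back-project every pixel (u, v) with depth d to a 3D point.

    A minimal sketch of one reconstruction step: turning a flat depth
    image back into 3D geometry the system can inspect and "walk around".
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    x = (u - cx) * depth / fx  # horizontal offset from the optical axis
    y = (v - cy) * depth / fy  # vertical offset
    z = depth                  # distance along the viewing direction
    return np.stack([x, y, z], axis=-1).reshape(-1, 3)

# A toy 2x2 "depth photo": every pixel is 2 metres away.
depth = np.full((2, 2), 2.0)
points = depth_to_point_cloud(depth, fx=1.0, fy=1.0, cx=0.5, cy=0.5)
print(points.shape)  # (4, 3): four flat pixels became four 3D points
```

Real systems build this from many photos at once (estimating depth and camera poses along the way), but the core idea is the same: flat pixels in, inspectable 3D points out.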
2. The "Robot Intern" (Visual Programming)
This is the coolest part. The AI doesn't just "think" about the answer; it writes a Python program (a set of instructions) to find the answer for itself.
- Analogy: Imagine you have a robot intern. Instead of asking the intern, "What's behind the sofa?" and hoping they guess right, you give them a checklist:
- Go to the spot where the photo was taken.
- Turn the camera 90 degrees to the left.
- Take a new picture of what you see now.
- Show me that picture.
The AI generates this checklist (the code), runs it, and gets a new, synthesized photo that proves the answer.
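To make the "checklist" idea concrete, here is a toy sketch of the kind of program the model might write. The scene, the `visible_objects` helper, and the coordinates are all invented for illustration; the real pySpatial toolkit's API will differ.

```python
import math

# A made-up 2D scene: object positions in metres (x, y).
scene = {"sofa": (2.0, 0.0), "blue trash can": (0.0, 2.0)}

def visible_objects(cam_xy, heading_deg, fov_deg=90):
    """Return objects whose bearing falls inside the camera's field of view."""
    seen = []
    for name, (x, y) in scene.items():
        bearing = math.degrees(math.atan2(y - cam_xy[1], x - cam_xy[0]))
        diff = (bearing - heading_deg + 180) % 360 - 180  # signed angle to target
        if abs(diff) <= fov_deg / 2:
            seen.append(name)
    return seen

# Step 1: stand where the photo was taken, facing the sofa (heading 0 degrees).
print(visible_objects((0.0, 0.0), 0))   # ['sofa']
# Step 2: turn the camera 90 degrees to the left and "take a new picture".
print(visible_objects((0.0, 0.0), 90))  # ['blue trash can']
```

The point is not this particular geometry; it is that the answer comes from executing explicit, checkable steps rather than from a guess.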
3. The "New Perspective" (Novel View Synthesis)
Once the robot intern takes that new picture, the AI looks at it to answer the question.
- Analogy: If you asked, "What's to the left of the blue chair?", the AI doesn't guess. It literally rotates the camera in its 3D model, snaps a photo of the left side, and says, "Ah, I see a blue trash can there."
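The "rotate the camera and look again" step boils down to a coordinate change. This is a generic sketch, not the paper's renderer: the trash-can position and the rotation convention are assumptions chosen for illustration (camera looks along +z, x points right, so turning the camera left swings points to the right in camera coordinates).

```python
import numpy as np

def yaw_matrix(deg):
    """Rotation of camera-frame points about the vertical (y) axis."""
    r = np.radians(deg)
    return np.array([[ np.cos(r), 0, np.sin(r)],
                     [ 0,         1, 0        ],
                     [-np.sin(r), 0, np.cos(r)]])

trash_can = np.array([-2.0, 0.0, 0.0])  # 2 m to the camera's left

# Facing forward (+z), the can has no forward distance: it is out of frame.
print(trash_can[2] > 0)  # False

# Turn the camera 90 degrees left: the can now sits 2 m straight ahead.
rotated = yaw_matrix(90) @ trash_can
print(rotated)  # [0. 0. 2.]
```

Once the point lands in front of the virtual camera, projecting it into a new synthesized image is the same pinhole arithmetic as before, run in reverse.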
Why is this a big deal?
- No Training Needed: Most AI models need to be "trained" on millions of examples to learn how to do this. pySpatial works zero-shot, meaning it can walk into a brand-new, weird room it has never seen before and solve the puzzle immediately, just by using its tools.
- It's Transparent: Because the AI writes code, we can see exactly how it solved the problem. If it gets it wrong, we can look at the code and say, "Oh, you rotated the wrong way!" It's not a black box; it's a clear, logical process.
- Real-World Use: The paper shows this working on a real robot (a four-legged dog robot). The robot used pySpatial to navigate a real office, go through doors, and find a toy mushroom, all without crashing.
The Bottom Line
Before pySpatial, AI was like a fortune teller trying to guess the layout of a room based on a few clues. With pySpatial, the AI becomes a detective who builds a 3D model, walks around it virtually, takes new photos, and finds the answer with proof.
It turns the hard problem of "spatial reasoning" (understanding space) into a simple game of "follow the instructions," making AI much safer and smarter for tasks like robot navigation and augmented reality.