Pursuing Minimal Sufficiency in Spatial Reasoning

Imagine you are trying to solve a complex mystery in a messy, cluttered room. You have a brilliant detective (the AI) who is very smart but easily overwhelmed. If you dump the entire room's contents onto their desk—every single sock, every crumb, every shadow—they might get confused, miss the crucial clue, or start guessing wildly.

This is exactly the problem the paper "Pursuing Minimal Sufficiency in Spatial Reasoning" tackles. It introduces a new way for AI to understand 3D spaces (like rooms, cities, or virtual worlds) by teaching it to be a smart editor rather than a hoarder.

Here is the breakdown using simple analogies:

The Problem: The "Cluttered Desk" Syndrome

Current AI models (Vision-Language Models) are like detectives trained mostly on 2D photos. They struggle to understand depth, distance, and orientation in 3D space.

The Issue: When you ask them, "Is the chair facing the window?", they try to look at everything in the picture at once.
The Result: They get "information overload." Too much irrelevant detail (like the color of the rug or a picture on the far wall) drowns out the important clues, causing them to hallucinate or guess wrong.

The Solution: The "Minimal Sufficient Set" (MSS)

The authors propose a new philosophy: Don't look at everything. Look only at what is strictly necessary to solve the puzzle.

Think of it like packing for a hiking trip. You don't need to pack your entire house. You only need the "Minimal Sufficient" gear: boots, a map, and water. Anything else is just dead weight. The AI needs to find this "perfectly packed backpack" of information before it tries to answer a question.

How It Works: The "Scout and Editor" Team

To achieve this, the paper introduces MSSR, a system with two AI agents working together like a dynamic duo:

1. The Perception Agent (The "Scout")

Role: This agent is the eyes and hands. It goes into the 3D scene and gathers raw data.
The Superpower: It uses a special toolkit (like a Swiss Army knife) to measure things precisely.
- It can build a 3D map of the room from photos.
- It can pinpoint exactly where a chair is.
- The "SOG" Module: This is a clever trick. Instead of asking the AI to "guess" which way a person is facing (which is hard), the AI draws arrows on the image and asks, "Which arrow matches the description?" It turns a hard math problem into a simple "pick the right picture" game.
The Flaw: The Scout is enthusiastic! It gathers too much data. It brings back 18 facts when you only need 3.

2. The Reasoning Agent (The "Editor")

Role: This agent is the brain and the critic. It sits at the desk with the Scout's pile of data.
The Job: It reads the pile and asks: "Do I need this to answer the question?"
- If the question is "Is the chair facing the window?", the Editor throws away the facts about the rug, the lighting, and the color of the walls.
- It keeps only the chair's position, the window's position, and the direction the chair is facing.
The Loop:
1. Scout brings a big pile of data.
2. Editor cuts out the junk.
3. Editor checks the remaining pile: "Is this enough to solve the mystery?"
4. If No: The Editor sends a specific note back to the Scout: "I need the exact angle of the door. Go get that."
5. If Yes: The Editor solves the puzzle using only the clean, essential facts.

Why This is a Big Deal

Less is More: By forcing the AI to ignore distractions, it becomes much more accurate. It's like turning off the radio while driving so you can focus on the road.
No New Training: The system doesn't need to be retrained from scratch. It just uses a smarter way of asking questions and organizing answers.
Teaching Tool: Because the system writes down exactly what information it used and how it reasoned, it creates a perfect "study guide" for future AI models to learn from.

The Result

When tested on tough spatial reasoning challenges (like figuring out where objects are in a complex room), this "Scout and Editor" team outperformed the baselines it was compared against, including proprietary models from big tech companies.

In a nutshell: The paper teaches AI to stop trying to memorize the whole library and start learning how to find the one specific book it needs to solve the problem. By being a "minimalist," the AI becomes a "master."

1. Problem Statement

The paper addresses the persistent challenge of 3D spatial reasoning in Vision-Language Models (VLMs). Despite advancements, VLMs struggle to ground language in 3D understanding, leading to failures in tasks requiring layout, orientation, and depth perception. The authors identify two fundamental bottlenecks:

Inadequate 3D Perception: VLMs are predominantly trained on 2D data, lacking geometric priors necessary to understand 3D structures like depth and orientation.
Redundancy-Induced Reasoning Failure: 3D environments are information-dense. Naively aggregating all perceptual data floods the model's context window with irrelevant details. This "information overload" dilutes attention, encourages shortcut heuristics, and degrades reasoning performance.

The authors draw inspiration from cognitive science (mental models) and statistics (Minimal Sufficient Statistics), proposing that robust reasoning requires constructing a Minimal Sufficient Set (MSS)—the most compact representation of information strictly necessary to answer a specific query.
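One way to make the idea precise is sketched below. This is an illustrative reading, not the paper's own notation: $I$ (all available perceptual information), $q$ (the query), and $a(\cdot, q)$ (the answer an ideal reasoner would give from a set of facts) are symbols assumed here for the example.

```latex
S \subseteq I \text{ is sufficient for } q
  \iff a(S, q) = a(I, q),
\qquad
S \text{ is minimal}
  \iff \nexists\, S' \subsetneq S \text{ such that } a(S', q) = a(I, q).
```

Under this reading, an MSS is any set that satisfies both conditions: nothing needed is missing, and nothing in it can be dropped without changing the answer.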

2. Methodology: MSSR Framework

The authors introduce MSSR (Minimal Sufficient Spatial Reasoner), a zero-shot, training-free dual-agent framework designed to iteratively curate an MSS.

A. Dual-Agent Architecture

The framework operates via a closed-loop interaction between two specialized agents:

Perception Agent (PA):
- Role: Acts as the perception engine, bridging high-level reasoning directives with raw 3D scene data.
- Mechanism: Uses Visual Programming to generate Python scripts that invoke a suite of specialized tools. It maintains a stateful execution environment, allowing it to build upon previous computations without redundancy.
- Key Module (SOG): Introduces the Situated Orientation Grounding (SOG) module. Instead of asking the VLM to regress 3D vectors directly (which is error-prone), SOG reframes orientation as a multi-choice visual selection task. It renders candidate 3D direction vectors (coarse-to-fine) onto 2D images (Situated and Canonical views) and asks the VLM to select the correct direction. This robustly handles complex, language-grounded directions (e.g., "the direction the person is facing while ascending stairs").
- Tools: Includes modules for 3D reconstruction (using VGGT), object localization, global coordinate calibration, and numerical computation.
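A minimal sketch of the coarse-to-fine selection idea behind SOG, under strong simplifying assumptions: directions are reduced to a single heading angle in the ground plane rather than 3D vectors rendered onto images, and the VLM's multiple-choice answer is replaced by a hypothetical `oracle` selector that picks the candidate nearest a known true heading. The function names and the parameters `k` and `rounds` are illustrative and not taken from the paper.

```python
from typing import Callable, List


def angular_diff(a: float, b: float) -> float:
    """Smallest absolute difference between two headings, in degrees."""
    d = abs(a - b) % 360.0
    return min(d, 360.0 - d)


def candidates(center: float, span: float, k: int) -> List[float]:
    """k evenly spaced headings covering `span` degrees around `center`."""
    step = span / k
    start = center - span / 2 + step / 2
    return [(start + i * step) % 360.0 for i in range(k)]


def coarse_to_fine(select: Callable[[List[float]], int],
                   k: int = 8, rounds: int = 3) -> float:
    """Repeatedly pose a k-way choice, narrowing the search sector each round."""
    center, span = 0.0, 360.0
    for _ in range(rounds):
        options = candidates(center, span, k)
        center = options[select(options)]  # the selector returns an index
        span /= k                          # next round refines the chosen sector
    return center


def oracle(true_heading: float) -> Callable[[List[float]], int]:
    """Hypothetical stand-in for the VLM: picks the candidate nearest the truth."""
    def select(options: List[float]) -> int:
        return min(range(len(options)),
                   key=lambda i: angular_diff(options[i], true_heading))
    return select


estimate = coarse_to_fine(oracle(100.0))
error = angular_diff(estimate, 100.0)
```

With a perfect selector, each round shrinks the remaining uncertainty by a factor of `k`, so three 8-way choices narrow a full circle to well under one degree. A real VLM would make occasional wrong picks, which this sketch does not model.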
Reasoning Agent (RA):
- Role: Acts as the cognitive core, ensuring the information set is both sufficient and minimal.
- Mechanism: Operates in a two-stage loop:
  - Plan-Guided Curation: The RA formulates a reasoning plan, scrutinizes the current information set ($S_n$), and prunes any data not causally linked to the plan.
  - Strategic Decision:
    - If the set is insufficient, the RA issues a targeted natural language request to the PA for specific missing data.
    - If the set is sufficient, the RA discards all prior context and reasons exclusively over the curated MSS using Chain-of-Thought (CoT) to generate the final answer.

B. The Iterative Process

The process starts with an empty set. The PA gathers a broad, potentially redundant set of spatial primitives. The RA prunes this set. If insufficient, the loop repeats with targeted requests until the MSS is formed. The final answer is derived only from this minimal set, preventing distraction from irrelevant data.
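The loop described above can be sketched as follows. This is a toy illustration, not the authors' implementation: the `Facts` dictionary, the `perceive` and `answer` callables, and the `required` key set (standing in for the RA's reasoning plan, which a real system would derive with a VLM) are all hypothetical, and the scene values are invented.

```python
from typing import Callable, Dict, List, Optional, Set, Tuple

Facts = Dict[str, float]  # hypothetical fact store: fact name -> measured value


def curate(facts: Facts, required: Set[str]) -> Facts:
    """Keep only the facts the reasoning plan depends on."""
    return {k: v for k, v in facts.items() if k in required}


def mssr_loop(
    perceive: Callable[[Optional[List[str]]], Facts],
    required: Set[str],
    answer: Callable[[Facts], str],
    max_rounds: int = 4,
) -> Tuple[str, Facts, int]:
    info: Facts = {}                     # the process starts with an empty set
    request: Optional[List[str]] = None  # None means "gather broadly"
    for round_no in range(1, max_rounds + 1):
        info.update(perceive(request))   # PA adds (possibly redundant) facts
        info = curate(info, required)    # RA prunes what the plan does not use
        missing = sorted(required - set(info))
        if not missing:                  # sufficient: answer from the curated set only
            return answer(info), info, round_no
        request = missing                # insufficient: targeted request to the PA
    raise RuntimeError("no sufficient set found within max_rounds")


# Toy scene for "Is the chair facing the window?" (all values invented).
SCENE: Facts = {
    "chair.x": 1.0, "chair.facing_x": 1.0, "window.x": 3.0,
    "rug.area": 4.0, "lamp.height": 1.5,
}


def toy_perceive(request: Optional[List[str]]) -> Facts:
    if request is None:  # first pass: a broad sweep that misses the orientation
        return {k: v for k, v in SCENE.items() if k != "chair.facing_x"}
    return {k: SCENE[k] for k in request if k in SCENE}


def toy_answer(info: Facts) -> str:
    # Positive when the facing direction points from the chair toward the window.
    toward_window = (info["window.x"] - info["chair.x"]) * info["chair.facing_x"]
    return "yes" if toward_window > 0 else "no"


REQUIRED = {"chair.x", "chair.facing_x", "window.x"}
result, mss, rounds = mssr_loop(toy_perceive, REQUIRED, toy_answer)
```

In this toy run the first sweep lacks the chair's orientation, so the loop issues one targeted request and then answers from a three-fact set. The rug and lamp facts never reach the answering step.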

3. Key Contributions

Formalization of Minimal Sufficiency: The paper formalizes 3D spatial reasoning as the construction of a Minimal Sufficient Set (MSS), addressing the dual challenges of 3D perception gaps and information redundancy.
Dual-Agent Framework: Introduces MSSR, a zero-shot framework that decouples perception and reasoning. It features a programmable Perception Agent with a novel SOG module for robust directional grounding and a Reasoning Agent that actively prunes and curates information.
Interpretable Reasoning Traces: The framework produces explicit reasoning paths (MSS + CoT), which serve as high-quality supervision data for training future 3D-aware models.

4. Experimental Results

The method was evaluated on two challenging benchmarks: MMSI-Bench (situated multi-view reasoning) and ViewSpatial-Bench (perspective relational understanding).

State-of-the-Art Performance:
- On MMSI-Bench, MSSR achieved 49.5% accuracy, outperforming the strongest proprietary model (o3 at 41.0%) by 8.5 percentage points and the best open-source model (Qwen3-VL-8B at 31.1%) by 18.4 percentage points, a relative gain of roughly 59%.
- On ViewSpatial-Bench, it achieved 51.8% overall accuracy, surpassing all baselines including specialist 3D-VLMs and agentic frameworks.
Ablation Studies:
- Minimality: Experiments showed an inverse relationship between set size and accuracy. Reducing the information set from 17.3 items to 5.9 items on average increased accuracy from 45.8% to 48.3%, which supports the claim that redundancy is a significant source of error.
- Component Analysis: Removing the Reasoning Agent (Only PA) or the SOG module caused significant performance drops, validating the necessity of both pruning and robust orientation grounding.
Generalizability: MSSR consistently improved performance across various backbones (from 7B to 72B parameters). It also demonstrated a cost-effective deployment strategy: using a strong model for the Perception Agent and a lighter model for the Reasoning Agent retained 90% of the performance while reducing costs.
Data Annotation: Fine-tuning a 7B model on data annotated by MSSR improved its accuracy by 4.2%, demonstrating the framework's utility as a data engine.
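Because the figures in this section mix absolute gaps (percentage points) with relative gains (percent), a short sketch that recomputes both from the reported accuracies may help when comparing them. The helper `margin` is a name introduced here for illustration. Only the accuracy and set-size numbers quoted in this section are used.

```python
from typing import Tuple


def margin(ours: float, baseline: float) -> Tuple[float, float]:
    """Return (absolute gap in percentage points, relative gain in percent)."""
    points = ours - baseline
    return round(points, 1), round(100.0 * points / baseline, 1)


vs_o3 = margin(49.5, 41.0)     # MMSI-Bench: MSSR vs. o3
vs_qwen = margin(49.5, 31.1)   # MMSI-Bench: MSSR vs. Qwen3-VL-8B
ablation = margin(48.3, 45.8)  # accuracy after vs. before pruning

# Share of the information set removed by pruning (17.3 -> 5.9 items).
set_reduction = round(100.0 * (17.3 - 5.9) / 17.3, 1)
```

On these numbers the gap to o3 is 8.5 points (about 20.7% relative), the gap to Qwen3-VL-8B is 18.4 points (about 59.2% relative), and pruning removes about 65.9% of the items for a 2.5-point accuracy gain.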

5. Significance

Paradigm Shift: MSSR moves away from the "accumulate everything" approach common in agentic systems, advocating instead for active information curation. Its results indicate that for complex spatial tasks, less information (if it is the right information) leads to better reasoning.
Zero-Shot Efficiency: Unlike methods requiring expensive 3D instruction tuning or retraining, MSSR is a zero-shot framework that leverages existing VLM capabilities enhanced by tool use.
Scalability and Interpretability: By producing structured, minimal reasoning traces, MSSR not only solves the immediate task but also provides a scalable method for generating high-quality training data to distill spatial reasoning capabilities into future models.
Robustness: The SOG module addresses a critical gap in existing VLMs: the inability to ground complex, situational orientations in 3D space without costly end-to-end training.

In conclusion, MSSR demonstrates that explicitly pursuing minimal sufficiency through a collaborative dual-agent loop is an effective route to robust, high-accuracy spatial reasoning in Vision-Language Models.