Imagine you are trying to give someone directions to a hidden treasure in a room you've never seen before, but you can only describe it using words like "the chair is kind of near the table." That's how current AI models try to understand space. They use a rough, fuzzy map made of words and general ideas. It works okay for simple things, but if you need to know exactly how far the chair is from the table, or if you need to turn a corner and know where the chair is relative to your new viewpoint, that fuzzy map falls apart.
This paper introduces Video2Layout, a new way to teach AI to see the world with "laser precision."
Here is the simple breakdown using some everyday analogies:
1. The Problem: The "Pixelated" Map vs. The "Blueprint"
Think of how old video games used to work. They used Grid Maps. Imagine a room divided into a giant checkerboard (like a 10x10 grid). If a chair is in a square, the AI just knows, "The chair is in square B4."
- The Flaw: This is too rough. Is the chair in the middle of B4? The corner? Is it 1 meter away or 5? The grid doesn't know. It's like trying to measure a room with a ruler that only has inches, not millimeters.
Video2Layout replaces the checkerboard with a Blueprint. Instead of saying "Square B4," the AI learns to say, "The chair is at coordinates (-5.9, 5.7) and is exactly 1.2 meters wide." It uses continuous numbers (like a real ruler) instead of blocks. This allows the AI to do actual math on the space, not just guess.
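To make the grid's limitation concrete, here is a minimal sketch (the grid size, room span, and object positions are invented for illustration, not taken from the paper) showing how two objects over a meter apart can collapse into the same coarse grid cell, while their continuous coordinates keep them distinct:

```python
# Illustrative sketch: quantizing continuous positions into a coarse grid.
# GRID_SIZE, ROOM_SPAN, and the chair positions are made-up example values.

GRID_SIZE = 10      # a 10x10 checkerboard over the room
ROOM_SPAN = 20.0    # meters covered by the grid along each axis

def to_grid_cell(x: float, y: float) -> tuple[int, int]:
    """Quantize a continuous position into a coarse grid cell."""
    cell = ROOM_SPAN / GRID_SIZE  # 2 meters per cell
    return (int((x + ROOM_SPAN / 2) // cell),
            int((y + ROOM_SPAN / 2) // cell))

# Two chairs roughly 1.4 meters apart in continuous coordinates...
chair_a = (-5.9, 5.7)
chair_b = (-4.5, 5.9)

# ...end up in the SAME grid cell, so a grid-based model cannot tell
# their positions apart, while the continuous coordinates can.
print(to_grid_cell(*chair_a))  # -> (2, 7)
print(to_grid_cell(*chair_b))  # -> (2, 7)
```

This is the core of the "ruler with only inches" problem: once positions are snapped to cells, any distance finer than the cell size is lost.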
2. The Solution: Two-Stage Training (The "Simulator" and the "Real World")
Teaching an AI to do this is hard because real-world data is messy and expensive to label. So, the authors used a two-step training method:
Stage 1: The Video Game Simulator (Supervised Fine-Tuning)
Imagine teaching a pilot by putting them in a perfect flight simulator. The AI is fed thousands of videos from a virtual world (AI2THOR). In this world, the computer knows exactly where every object is. The AI learns to look at the video and draw a perfect "blueprint" of the room, matching the virtual coordinates. It learns the rules of geometry and how to turn a video into a math problem.
Stage 2: The Real World Flight (Reinforcement Fine-Tuning)
Now, take that pilot out of the simulator and into a real plane. The real world is messier; the lighting changes, and the camera shakes. The AI is now shown real videos (from a dataset called ScanNet). Instead of giving it the answer key, the AI tries to solve the puzzle, and if it gets the answer right, it gets a "reward." This helps it learn to apply its perfect simulator skills to the messy, real world.
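The "reward" idea can be sketched in a few lines. This is a toy illustration only: the tolerance, reward values, and function name are invented here, not the paper's actual reward design.

```python
# Toy sketch of a correctness reward for reinforcement fine-tuning:
# the model proposes an answer, and the reward signals whether it was
# close enough. Tolerance and reward values are invented for illustration.

def distance_reward(predicted_m: float, true_m: float, tol: float = 0.3) -> float:
    """Return reward 1.0 when the predicted distance is within tolerance."""
    return 1.0 if abs(predicted_m - true_m) <= tol else 0.0

print(distance_reward(3.0, 3.1))  # close enough -> 1.0
print(distance_reward(5.0, 3.1))  # too far off  -> 0.0
```

The point is that no "answer key" blueprint is needed for real videos: a simple right/wrong signal is enough to steer the model's simulator-trained skills toward messy real-world footage.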
3. How It Thinks: The "Architect" vs. The "Poet"
When you ask a normal AI a spatial question, it acts like a Poet. It writes a long, flowery story: "The chair is to the left of the table, maybe a bit behind..." This is vague and prone to errors.
Video2Layout forces the AI to act like an Architect.
- The Map Module: It first draws a precise bird's-eye view blueprint with exact coordinates.
- The Think Module: It does math. Instead of guessing, it calculates the distance: Distance = √((x₂ − x₁)² + (y₂ − y₁)²).
- The Answer Module: It gives the final answer based on that math.
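The three modules above can be sketched as a small pipeline. The object names and coordinates here are made up for illustration; only the overall Map → Think → Answer structure comes from the description above.

```python
import math

# Hypothetical sketch of the Map -> Think -> Answer structure.
# Object names and coordinates are invented example values.

# Map module output: a bird's-eye-view layout with continuous coordinates.
layout = {
    "chair": (-5.9, 5.7),
    "table": (-4.2, 3.1),
}

# Think module: explicit geometry instead of verbal guessing.
def euclidean_distance(a, b):
    (x1, y1), (x2, y2) = a, b
    return math.sqrt((x2 - x1) ** 2 + (y2 - y1) ** 2)

# Answer module: report the computed value.
distance = euclidean_distance(layout["chair"], layout["table"])
print(f"The chair is {distance:.1f} meters from the table.")
```

Because the answer is the output of a calculation rather than a verbal estimate, it can be checked, and its errors traced back to the coordinates on the map.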
4. The Results: Why It Matters
The researchers tested this new "Architect" AI against the old "Poet" AI and even against humans.
- Better Accuracy: It scored an average of 3.24% higher than the best grid-based models. In the world of AI, that's a huge jump.
- Directional Genius: It became incredibly good at answering questions like, "If I am facing the TV, where is the dog bed?" It could mentally rotate the room and give the answer with near-perfect accuracy, even beating human performance in some direction tasks.
- The Catch: It still struggles a bit with guessing exact distances for very far-away objects (like trying to measure a mountain from a mile away), but for room-scale tasks, it's a game-changer.
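The "mental rotation" in the direction tasks is exactly the kind of question continuous coordinates make answerable with simple geometry. Here is a hedged sketch (the positions and the function are invented for illustration): rotate the frame so the viewer faces one object, then read off which side another object falls on.

```python
# Illustrative sketch of egocentric direction reasoning from continuous
# coordinates. The viewer, TV, and dog-bed positions are made-up values.

def relative_side(viewer, facing, target):
    """Return 'left', 'right', or 'ahead' for target when viewer faces `facing`."""
    # Vector the viewer is looking along.
    fx, fy = facing[0] - viewer[0], facing[1] - viewer[1]
    # Vector from the viewer toward the target.
    tx, ty = target[0] - viewer[0], target[1] - viewer[1]
    # The sign of the 2D cross product says which side of the gaze
    # direction the target lies on.
    cross = fx * ty - fy * tx
    if abs(cross) < 1e-9:
        return "ahead"
    return "left" if cross > 0 else "right"

viewer = (0.0, 0.0)
tv = (0.0, 3.0)        # "If I am facing the TV..."
dog_bed = (-2.0, 1.0)  # "...where is the dog bed?"

print(relative_side(viewer, tv, dog_bed))  # -> left
```

A word-based model has to imagine this rotation; a coordinate-based one just computes it, which is why the direction tasks show the biggest gains.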
The Big Picture
Video2Layout is like giving an AI a pair of glasses that lets it see the world in 3D coordinates instead of just 2D pictures. By forcing the AI to stop guessing with words and start calculating with numbers, it finally understands the physical world the way humans do: not just as a collection of objects, but as a precise, measurable space. This is a major step toward robots that can actually navigate our homes without bumping into things.