Imagine you walk into a friend's house for the first time. You look around, see a red mug on a table near a window, and then you leave. A few days later, your friend asks, "Hey, where did I put that red mug?"
If you only had a short-term memory, you might say, "I think it was on a table, but I'm not sure which one." But if you had a perfect, 3D mental map of that house, you could instantly say, "It's on the wooden table, directly to the left of the big window, about three steps from the front door."
SpatialMem is a computer system designed to build that perfect, 3D mental map for robots and AI assistants, using nothing but a standard video camera (like the one on your phone).
Here is how it works, broken down into simple concepts:
1. The Problem: The "Amnesia" of Current AI
Most AI systems today are like a person watching a movie frame-by-frame. They see a picture of a mug, then a picture of a sofa, but they don't really understand where those things are in relation to each other in 3D space. They also struggle to remember things after the video ends. To build a 3D map, robots usually need expensive, specialized hardware (like laser scanners).
SpatialMem wants to do this with just a regular video camera, turning a messy, casual video walk-through into a structured, searchable 3D database.
2. The Solution: Building a "Digital Skeleton"
Think of the system as a construction crew building a house, but instead of bricks, they are building a 3D memory tree.
Step 1: The Foundation (The Skeleton):
First, the system watches the video and figures out the "bones" of the room. It ignores the clutter for a moment and identifies the permanent structures: the walls, the doors, and the windows.
- Analogy: Imagine drawing a blueprint of a room on a piece of paper. You draw the walls and the doorways first. These are your Anchors. They don't move, so they are the perfect reference points.
Step 2: Filling in the Furniture (The Objects):
Next, it looks at the stuff inside the room (the sofa, the mug, the lamp) and attaches it to the blueprint. It doesn't just say "there is a mug"; it says, "The mug is on the table, which is next to the window."
- Analogy: Now you are placing furniture on your blueprint. You know exactly how far the sofa is from the wall because you measured it in 3D space.
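The two build steps above can be sketched as a tiny tree structure in Python. This is only an illustration of the idea (anchors first, objects hung off them); the class name, fields, and coordinates are invented for this example, not SpatialMem's actual API.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    """One entry in the memory tree: an anchor (wall/door/window) or an object."""
    name: str
    kind: str                           # "anchor" or "object"
    position: tuple                     # (x, y, z) in the room's coordinate frame
    children: list = field(default_factory=list)

    def attach(self, child: "Node") -> "Node":
        """Hang a child (a nearby anchor or object) off this node."""
        self.children.append(child)
        return child

# Step 1: the permanent skeleton first (the blueprint).
room = Node("room", "anchor", (0, 0, 0))
wall = room.attach(Node("north wall", "anchor", (0, 2.0, 0)))
window = wall.attach(Node("window", "anchor", (1.0, 2.0, 1.2)))

# Step 2: the furniture, attached to the nearest anchor.
table = window.attach(Node("wooden table", "object", (1.0, 1.5, 0)))
mug = table.attach(Node("red mug", "object", (1.0, 1.5, 0.75)))
```

Because every object hangs off a fixed anchor, a shaky camera or a moved chair doesn't break the overall map: the walls and windows stay put.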
Step 3: The Two-Layer Note-Taking (The Description):
This is the system's secret sauce. For every object, it writes two types of notes:
- The Snapshot: "Right now, the mug looks red and is slightly tilted." (This changes if the camera moves.)
- The Permanent Fact: "The mug is a ceramic cup, located on the kitchen table, near the north wall." (This stays true no matter where you look).
- Analogy: It's like having a sticky note on a photo (temporary view) and a permanent label in a filing cabinet (stable fact). This helps the AI answer questions even if the lighting changes or the object is partially hidden.
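A minimal sketch of the two-layer note idea: the snapshot layer is overwritten every time the camera sees the object again, while the stable layer is written once and kept. The dictionary shape and strings here are illustrative assumptions, not the system's real storage format.

```python
# Two-layer notes for one object.
memory = {
    "red mug": {
        "snapshot": "looks red, slightly tilted, half in frame",       # view-dependent
        "stable": "ceramic cup on the kitchen table, near the north wall",  # viewpoint-free
    }
}

def update_view(name: str, new_snapshot: str) -> None:
    """Refresh only the view-dependent layer; the stable fact survives."""
    memory[name]["snapshot"] = new_snapshot

# The camera moves and the lighting changes...
update_view("red mug", "in shadow, barely visible from this angle")
# ...but the permanent label in the "filing cabinet" is untouched.
```

This is why the system can still answer "where is the mug?" when the current view is dark or the mug is partially hidden: the answer comes from the stable layer, not the latest snapshot.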
3. How You Use It: The "GPS" for Questions
Once the memory is built, you can ask the system questions in plain English, and it acts like a GPS for your memory.
- Question: "Where is the red mug?"
- The System's Thought Process:
  - It looks at its "blueprint" (the 3D anchors).
  - It finds the "North Wall" anchor.
  - It finds the "Window" anchor near that wall.
  - It finds the "Table" anchor near the window.
  - It finds the "Red Mug" attached to that table.
- Answer: "The red mug is on the table, next to the window on the north wall."
It can also give navigation instructions: "Go straight, turn left at the door, pass the TV, and the sofa is near the window."
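The "GPS" lookup above is just a walk down the memory tree: start at the anchors, descend until you hit the object, and report the chain of landmarks you passed. Here is a self-contained sketch of that idea using nested dicts as a stand-in for the real memory structure (the layout and names are assumptions for illustration only).

```python
# The memory tree as nested dicts: each anchor contains whatever sits near it.
scene = {
    "north wall": {
        "window": {
            "table": {
                "red mug": {},
            },
        },
    },
}

def locate(tree, target, path=()):
    """Depth-first walk from the anchors down to the target object,
    returning the chain of landmarks that leads to it (or None)."""
    for name, children in tree.items():
        here = path + (name,)
        if name == target:
            return here
        found = locate(children, target, here)
        if found:
            return found
    return None

print(locate(scene, "red mug"))
# -> ('north wall', 'window', 'table', 'red mug')
```

Reading the returned chain aloud gives you exactly the kind of answer the article describes: "on the table, next to the window, on the north wall."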
4. Why This is a Big Deal
- No Expensive Gear: You don't need a robot with a laser scanner. You can just use a phone camera.
- It Understands "Where": It doesn't just recognize objects; it understands the distance and direction between them.
- It Handles Mess: Even if the room is cluttered or the video is shaky, the system focuses on the permanent "skeleton" (walls/doors) to keep its bearings, so it doesn't get lost.
The Bottom Line
SpatialMem is like giving an AI a permanent, 3D diary of a room. Instead of just remembering "I saw a mug," it remembers "The mug is 2 meters from the door, to the left of the window, on a blue table." This allows robots and AR assistants to navigate and answer complex questions about our world using only a simple video camera.