Holi-Spatial: Evolving Video Streams into Holistic 3D Spatial Intelligence

This paper introduces Holi-Spatial, the first fully automated, large-scale, spatially-aware multimodal dataset built from raw video streams without human intervention. It provides 4 million high-quality 3D semantic annotations and spatial QA pairs that significantly improve the training and spatial-reasoning performance of Vision-Language Models.

Yuanyuan Gao, Hao Li, Yifei Liu, Xinhao Ji, Yuning Gong, Yuanjun Liao, Fangfu Liu, Manyuan Zhang, Yuchen Yang, Dan Xu, Xue Yang, Huaxi Huang, Hongjie Zhang, Ziwei Liu, Xiao Sun, Dingwen Zhang, Zhihang Zhong

Published 2026-03-10

Imagine you want to teach a robot how to navigate a house, find a specific red chair, or understand that a lamp is "to the left" of a sofa. To do this, the robot needs Spatial Intelligence—the ability to understand the 3D world, not just flat pictures.

The problem? Teaching robots this way is incredibly hard and slow. Right now, scientists have to manually label thousands of 3D scans (like a human drawing boxes around every object in a 3D model), which is like trying to fill a swimming pool with a teaspoon. It's too slow, too expensive, and the data is limited.

Enter "Holi-Spatial."

Think of Holi-Spatial as a super-powered, automated 3D construction crew that can turn raw, messy video footage into a perfectly organized, labeled 3D world map—without a single human needing to draw a box or write a label.

Here is how it works, broken down into three simple steps using a creative analogy:

The Analogy: Building a 3D Lego City from a Video Tour

Imagine you have a shaky video of someone walking through a messy living room. You want to turn this video into a perfect, digital Lego city where every chair, lamp, and rug is a distinct, labeled block.

Step 1: The "Ghost Hunter" (Geometric Optimization)

First, the system takes the video and tries to build a 3D skeleton of the room.

  • The Problem: If you just use a standard camera app, the 3D model looks like a foggy ghost town. There are "floaters" (ghostly bits of furniture floating in mid-air) and blurry edges.
  • The Holi-Spatial Fix: It uses a technique called 3D Gaussian Splatting. Imagine this as a high-tech "fog cleaner." It combines all the different angles from the video and smooths out the fog, removing the ghosts and sharpening the edges until the 3D structure is solid, clean, and physically accurate. Now, the room has a real shape.
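The "ghost hunting" above can be sketched as a filtering pass over the Gaussians: drop blobs that are nearly transparent or that float far from everything else. This is a minimal illustrative sketch, not the paper's actual cleanup procedure; the function name and thresholds are assumptions.

```python
import numpy as np

def prune_floaters(positions, opacities, opacity_thresh=0.05, k=8, dist_thresh=0.5):
    """Drop Gaussians that are nearly transparent or spatially isolated.

    positions: (N, 3) Gaussian centers; opacities: (N,) values in [0, 1].
    All names and thresholds here are illustrative, not from the paper.
    """
    keep = opacities >= opacity_thresh  # nearly transparent blobs are likely noise

    # A "floater" also tends to sit far from its neighbors: compare each
    # point's mean distance to its k nearest neighbors against a threshold.
    d2 = ((positions[:, None, :] - positions[None, :, :]) ** 2).sum(-1)  # (N, N)
    np.fill_diagonal(d2, np.inf)                     # ignore self-distance
    knn = np.sqrt(np.sort(d2, axis=1)[:, :k]).mean(axis=1)  # mean k-NN distance
    keep &= knn <= dist_thresh

    return positions[keep], opacities[keep]
```

A brute-force pairwise distance matrix keeps the sketch dependency-free; a real pipeline with millions of Gaussians would use a spatial index (e.g. a k-d tree) instead.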

Step 2: The "Eagle-Eyed Detective" (Image-Level Perception)

Now that the room has a shape, the system needs to know what the objects are.

  • The Problem: A robot might see a "red thing" and think it's a ball, a shirt, or a chair.
  • The Holi-Spatial Fix: It uses a super-smart AI (a Vision-Language Model) that acts like a detective with a magnifying glass. It looks at key frames of the video and says, "That's a 'vibrant red fabric sofa with blue pillows'."
  • The Magic: It doesn't just guess; it draws a perfect 2D outline (a mask) around the sofa. Then, it "lifts" that 2D outline into the 3D space we built in Step 1, turning the flat drawing into a 3D block.
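The "lifting" step is standard pinhole back-projection: every masked pixel, combined with its depth from the Step-1 geometry, becomes a 3D point, and the points bound a 3D box. A hedged sketch; the paper's exact lifting procedure may differ, and the function names are illustrative.

```python
import numpy as np

def lift_mask_to_3d(mask, depth, fx, fy, cx, cy):
    """Back-project a 2D instance mask into a camera-frame 3D point cloud.

    mask:  (H, W) bool -- the detector's 2D outline for one object.
    depth: (H, W) metric depth per pixel (from the reconstructed geometry).
    fx, fy, cx, cy: pinhole camera intrinsics.
    """
    v, u = np.nonzero(mask)          # pixel rows (v) and columns (u) inside the mask
    z = depth[v, u]
    valid = z > 0                    # skip pixels with no depth estimate
    u, v, z = u[valid], v[valid], z[valid]
    x = (u - cx) * z / fx            # invert the pinhole projection
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=1)   # (M, 3) points

def mask_to_aabb(points):
    """Axis-aligned 3D box (min corner, max corner) around the lifted points."""
    return points.min(axis=0), points.max(axis=0)
```

In practice the points would also be transformed from the camera frame into the shared world frame using each frame's camera pose, so that boxes from different views land in the same coordinate system.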

Step 3: The "Project Manager" (Scene-Level Refinement)

This is where the magic really happens. Because the video has many angles, the system might have created three different "sofa" blocks for the same real sofa (one from the left, one from the right, one from the back).

  • The Problem: You don't want three sofas in your digital city; you want one.
  • The Holi-Spatial Fix: A "Project Manager" AI steps in. It looks at all the candidates, checks if they overlap, and merges them into one perfect 3D sofa.
    • If a candidate looks suspicious (low confidence), the Manager calls in a VLM Agent (a second AI) to zoom in and double-check: "Is this really a sofa, or just a pile of blankets?"
    • Once verified, the system writes a detailed description for the sofa and generates Question & Answer (QA) pairs.
    • Example QA: "If you are standing at the door, is the sofa in front of you or behind you?"
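The "Project Manager" logic above boils down to greedy merging: keep the most confident detection of each object, fold overlapping same-label boxes into it, and send low-confidence survivors to a verifier. A minimal sketch under those assumptions; the merge rule, thresholds, and the `verify` callback (standing in for the VLM agent) are all illustrative, not the paper's implementation.

```python
def iou_3d(a, b):
    """Intersection-over-union of two axis-aligned 3D boxes (min_xyz, max_xyz)."""
    lo = [max(a[0][i], b[0][i]) for i in range(3)]
    hi = [min(a[1][i], b[1][i]) for i in range(3)]
    inter = 1.0
    for l, h in zip(lo, hi):
        if h <= l:
            return 0.0                # boxes do not overlap on this axis
        inter *= h - l
    vol = lambda box: ((box[1][0] - box[0][0]) *
                       (box[1][1] - box[0][1]) *
                       (box[1][2] - box[0][2]))
    return inter / (vol(a) + vol(b) - inter)

def merge_candidates(candidates, iou_thresh=0.5, conf_thresh=0.3, verify=None):
    """Greedily merge per-view detections so each real object appears once.

    candidates: dicts {"box": (min, max), "label": str, "score": float}.
    verify: optional callback asked to confirm low-confidence survivors
    (the role the VLM agent plays in the paper's pipeline).
    """
    merged = []
    for c in sorted(candidates, key=lambda c: c["score"], reverse=True):
        for m in merged:
            if m["label"] == c["label"] and iou_3d(m["box"], c["box"]) >= iou_thresh:
                # Same object seen from another view: grow the box to cover both.
                lo = tuple(min(a, b) for a, b in zip(m["box"][0], c["box"][0]))
                hi = tuple(max(a, b) for a, b in zip(m["box"][1], c["box"][1]))
                m["box"] = (lo, hi)
                break
        else:
            merged.append(dict(c))
    if verify is not None:
        merged = [m for m in merged if m["score"] >= conf_thresh or verify(m)]
    return merged
```

Run on three overlapping "sofa" detections and one "lamp", this collapses the sofas into a single box while leaving the lamp untouched.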

The Result: Holi-Spatial-4M

By running this automated pipeline on thousands of hours of video, the researchers created Holi-Spatial-4M.

  • Scale: It's massive. It contains 4 million annotations (labels, boxes, and questions).
  • Diversity: Unlike old datasets that only had 50 types of objects (like "chair" or "table"), this one knows about "vintage wooden lanterns," "smart fridges," and "patterned throw pillows."
  • Quality: It's so good that when they used it to train other AI models, those models became 64% better at finding objects in 3D space and 15% better at understanding spatial relationships.

Why This Matters

Before Holi-Spatial, building a smart 3D AI was like trying to build a skyscraper by hand, one brick at a time. Holi-Spatial is like a 3D printer that can print a whole city of labeled data overnight.

This means we can now train robots, self-driving cars, and AR glasses to understand the real world much faster, cheaper, and more accurately than ever before. It turns the chaotic internet of videos into a structured library of 3D knowledge, ready for the next generation of smart machines.