Holi-Spatial: Evolving Video Streams into Holistic 3D Spatial Intelligence

This paper introduces Holi-Spatial, the first fully automated, large-scale, spatially-aware multimodal dataset built from raw video streams without human intervention. It provides 4 million high-quality 3D semantic annotations and spatial QA pairs that significantly improve the training and spatial-reasoning performance of Vision-Language Models.

Yuanyuan Gao, Hao Li, Yifei Liu, Xinhao Ji, Yuning Gong, Yuanjun Liao, Fangfu Liu, Manyuan Zhang, Yuchen Yang, Dan Xu, Xue Yang, Huaxi Huang, Hongjie Zhang, Ziwei Liu, Xiao Sun, Dingwen Zhang, Zhihang Zhong

Published 2026-03-10

Imagine you want to teach a robot how to navigate a house, find a specific red chair, or understand that a lamp is "to the left" of a sofa. To do this, the robot needs Spatial Intelligence—the ability to understand the 3D world, not just flat pictures.

The problem? Teaching robots this way is incredibly hard and slow. Right now, scientists have to manually label thousands of 3D scans (like a human drawing boxes around every object in a 3D model), which is like trying to fill a swimming pool with a teaspoon. It's too slow, too expensive, and the data is limited.

Enter "Holi-Spatial."

Think of Holi-Spatial as a super-powered, automated 3D construction crew that can turn raw, messy video footage into a perfectly organized, labeled 3D world map—without a single human needing to draw a box or write a label.

Here is how it works, broken down into three simple steps using a creative analogy:

The Analogy: Building a 3D Lego City from a Video Tour

Imagine you have a shaky video of someone walking through a messy living room. You want to turn this video into a perfect, digital Lego city where every chair, lamp, and rug is a distinct, labeled block.

Step 1: The "Ghost Hunter" (Geometric Optimization)

First, the system takes the video and tries to build a 3D skeleton of the room.

  • The Problem: If you just use a standard camera app, the 3D model looks like a foggy ghost town. There are "floaters" (ghostly bits of furniture floating in mid-air) and blurry edges.
  • The Holi-Spatial Fix: It uses a technique called 3D Gaussian Splatting. Imagine this as a high-tech "fog cleaner." It combines all the different angles from the video and smooths out the fog, removing the ghosts and sharpening the edges until the 3D structure is solid, clean, and physically accurate. Now, the room has a real shape.
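The "ghost hunting" above can be sketched as a filtering pass over the Gaussians: drop blobs that are nearly transparent or that float far from everything else. This is a minimal illustrative sketch, not the paper's actual cleanup procedure; the function name and thresholds are assumptions.

```python
import numpy as np

def prune_floaters(positions, opacities, opacity_thresh=0.05, k=8, dist_thresh=0.5):
    """Drop Gaussians that are nearly transparent or spatially isolated.

    positions: (N, 3) Gaussian centers; opacities: (N,) values in [0, 1].
    All names and thresholds here are illustrative, not from the paper.
    """
    keep = opacities >= opacity_thresh  # nearly transparent blobs are likely noise

    # A "floater" also tends to sit far from its neighbors: compare each
    # point's mean distance to its k nearest neighbors against a threshold.
    d2 = ((positions[:, None, :] - positions[None, :, :]) ** 2).sum(-1)  # (N, N)
    np.fill_diagonal(d2, np.inf)                     # ignore self-distance
    knn = np.sqrt(np.sort(d2, axis=1)[:, :k]).mean(axis=1)  # mean k-NN distance
    keep &= knn <= dist_thresh

    return positions[keep], opacities[keep]
```

A brute-force pairwise distance matrix keeps the sketch dependency-free; a real pipeline with millions of Gaussians would use a spatial index (e.g. a k-d tree) instead.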

Step 2: The "Eagle-Eyed Detective" (Image-Level Perception)

Now that the room has a shape, the system needs to know what the objects are.

  • The Problem: A robot might see a "red thing" and think it's a ball, a shirt, or a chair.
  • The Holi-Spatial Fix: It uses a super-smart AI (a Vision-Language Model) that acts like a detective with a magnifying glass. It looks at key frames of the video and says, "That's a 'vibrant red fabric sofa with blue pillows'."
  • The Magic: It doesn't just guess; it draws a perfect 2D outline (a mask) around the sofa. Then, it "lifts" that 2D outline into the 3D space we built in Step 1, turning the flat drawing into a 3D block.
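The "lifting" step is standard pinhole back-projection: every masked pixel, combined with its depth from the Step-1 geometry, becomes a 3D point, and the points bound a 3D box. A hedged sketch; the paper's exact lifting procedure may differ, and the function names are illustrative.

```python
import numpy as np

def lift_mask_to_3d(mask, depth, fx, fy, cx, cy):
    """Back-project a 2D instance mask into a camera-frame 3D point cloud.

    mask:  (H, W) bool -- the detector's 2D outline for one object.
    depth: (H, W) metric depth per pixel (from the reconstructed geometry).
    fx, fy, cx, cy: pinhole camera intrinsics.
    """
    v, u = np.nonzero(mask)          # pixel rows (v) and columns (u) inside the mask
    z = depth[v, u]
    valid = z > 0                    # skip pixels with no depth estimate
    u, v, z = u[valid], v[valid], z[valid]
    x = (u - cx) * z / fx            # invert the pinhole projection
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=1)   # (M, 3) points

def mask_to_aabb(points):
    """Axis-aligned 3D box (min corner, max corner) around the lifted points."""
    return points.min(axis=0), points.max(axis=0)
```

In practice the points would also be transformed from the camera frame into the shared world frame using each frame's camera pose, so that boxes from different views land in the same coordinate system.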

Step 3: The "Project Manager" (Scene-Level Refinement)

This is where the magic really happens. Because the video has many angles, the system might have created three different "sofa" blocks for the same real sofa (one from the left, one from the right, one from the back).

  • The Problem: You don't want three sofas in your digital city; you want one.
  • The Holi-Spatial Fix: A "Project Manager" AI steps in. It looks at all the candidates, checks if they overlap, and merges them into one perfect 3D sofa.
    • If a candidate looks suspicious (low confidence), the Manager calls in a VLM Agent (a second AI) to zoom in and double-check: "Is this really a sofa, or just a pile of blankets?"
    • Once verified, the system writes a detailed description for the sofa and generates Question & Answer (QA) pairs.
    • Example QA: "If you are standing at the door, is the sofa in front of you or behind you?"
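The "Project Manager" logic above boils down to greedy merging: keep the most confident detection of each object, fold overlapping same-label boxes into it, and send low-confidence survivors to a verifier. A minimal sketch under those assumptions; the merge rule, thresholds, and the `verify` callback (standing in for the VLM agent) are all illustrative, not the paper's implementation.

```python
def iou_3d(a, b):
    """Intersection-over-union of two axis-aligned 3D boxes (min_xyz, max_xyz)."""
    lo = [max(a[0][i], b[0][i]) for i in range(3)]
    hi = [min(a[1][i], b[1][i]) for i in range(3)]
    inter = 1.0
    for l, h in zip(lo, hi):
        if h <= l:
            return 0.0                # boxes do not overlap on this axis
        inter *= h - l
    vol = lambda box: ((box[1][0] - box[0][0]) *
                       (box[1][1] - box[0][1]) *
                       (box[1][2] - box[0][2]))
    return inter / (vol(a) + vol(b) - inter)

def merge_candidates(candidates, iou_thresh=0.5, conf_thresh=0.3, verify=None):
    """Greedily merge per-view detections so each real object appears once.

    candidates: dicts {"box": (min, max), "label": str, "score": float}.
    verify: optional callback asked to confirm low-confidence survivors
    (the role the VLM agent plays in the paper's pipeline).
    """
    merged = []
    for c in sorted(candidates, key=lambda c: c["score"], reverse=True):
        for m in merged:
            if m["label"] == c["label"] and iou_3d(m["box"], c["box"]) >= iou_thresh:
                # Same object seen from another view: grow the box to cover both.
                lo = tuple(min(a, b) for a, b in zip(m["box"][0], c["box"][0]))
                hi = tuple(max(a, b) for a, b in zip(m["box"][1], c["box"][1]))
                m["box"] = (lo, hi)
                break
        else:
            merged.append(dict(c))
    if verify is not None:
        merged = [m for m in merged if m["score"] >= conf_thresh or verify(m)]
    return merged
```

Run on three overlapping "sofa" detections and one "lamp", this collapses the sofas into a single box while leaving the lamp untouched.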

The Result: Holi-Spatial-4M

By running this automated pipeline on thousands of hours of video, the researchers created Holi-Spatial-4M.

  • Scale: It's massive. It contains 4 million annotations (labels, boxes, and questions).
  • Diversity: Unlike old datasets that only had 50 types of objects (like "chair" or "table"), this one knows about "vintage wooden lanterns," "smart fridges," and "patterned throw pillows."
  • Quality: It's so good that when they used it to train other AI models, those models became 64% better at finding objects in 3D space and 15% better at understanding spatial relationships.

Why This Matters

Before Holi-Spatial, building a smart 3D AI was like trying to build a skyscraper by hand, one brick at a time. Holi-Spatial is like a 3D printer that can print a whole city of labeled data overnight.

This means we can now train robots, self-driving cars, and AR glasses to understand the real world much faster, cheaper, and more accurately than ever before. It turns the chaotic internet of videos into a structured library of 3D knowledge, ready for the next generation of smart machines.