Holi-Spatial: Evolving Video Streams into Holistic 3D Spatial Intelligence
This paper introduces Holi-Spatial, the first fully automated, large-scale, spatially aware multimodal dataset constructed from raw video streams without human intervention. It provides 4 million high-quality 3D semantic annotations and spatial QA pairs, which are intended to significantly improve the training and performance of Vision-Language Models on spatial reasoning tasks.
Yuanyuan Gao, Hao Li, Yifei Liu, Xinhao Ji, Yuning Gong, Yuanjun Liao, Fangfu Liu, Manyuan Zhang, Yuchen Yang, Dan Xu, Xue Yang, Huaxi Huang, Hongjie Zhang, Ziwei Liu, Xiao Sun, Dingwen Zhang, Zhihang Zhong
2026-03-10, cs