Do Foundation Models Know Geometry? Probing Frozen Features for Continuous Physical Measurement

This paper demonstrates that frozen vision-language model features contain rich, continuous geometric information that outperforms text-based outputs by 3.3x, revealing that the accuracy bottleneck stems from training objectives and autoregressive generation rather than representational limitations, as evidenced by high-precision linear probes and consistent performance across diverse encoder architectures.

Yakov Pyotr Shkolnikov · 2026-03-09 · 🤖 cs.AI
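The probing setup described above can be illustrated with a minimal sketch: fit a ridge-regularized linear probe on frozen features to predict a continuous geometric quantity, and score it on held-out data. The features and targets here are synthetic stand-ins, not the paper's data, and the closed-form ridge solve is a generic choice, not necessarily the paper's probe.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for frozen encoder features and a continuous geometric target
# (e.g. object distance in meters); real features would come from a VLM.
n, d = 1000, 256
features = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = features @ w_true + rng.normal(scale=0.1, size=n)

# Train/test split, then fit a ridge linear probe in closed form:
# w = (X^T X + lam I)^{-1} X^T y. The encoder stays frozen; only w is learned.
X_tr, X_te = features[:800], features[800:]
y_tr, y_te = y[:800], y[800:]
lam = 1.0
w = np.linalg.solve(X_tr.T @ X_tr + lam * np.eye(d), X_tr.T @ y_tr)

pred = X_te @ w
r2 = 1 - np.sum((y_te - pred) ** 2) / np.sum((y_te - y_te.mean()) ** 2)
print(f"probe R^2 on held-out data: {r2:.3f}")
```

A high held-out R^2 from such a probe is the kind of evidence the summary cites: the continuous signal is linearly decodable from the frozen features even when text-based outputs are imprecise.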

Match4Annotate: Propagating Sparse Video Annotations via Implicit Neural Feature Matching

Match4Annotate is a lightweight framework that enables efficient, high-quality propagation of sparse point and mask annotations across and within video sequences by fitting test-time implicit neural representations to DINOv3 features, offering a scalable solution for annotation bottlenecks in specialized domains like medical imaging.

Zhuorui Zhang, Roger Pallarès-López, Praneeth Namburi, Brian W. Anthony · 2026-03-09 · 💻 cs

Self-Supervised Flow Matching for Scalable Multi-Modal Synthesis

This paper introduces Self-Flow, a self-supervised flow matching paradigm that utilizes a Dual-Timestep Scheduling mechanism to integrate representation learning directly into the generative framework, thereby eliminating the need for external models and achieving superior, scalable multi-modal synthesis across image, video, and audio.

Hila Chefer, Patrick Esser, Dominik Lorenz, Dustin Podell, Vikash Raja, Vinh Tong, Antonio Torralba, Robin Rombach · 2026-03-09 · ✓ Author reviewed · 💻 cs
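For context on the flow matching objective that Self-Flow builds on, here is a minimal sketch of standard conditional flow matching on toy 2-D data: interpolate between a noise sample and a data sample along a straight path and regress a velocity predictor toward the path's velocity. The data, predictors, and loss comparison are illustrative assumptions; this is the generic objective, not the paper's Dual-Timestep Scheduling mechanism.

```python
import numpy as np

rng = np.random.default_rng(0)

def fm_loss(predict_v, x1_batch):
    """Conditional flow-matching loss: regress the predicted velocity toward
    x1 - x0 along the straight interpolation x_t = (1 - t) x0 + t x1."""
    x0 = rng.normal(size=x1_batch.shape)           # noise endpoint
    t = rng.uniform(size=(x1_batch.shape[0], 1))   # per-example timestep
    xt = (1 - t) * x0 + t * x1_batch               # point on the path
    target_v = x1_batch - x0                       # straight-path velocity
    return np.mean((predict_v(xt, t) - target_v) ** 2)

# Toy data: a 2-D Gaussian centered at (3, 3).
data = rng.normal(loc=3.0, size=(512, 2))

# A predictor matching the mean velocity E[x1 - x0] beats the zero predictor.
best_const = data.mean(axis=0)
loss_zero = fm_loss(lambda xt, t: np.zeros_like(xt), data)
loss_const = fm_loss(lambda xt, t: np.broadcast_to(best_const, xt.shape), data)
print(f"zero predictor: {loss_zero:.2f}, mean-velocity predictor: {loss_const:.2f}")
```

In practice the constant predictor is replaced by a neural network conditioned on x_t and t; sampling then integrates the learned velocity field from noise to data.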

Artificial Intelligence for Detecting Fetal Orofacial Clefts and Advancing Medical Education

This paper presents an artificial intelligence system trained on over 45,000 ultrasound images that achieves diagnostic accuracy comparable to senior radiologists for fetal orofacial clefts, significantly enhances junior radiologists' performance when used as a copilot, and accelerates clinical expertise development for rare conditions.

Yuanji Zhang, Yuhao Huang, Haoran Dou, Xiliang Zhu, Chen Ling, Zhong Yang, Lianying Liang, Jiuping Li, Siying Liang, Rui Li, Yan Cao, Yuhan Zhang, Jiewei Lai, Yongsong Zhou, Hongyu Zheng, Xinru Gao, Cheng Yu, Liling Shi, Mengqin Yuan, Honglong Li, Xiaoqiong Huang, Chaoyu Chen, Jialin Zhang, Wenxiong Pan, Alejandro F. Frangi, Guangzhi He, Xin Yang, Yi Xiong, Linliang Yin, Xuedong Deng, Dong Ni · 2026-03-09 · 🤖 cs.AI

SurgFormer: Scalable Learning of Organ Deformation with Resection Support and Real-Time Inference

The paper introduces SurgFormer, a scalable multiresolution gated transformer that enables near real-time, high-fidelity soft tissue simulation on volumetric meshes by learning to predict nodewise displacements and handling topology-altering resections through a unified, XFEM-supervised framework.

Ashkan Shahbazi, Elaheh Akbari, Kyvia Pereira, Jon S. Heiselman, Annie C. Benson, Garrison L. H. Johnston, Jie Ying Wu, Nabil Simaan, Michael I. Miga, Soheil Kolouri · 2026-03-09 · 💻 cs

Modeling and Measuring Redundancy in Multisource Multimodal Data for Autonomous Driving

This paper investigates redundancy as a critical yet underexplored data quality factor in autonomous driving by modeling and measuring it across multisource and multimodal datasets, demonstrating that selectively removing redundant labels from overlapping camera views and image-LiDAR pairs can improve or maintain object detection performance, and advocating a data-centric approach to AV dataset optimization.

Yuhan Zhou, Mehri Sattari, Haihua Chen, Kewei Sha · 2026-03-09 · 💻 cs

EgoReasoner: Learning Egocentric 4D Reasoning via Task-Adaptive Structured Thinking

The paper introduces EgoReasoner, a two-stage framework that employs task-adaptive thinking templates and task-aware reinforcement learning to overcome the limitations of generic reasoning methods, enabling a compact 3B-parameter model to significantly outperform larger vision-language models on complex egocentric 4D reasoning tasks.

Fangrui Zhu, Yunfeng Xi, Jianmo Ni, Mu Cai, Boqing Gong, Long Zhao, Chen Qu, Ian Miao, Yi Li, Cheng Zhong, Huaizu Jiang, Shwetak Patel · 2026-03-09 · 💻 cs

SCOPE: Scene-Contextualized Incremental Few-Shot 3D Segmentation

SCOPE introduces a plug-and-play framework for incremental few-shot 3D segmentation that enriches novel class prototypes by retrieving and fusing high-confidence pseudo-instances from unlabelled background regions, thereby achieving state-of-the-art performance on ScanNet and S3DIS while mitigating catastrophic forgetting without retraining the backbone.

Vishal Thengane, Zhaochong An, Tianjin Huang, Son Lam Phung, Abdesselam Bouzerdoum, Lu Yin, Na Zhao, Xiatian Zhu · 2026-03-09 · 🤖 cs.LG
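The prototype mechanism underlying few-shot segmentation setups like SCOPE's can be sketched in a few lines: each class prototype is the mean of its labeled support features, and query points are assigned to the nearest prototype by cosine similarity. The clustered toy features, dimensions, and two-class setup below are illustrative assumptions; SCOPE's contribution (retrieving and fusing pseudo-instances from background regions) is not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(0)

def build_prototypes(support_feats, support_labels, num_classes):
    # One prototype per class: the mean of that class's support features.
    return np.stack([support_feats[support_labels == c].mean(axis=0)
                     for c in range(num_classes)])

def assign(query_feats, prototypes):
    # Label each query point by its nearest prototype under cosine similarity.
    q = query_feats / np.linalg.norm(query_feats, axis=1, keepdims=True)
    p = prototypes / np.linalg.norm(prototypes, axis=1, keepdims=True)
    return (q @ p.T).argmax(axis=1)

# Toy per-point features: two well-separated clusters standing in for classes.
feats = np.concatenate([rng.normal(0, 0.1, (50, 8)) + 1,
                        rng.normal(0, 0.1, (50, 8)) - 1])
labels = np.array([0] * 50 + [1] * 50)

protos = build_prototypes(feats, labels, 2)
pred = assign(feats, protos)
print(f"accuracy: {(pred == labels).mean():.2f}")
```

In an incremental few-shot setting, new class prototypes are appended without retraining the backbone, which is why prototype quality (the problem SCOPE targets) matters so much.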

Omni-Diffusion: Unified Multimodal Understanding and Generation with Masked Discrete Diffusion

Omni-Diffusion introduces the first any-to-any multimodal language model that unifies text, speech, and image understanding and generation by leveraging a novel mask-based discrete diffusion architecture, demonstrating performance comparable to or exceeding existing autoregressive multimodal systems.

Lijiang Li, Zuwei Long, Yunhang Shen, Heting Gao, Haoyu Cao, Xing Sun, Caifeng Shan, Ran He, Chaoyou Fu · 2026-03-09 · 💻 cs