Do Foundation Models Know Geometry? Probing Frozen Features for Continuous Physical Measurement

This paper demonstrates that frozen vision-language model features contain rich, continuous geometric information: linear probes on these features yield measurements 3.3x more accurate than the models' own text-based outputs, revealing that the accuracy bottleneck stems from training objectives and autoregressive generation rather than representational limitations, as evidenced by high-precision probes and consistent performance across diverse encoder architectures.

Yakov Pyotr Shkolnikov · 2026-03-09 · 🤖 cs.AI
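The linear-probe methodology at the heart of this result can be sketched generically. The feature dimension, ridge regularizer, and synthetic data below are illustrative assumptions, not the paper's setup:

```python
import numpy as np

# Illustrative sketch: a linear probe mapping frozen encoder features to a
# continuous physical quantity (e.g., a distance in meters). Features and
# targets here are synthetic stand-ins, not the paper's data.
rng = np.random.default_rng(0)

def fit_linear_probe(features, targets, reg=1e-3):
    """Closed-form ridge regression: w = (X^T X + reg*I)^-1 X^T y."""
    X = np.hstack([features, np.ones((features.shape[0], 1))])  # append bias column
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + reg * np.eye(d), X.T @ targets)

def predict(features, w):
    X = np.hstack([features, np.ones((features.shape[0], 1))])
    return X @ w

# Synthetic "frozen features" whose linear combination encodes a geometric value.
feats = rng.normal(size=(200, 16))
true_w = rng.normal(size=16)
depth = feats @ true_w + 0.01 * rng.normal(size=200)

w = fit_linear_probe(feats, depth)
err = np.abs(predict(feats, w) - depth).mean()
print(f"mean absolute probe error: {err:.4f}")
```

Because the probe is purely linear, any accuracy it achieves must already be present in the frozen features, which is the paper's central argument.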

Match4Annotate: Propagating Sparse Video Annotations via Implicit Neural Feature Matching

Match4Annotate is a lightweight framework that enables efficient, high-quality propagation of sparse point and mask annotations across and within video sequences by fitting test-time implicit neural representations to DINOv3 features, offering a scalable solution for annotation bottlenecks in specialized domains like medical imaging.

Zhuorui Zhang, Roger Pallarès-López, Praneeth Namburi, Brian W. Anthony · 2026-03-09 · 💻 cs
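The underlying matching step can be illustrated in a simplified form. Match4Annotate fits test-time implicit neural representations to DINOv3 features; the sketch below drops the implicit representation and uses random features on a toy grid purely to show how a sparse point annotation transfers by feature similarity:

```python
import numpy as np

# Simplified sketch: a point labeled in frame A is propagated to frame B by
# finding the location whose feature vector is most cosine-similar.
rng = np.random.default_rng(1)

H, W, D = 8, 8, 32
feats_a = rng.normal(size=(H, W, D))
# Frame B is frame A shifted by (1, 2) pixels (toroidal), plus small noise.
feats_b = np.roll(feats_a, shift=(1, 2), axis=(0, 1)) + 0.05 * rng.normal(size=(H, W, D))

def propagate_point(point, feats_src, feats_dst):
    """Transfer one annotated (row, col) point via cosine similarity."""
    q = feats_src[point]                                  # query feature at the annotation
    q = q / np.linalg.norm(q)
    flat = feats_dst.reshape(-1, feats_dst.shape[-1])
    flat = flat / np.linalg.norm(flat, axis=1, keepdims=True)
    best = int(np.argmax(flat @ q))
    return divmod(best, feats_dst.shape[1])               # flat index back to (row, col)

src = (3, 4)
dst = propagate_point(src, feats_a, feats_b)
print(f"annotation at {src} in frame A maps to {dst} in frame B")
```

The implicit neural representation in the actual method replaces this brute-force grid search with a continuous, fitted feature field.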

Self-Supervised Flow Matching for Scalable Multi-Modal Synthesis

This paper introduces Self-Flow, a self-supervised flow matching paradigm that utilizes a Dual-Timestep Scheduling mechanism to integrate representation learning directly into the generative framework, thereby eliminating the need for external models and achieving superior, scalable multi-modal synthesis across image, video, and audio.

Hila Chefer, Patrick Esser, Dominik Lorenz, Dustin Podell, Vikash Raja, Vinh Tong, Antonio Torralba, Robin Rombach · 2026-03-09 · ✓ Author reviewed · 💻 cs
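For readers unfamiliar with flow matching, the generic training target (not Self-Flow's specific dual-timestep objective) looks like this: sample a noise point and a data point, interpolate between them, and regress the constant velocity of that linear path:

```python
import numpy as np

# Generic flow-matching sketch: for noise x0 and data x1, the linear path
# x_t = (1 - t) * x0 + t * x1 has constant velocity v = x1 - x0, which a
# model would be trained to predict at sampled timesteps t.
rng = np.random.default_rng(2)

def flow_matching_pair(x0, x1, t):
    """Return the interpolated sample and its velocity regression target."""
    x_t = (1.0 - t) * x0 + t * x1
    v_target = x1 - x0
    return x_t, v_target

x0 = rng.normal(size=(4, 8))   # noise sample
x1 = rng.normal(size=(4, 8))   # data sample
t = 0.3
x_t, v = flow_matching_pair(x0, x1, t)

# Sanity check: integrating the constant velocity from t to 1 recovers x1.
recovered = x_t + (1.0 - t) * v
print(np.allclose(recovered, x1))
```

Self-Flow's contribution, per the summary, is scheduling two timesteps so that representation learning and generation share this one framework; that mechanism is not reproduced here.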

Artificial Intelligence for Detecting Fetal Orofacial Clefts and Advancing Medical Education

This paper presents an artificial intelligence system trained on over 45,000 ultrasound images that achieves diagnostic accuracy comparable to senior radiologists for fetal orofacial clefts, significantly enhances junior radiologists' performance when used as a copilot, and accelerates clinical expertise development for rare conditions.

Yuanji Zhang, Yuhao Huang, Haoran Dou, Xiliang Zhu, Chen Ling, Zhong Yang, Lianying Liang, Jiuping Li, Siying Liang, Rui Li, Yan Cao, Yuhan Zhang, Jiewei Lai, Yongsong Zhou, Hongyu Zheng, Xinru Gao, Cheng Yu, Liling Shi, Mengqin Yuan, Honglong Li, Xiaoqiong Huang, Chaoyu Chen, Jialin Zhang, Wenxiong Pan, Alejandro F. Frangi, Guangzhi He, Xin Yang, Yi Xiong, Linliang Yin, Xuedong Deng, Dong Ni · 2026-03-09 · 🤖 cs.AI

SurgFormer: Scalable Learning of Organ Deformation with Resection Support and Real-Time Inference

The paper introduces SurgFormer, a scalable multiresolution gated transformer that enables near real-time, high-fidelity soft-tissue simulation on volumetric meshes by learning to predict node-wise displacements and handling topology-altering resections through a unified, XFEM-supervised framework.

Ashkan Shahbazi, Elaheh Akbari, Kyvia Pereira, Jon S. Heiselman, Annie C. Benson, Garrison L. H. Johnston, Jie Ying Wu, Nabil Simaan, Michael I. Miga, Soheil Kolouri · 2026-03-09 · 💻 cs

Modeling and Measuring Redundancy in Multisource Multimodal Data for Autonomous Driving

This paper investigates redundancy as a critical yet underexplored data-quality factor in autonomous driving, modeling and measuring it across multisource and multimodal datasets; it demonstrates that selectively removing redundant labels from overlapping camera views and image-LiDAR pairs can maintain or even improve object detection performance, and argues for a data-centric approach to AV dataset optimization.

Yuhan Zhou, Mehri Sattari, Haihua Chen, Kewei Sha · 2026-03-09 · 💻 cs
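One simple way to see what "removing redundant labels from overlapping views" can mean in practice is IoU-based deduplication of bounding boxes. The threshold and box format below are illustrative choices, not the paper's exact procedure:

```python
# Toy sketch: two boxes from overlapping camera views that cover (nearly) the
# same object are treated as redundant labels and collapsed via IoU.
def iou(a, b):
    """Intersection-over-union of two boxes in (x1, y1, x2, y2) format."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def dedup_labels(boxes, thresh=0.7):
    """Greedily keep boxes whose IoU with every already-kept box is below thresh."""
    kept = []
    for box in boxes:
        if all(iou(box, k) < thresh for k in kept):
            kept.append(box)
    return kept

labels = [(0, 0, 10, 10), (0.5, 0.5, 10.5, 10.5), (20, 20, 30, 30)]
kept = dedup_labels(labels)
print(len(kept))  # the two heavily overlapping boxes collapse into one kept label
```

The paper's finding, per the summary, is that detection performance survives (or benefits from) this kind of pruning, which motivates measuring redundancy before collecting or labeling more data.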

EgoReasoner: Learning Egocentric 4D Reasoning via Task-Adaptive Structured Thinking

The paper introduces EgoReasoner, a two-stage framework that employs task-adaptive thinking templates and task-aware reinforcement learning to overcome the limitations of generic reasoning methods, enabling a compact 3B-parameter model to significantly outperform larger vision-language models on complex egocentric 4D reasoning tasks.

Fangrui Zhu, Yunfeng Xi, Jianmo Ni, Mu Cai, Boqing Gong, Long Zhao, Chen Qu, Ian Miao, Yi Li, Cheng Zhong, Huaizu Jiang, Shwetak Patel · 2026-03-09 · 💻 cs

SCOPE: Scene-Contextualized Incremental Few-Shot 3D Segmentation

SCOPE introduces a plug-and-play framework for incremental few-shot 3D segmentation that enriches novel class prototypes by retrieving and fusing high-confidence pseudo-instances from unlabelled background regions, thereby achieving state-of-the-art performance on ScanNet and S3DIS while mitigating catastrophic forgetting without retraining the backbone.

Vishal Thengane, Zhaochong An, Tianjin Huang, Son Lam Phung, Abdesselam Bouzerdoum, Lu Yin, Na Zhao, Xiatian Zhu · 2026-03-09 · 🤖 cs.LG
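The prototype mechanism that SCOPE builds on can be sketched in its basic form: class prototypes are mean support features, and query points take the label of the nearest prototype. SCOPE's actual contribution, enriching novel prototypes with pseudo-instances retrieved from background regions, is beyond this toy example:

```python
import numpy as np

# Toy sketch of prototype-based few-shot segmentation on synthetic point features.
rng = np.random.default_rng(3)

def build_prototypes(features, labels):
    """One prototype per class id: the mean of that class's support features."""
    return {c: features[labels == c].mean(axis=0) for c in np.unique(labels)}

def segment(query_feats, prototypes):
    """Assign each query point the class of its nearest prototype."""
    classes = sorted(prototypes)
    protos = np.stack([prototypes[c] for c in classes])              # (C, D)
    dists = np.linalg.norm(query_feats[:, None] - protos[None], axis=-1)
    return np.array(classes)[dists.argmin(axis=1)]

# Two well-separated synthetic classes of point features.
support = np.vstack([rng.normal(0, 0.1, (20, 16)), rng.normal(5, 0.1, (20, 16))])
support_y = np.array([0] * 20 + [1] * 20)
query = np.vstack([rng.normal(0, 0.1, (5, 16)), rng.normal(5, 0.1, (5, 16))])

protos = build_prototypes(support, support_y)
pred = segment(query, protos)
print(pred)
```

Because prototypes are computed without touching the backbone, new classes can be added incrementally, which is why SCOPE can avoid retraining and the catastrophic forgetting that retraining risks.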

Omni-Diffusion: Unified Multimodal Understanding and Generation with Masked Discrete Diffusion

Omni-Diffusion introduces the first any-to-any multimodal language model that unifies text, speech, and image understanding and generation by leveraging a novel masked discrete diffusion architecture, demonstrating performance comparable to or exceeding existing autoregressive multimodal systems.

Lijiang Li, Zuwei Long, Yunhang Shen, Heting Gao, Haoyu Cao, Xing Sun, Caifeng Shan, Ran He, Chaoyou Fu · 2026-03-09 · 💻 cs
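The generic forward process behind masked discrete diffusion (not Omni-Diffusion's exact formulation) is simple to state: at noise level t, each token is independently replaced by a mask id with probability t, and the model is trained to reconstruct the masked positions:

```python
import numpy as np

# Generic masked-discrete-diffusion corruption sketch. MASK_ID is an
# illustrative sentinel, not a real vocabulary entry from the paper.
rng = np.random.default_rng(4)

MASK_ID = -1

def mask_tokens(tokens, t, rng):
    """Replace each token with MASK_ID independently with probability t."""
    corrupted = tokens.copy()
    is_masked = rng.random(tokens.shape) < t
    corrupted[is_masked] = MASK_ID
    return corrupted, is_masked

tokens = np.arange(10)
corrupted, is_masked = mask_tokens(tokens, t=1.0, rng=rng)
print(corrupted)  # at t = 1.0 every position is masked
```

Because every position is predicted in parallel at each denoising step, this family of models sidesteps the token-by-token bottleneck of autoregressive generation, which is what makes the comparison in the summary interesting.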

Multimodal Large Language Models as Image Classifiers

This paper demonstrates that the perceived underperformance of Multimodal Large Language Models (MLLMs) in image classification is largely an artifact of flawed evaluation protocols and noisy ground truth rather than genuine model deficiencies, revealing that correcting these issues significantly narrows the performance gap with supervised models while highlighting the potential of MLLMs to assist in large-scale dataset curation.

Nikita Kisel, Illia Volkov, Klara Janouskova, Jiri Matas · 2026-03-09 · 💻 cs

Motion Illusions Generated Using Predictive Neural Networks Also Fool Humans

This paper introduces the Evolutionary Illusion GENerator (EIGen), a generative model based on video predictive neural networks that creates new visual motion illusions, which are confirmed to fool human participants, thereby supporting the hypothesis that such illusions arise from the brain's predictive processing rather than raw visual input and highlighting the value of studying "motivated failures" in AI research.

Lana Sinapayen, Eiji Watanabe · 2026-03-06 · 💻 cs