Do Foundation Models Know Geometry? Probing Frozen Features for Continuous Physical Measurement

This paper demonstrates that frozen vision-language model features contain rich, continuous geometric information: linear probes on these features yield measurements 3.3x more accurate than the models' own text-based outputs, revealing that the accuracy bottleneck stems from training objectives and autoregressive generation rather than representational limitations, as evidenced by high-precision probes and consistent performance across diverse encoder architectures.

Yakov Pyotr Shkolnikov · 2026-03-09 · 🤖 cs.AI
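The linear-probe methodology at the heart of this result can be sketched generically. The feature dimension, ridge regularizer, and synthetic data below are illustrative assumptions, not the paper's setup:

```python
import numpy as np

# Illustrative sketch: a linear probe mapping frozen encoder features to a
# continuous physical quantity (e.g., a distance in meters). Features and
# targets here are synthetic stand-ins, not the paper's data.
rng = np.random.default_rng(0)

def fit_linear_probe(features, targets, reg=1e-3):
    """Closed-form ridge regression: w = (X^T X + reg*I)^-1 X^T y."""
    X = np.hstack([features, np.ones((features.shape[0], 1))])  # append bias column
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + reg * np.eye(d), X.T @ targets)

def predict(features, w):
    X = np.hstack([features, np.ones((features.shape[0], 1))])
    return X @ w

# Synthetic "frozen features" whose linear combination encodes a geometric value.
feats = rng.normal(size=(200, 16))
true_w = rng.normal(size=16)
depth = feats @ true_w + 0.01 * rng.normal(size=200)

w = fit_linear_probe(feats, depth)
err = np.abs(predict(feats, w) - depth).mean()
print(f"mean absolute probe error: {err:.4f}")
```

Because the probe is purely linear, any accuracy it achieves must already be present in the frozen features, which is the paper's central argument.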

Match4Annotate: Propagating Sparse Video Annotations via Implicit Neural Feature Matching

Match4Annotate is a lightweight framework that enables efficient, high-quality propagation of sparse point and mask annotations across and within video sequences by fitting test-time implicit neural representations to DINOv3 features, offering a scalable solution for annotation bottlenecks in specialized domains like medical imaging.

Zhuorui Zhang, Roger Pallarès-López, Praneeth Namburi, Brian W. Anthony · 2026-03-09 · 💻 cs
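The underlying matching step can be illustrated in a simplified form. Match4Annotate fits test-time implicit neural representations to DINOv3 features; the sketch below drops the implicit representation and uses random features on a toy grid purely to show how a sparse point annotation transfers by feature similarity:

```python
import numpy as np

# Simplified sketch: a point labeled in frame A is propagated to frame B by
# finding the location whose feature vector is most cosine-similar.
rng = np.random.default_rng(1)

H, W, D = 8, 8, 32
feats_a = rng.normal(size=(H, W, D))
# Frame B is frame A shifted by (1, 2) pixels (toroidal), plus small noise.
feats_b = np.roll(feats_a, shift=(1, 2), axis=(0, 1)) + 0.05 * rng.normal(size=(H, W, D))

def propagate_point(point, feats_src, feats_dst):
    """Transfer one annotated (row, col) point via cosine similarity."""
    q = feats_src[point]                                  # query feature at the annotation
    q = q / np.linalg.norm(q)
    flat = feats_dst.reshape(-1, feats_dst.shape[-1])
    flat = flat / np.linalg.norm(flat, axis=1, keepdims=True)
    best = int(np.argmax(flat @ q))
    return divmod(best, feats_dst.shape[1])               # flat index back to (row, col)

src = (3, 4)
dst = propagate_point(src, feats_a, feats_b)
print(f"annotation at {src} in frame A maps to {dst} in frame B")
```

The implicit neural representation in the actual method replaces this brute-force grid search with a continuous, fitted feature field.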

Self-Supervised Flow Matching for Scalable Multi-Modal Synthesis

This paper introduces Self-Flow, a self-supervised flow matching paradigm that utilizes a Dual-Timestep Scheduling mechanism to integrate representation learning directly into the generative framework, thereby eliminating the need for external models and achieving superior, scalable multi-modal synthesis across image, video, and audio.

Hila Chefer, Patrick Esser, Dominik Lorenz, Dustin Podell, Vikash Raja, Vinh Tong, Antonio Torralba, Robin Rombach · 2026-03-09 · ✓ Author reviewed · 💻 cs
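For readers unfamiliar with flow matching, the generic training target (not Self-Flow's specific dual-timestep objective) looks like this: sample a noise point and a data point, interpolate between them, and regress the constant velocity of that linear path:

```python
import numpy as np

# Generic flow-matching sketch: for noise x0 and data x1, the linear path
# x_t = (1 - t) * x0 + t * x1 has constant velocity v = x1 - x0, which a
# model would be trained to predict at sampled timesteps t.
rng = np.random.default_rng(2)

def flow_matching_pair(x0, x1, t):
    """Return the interpolated sample and its velocity regression target."""
    x_t = (1.0 - t) * x0 + t * x1
    v_target = x1 - x0
    return x_t, v_target

x0 = rng.normal(size=(4, 8))   # noise sample
x1 = rng.normal(size=(4, 8))   # data sample
t = 0.3
x_t, v = flow_matching_pair(x0, x1, t)

# Sanity check: integrating the constant velocity from t to 1 recovers x1.
recovered = x_t + (1.0 - t) * v
print(np.allclose(recovered, x1))
```

Self-Flow's contribution, per the summary, is scheduling two timesteps so that representation learning and generation share this one framework; that mechanism is not reproduced here.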

Artificial Intelligence for Detecting Fetal Orofacial Clefts and Advancing Medical Education

This paper presents an artificial intelligence system trained on over 45,000 ultrasound images that achieves diagnostic accuracy comparable to senior radiologists for fetal orofacial clefts, significantly enhances junior radiologists' performance when used as a copilot, and accelerates clinical expertise development for rare conditions.

Yuanji Zhang, Yuhao Huang, Haoran Dou, Xiliang Zhu, Chen Ling, Zhong Yang, Lianying Liang, Jiuping Li, Siying Liang, Rui Li, Yan Cao, Yuhan Zhang, Jiewei Lai, Yongsong Zhou, Hongyu Zheng, Xinru Gao, Cheng Yu, Liling Shi, Mengqin Yuan, Honglong Li, Xiaoqiong Huang, Chaoyu Chen, Jialin Zhang, Wenxiong Pan, Alejandro F. Frangi, Guangzhi He, Xin Yang, Yi Xiong, Linliang Yin, Xuedong Deng, Dong Ni · 2026-03-09 · 🤖 cs.AI

SurgFormer: Scalable Learning of Organ Deformation with Resection Support and Real-Time Inference

The paper introduces SurgFormer, a scalable multiresolution gated transformer that enables near real-time, high-fidelity soft-tissue simulation on volumetric meshes by learning to predict node-wise displacements and handling topology-altering resections through a unified, XFEM-supervised framework.

Ashkan Shahbazi, Elaheh Akbari, Kyvia Pereira, Jon S. Heiselman, Annie C. Benson, Garrison L. H. Johnston, Jie Ying Wu, Nabil Simaan, Michael I. Miga, Soheil Kolouri · 2026-03-09 · 💻 cs

Modeling and Measuring Redundancy in Multisource Multimodal Data for Autonomous Driving

This paper investigates redundancy as a critical yet underexplored data-quality factor in autonomous driving, modeling and measuring it across multisource and multimodal datasets; it demonstrates that selectively removing redundant labels from overlapping camera views and image-LiDAR pairs can maintain or even improve object detection performance, and argues for a data-centric approach to AV dataset optimization.

Yuhan Zhou, Mehri Sattari, Haihua Chen, Kewei Sha · 2026-03-09 · 💻 cs
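One simple way to see what "removing redundant labels from overlapping views" can mean in practice is IoU-based deduplication of bounding boxes. The threshold and box format below are illustrative choices, not the paper's exact procedure:

```python
# Toy sketch: two boxes from overlapping camera views that cover (nearly) the
# same object are treated as redundant labels and collapsed via IoU.
def iou(a, b):
    """Intersection-over-union of two boxes in (x1, y1, x2, y2) format."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def dedup_labels(boxes, thresh=0.7):
    """Greedily keep boxes whose IoU with every already-kept box is below thresh."""
    kept = []
    for box in boxes:
        if all(iou(box, k) < thresh for k in kept):
            kept.append(box)
    return kept

labels = [(0, 0, 10, 10), (0.5, 0.5, 10.5, 10.5), (20, 20, 30, 30)]
kept = dedup_labels(labels)
print(len(kept))  # the two heavily overlapping boxes collapse into one kept label
```

The paper's finding, per the summary, is that detection performance survives (or benefits from) this kind of pruning, which motivates measuring redundancy before collecting or labeling more data.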

EgoReasoner: Learning Egocentric 4D Reasoning via Task-Adaptive Structured Thinking

The paper introduces EgoReasoner, a two-stage framework that employs task-adaptive thinking templates and task-aware reinforcement learning to overcome the limitations of generic reasoning methods, enabling a compact 3B-parameter model to significantly outperform larger vision-language models on complex egocentric 4D reasoning tasks.

Fangrui Zhu, Yunfeng Xi, Jianmo Ni, Mu Cai, Boqing Gong, Long Zhao, Chen Qu, Ian Miao, Yi Li, Cheng Zhong, Huaizu Jiang, Shwetak Patel · 2026-03-09 · 💻 cs

SCOPE: Scene-Contextualized Incremental Few-Shot 3D Segmentation

SCOPE introduces a plug-and-play framework for incremental few-shot 3D segmentation that enriches novel class prototypes by retrieving and fusing high-confidence pseudo-instances from unlabelled background regions, thereby achieving state-of-the-art performance on ScanNet and S3DIS while mitigating catastrophic forgetting without retraining the backbone.

Vishal Thengane, Zhaochong An, Tianjin Huang, Son Lam Phung, Abdesselam Bouzerdoum, Lu Yin, Na Zhao, Xiatian Zhu · 2026-03-09 · 🤖 cs.LG
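The prototype mechanism that SCOPE builds on can be sketched in its basic form: class prototypes are mean support features, and query points take the label of the nearest prototype. SCOPE's actual contribution, enriching novel prototypes with pseudo-instances retrieved from background regions, is beyond this toy example:

```python
import numpy as np

# Toy sketch of prototype-based few-shot segmentation on synthetic point features.
rng = np.random.default_rng(3)

def build_prototypes(features, labels):
    """One prototype per class id: the mean of that class's support features."""
    return {c: features[labels == c].mean(axis=0) for c in np.unique(labels)}

def segment(query_feats, prototypes):
    """Assign each query point the class of its nearest prototype."""
    classes = sorted(prototypes)
    protos = np.stack([prototypes[c] for c in classes])              # (C, D)
    dists = np.linalg.norm(query_feats[:, None] - protos[None], axis=-1)
    return np.array(classes)[dists.argmin(axis=1)]

# Two well-separated synthetic classes of point features.
support = np.vstack([rng.normal(0, 0.1, (20, 16)), rng.normal(5, 0.1, (20, 16))])
support_y = np.array([0] * 20 + [1] * 20)
query = np.vstack([rng.normal(0, 0.1, (5, 16)), rng.normal(5, 0.1, (5, 16))])

protos = build_prototypes(support, support_y)
pred = segment(query, protos)
print(pred)
```

Because prototypes are computed without touching the backbone, new classes can be added incrementally, which is why SCOPE can avoid retraining and the catastrophic forgetting that retraining risks.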

Omni-Diffusion: Unified Multimodal Understanding and Generation with Masked Discrete Diffusion

Omni-Diffusion introduces the first any-to-any multimodal language model that unifies text, speech, and image understanding and generation by leveraging a novel masked discrete diffusion architecture, demonstrating performance comparable to or exceeding existing autoregressive multimodal systems.

Lijiang Li, Zuwei Long, Yunhang Shen, Heting Gao, Haoyu Cao, Xing Sun, Caifeng Shan, Ran He, Chaoyou Fu · 2026-03-09 · 💻 cs
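The generic forward process behind masked discrete diffusion (not Omni-Diffusion's exact formulation) is simple to state: at noise level t, each token is independently replaced by a mask id with probability t, and the model is trained to reconstruct the masked positions:

```python
import numpy as np

# Generic masked-discrete-diffusion corruption sketch. MASK_ID is an
# illustrative sentinel, not a real vocabulary entry from the paper.
rng = np.random.default_rng(4)

MASK_ID = -1

def mask_tokens(tokens, t, rng):
    """Replace each token with MASK_ID independently with probability t."""
    corrupted = tokens.copy()
    is_masked = rng.random(tokens.shape) < t
    corrupted[is_masked] = MASK_ID
    return corrupted, is_masked

tokens = np.arange(10)
corrupted, is_masked = mask_tokens(tokens, t=1.0, rng=rng)
print(corrupted)  # at t = 1.0 every position is masked
```

Because every position is predicted in parallel at each denoising step, this family of models sidesteps the token-by-token bottleneck of autoregressive generation, which is what makes the comparison in the summary interesting.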

Multimodal Large Language Models as Image Classifiers

This paper demonstrates that the perceived underperformance of Multimodal Large Language Models (MLLMs) in image classification is largely an artifact of flawed evaluation protocols and noisy ground truth rather than genuine model deficiencies, revealing that correcting these issues significantly narrows the performance gap with supervised models while highlighting the potential of MLLMs to assist in large-scale dataset curation.

Nikita Kisel, Illia Volkov, Klara Janouskova, Jiri Matas · 2026-03-09 · 💻 cs

Motion Illusions Generated Using Predictive Neural Networks Also Fool Humans

This paper introduces the Evolutionary Illusion GENerator (EIGen), a generative model based on video predictive neural networks that creates new visual motion illusions, which are confirmed to fool human participants, thereby supporting the hypothesis that such illusions arise from the brain's predictive processing rather than raw visual input and highlighting the value of studying "motivated failures" in AI research.

Lana Sinapayen, Eiji Watanabe · 2026-03-06 · 💻 cs