HARP: HARmonizing in-vivo diffusion MRI using Phantom-only training

This paper introduces HARP, a deep learning framework that harmonizes multi-site in-vivo diffusion MRI data by training exclusively on easily transportable phantom scans, thereby eliminating the need for impractical multi-site human cohorts while significantly reducing inter-scanner variability.

Hwihun Jeong, Qiang Liu, Kathryn E. Keenan, Elisabeth A. Wilde, Walter Schneider, Sudhir Pathak, Anthony Zuccolotto, Lauren J. O'Donnell, Lipeng Ning, Yogesh Rathi · 2026-03-10 · 💻 cs

Thinking with Gaze: Sequential Eye-Tracking as Visual Reasoning Supervision for Medical VLMs

This paper introduces a method that enhances medical Vision-Language Models by using sequential eye-tracking data as supervision to train dedicated gaze tokens, enabling the models to mimic radiologists' visual search patterns and achieve state-of-the-art performance in both in-domain and out-of-domain medical reasoning tasks.

Yiwei Li, Zihao Wu, Yanjun Lv, Hanqi Jiang, Weihang You, Zhengliang Liu, Dajiang Zhu, Xiang Li, Quanzheng Li, Tianming Liu, Lin Zhao · 2026-03-10 · 💻 cs

Asymmetric Distillation and Information Retention in Capacity-Constrained Cross-Modal Transfer

This paper investigates the severe dimensional collapse, and the resulting fragility under noise, that occurs when a large Vision Transformer is distilled into capacity-constrained CNNs. It finds that larger student models pack information densely but lose noise immunity, whereas extremely small models behave as robust low-pass filters because of fundamental geometric limits of asymmetric cross-modal transfer.

Kabir Thayani · 2026-03-10 · 💻 cs
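The summary does not say how dimensional collapse is quantified. One common diagnostic, shown here as an assumption rather than the paper's own measurement, is the entropy-based effective rank of a feature matrix: it approaches the feature dimension when variance is spread evenly across directions and drops toward 1 as features collapse onto a few directions. The function name `effective_rank` is hypothetical.

```python
import numpy as np


def effective_rank(features: np.ndarray, eps: float = 1e-12) -> float:
    """Entropy-based effective rank of an (N, D) feature matrix.

    Near D: variance spread over all dimensions.
    Near 1: features collapsed onto (almost) a single direction.
    """
    # Center the features so the spectrum reflects variance, not the mean.
    centered = features - features.mean(axis=0, keepdims=True)
    singular_values = np.linalg.svd(centered, compute_uv=False)
    # Normalize the singular values into a probability distribution.
    p = singular_values / (singular_values.sum() + eps)
    p = p[p > eps]
    # exp(Shannon entropy) of the normalized spectrum.
    entropy = -(p * np.log(p)).sum()
    return float(np.exp(entropy))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    isotropic = rng.normal(size=(1000, 64))
    # Rank-1 signal plus tiny noise: a deliberately "collapsed" feature set.
    collapsed = rng.normal(size=(1000, 1)) @ rng.normal(size=(1, 64))
    collapsed += 1e-3 * rng.normal(size=(1000, 64))
    print(effective_rank(isotropic), effective_rank(collapsed))
```

Under this assumed metric, comparing penultimate-layer features of students of different widths on the same inputs would show collapse as a lower effective rank.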

SIQA: Toward Reliable Scientific Image Quality Assessment

This paper introduces the SIQA framework, which redefines scientific image quality assessment by distinguishing between perceptual alignment and scientific correctness, and demonstrates through a new benchmark that current multimodal models often achieve high scoring consistency with experts while lacking genuine scientific understanding.

Wenzhe Li, Liang Chen, Junying Wang, Yijing Guo, Ye Shen, Farong Wen, Chunyi Li, Zicheng Zhang, Guangtao Zhai · 2026-03-10 · 💻 cs

On the Generalization Capacities of MLLMs for Spatial Intelligence

This paper argues that RGB-only Multimodal Large Language Models fail to generalize across different cameras due to entangled perspective and object properties, and proposes a Camera-Aware MLLM framework that integrates camera intrinsics, augmented data, and 3D geometric priors to achieve robust, generalizable spatial intelligence.

Gongjie Zhang, Wenhao Li, Quanhao Qian, Jiuniu Wang, Deli Zhao, Shijian Lu, Ran Xu · 2026-03-10 · 🤖 cs.LG

UWPD: A General Paradigm for Invisible Watermark Detection Agnostic to Embedding Algorithms

This paper introduces Universal Watermark Presence Detection (UWPD), a new task of detecting invisible watermarks without prior knowledge of the embedding algorithm. The task is supported by the UniFreq-100K dataset and the Frequency Shield Network (FSNet), which achieves superior zero-shot detection by dynamically amplifying high-frequency watermark signals while suppressing semantic content.

Xiang Ao, Yiling Du, Zidan Wang, Mengru Chen · 2026-03-10 · 💻 cs
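To illustrate why high-frequency bands matter for this task, here is a naive handcrafted cue: the fraction of an image's spectral energy above a radial frequency cutoff. This is not FSNet and not a reliable detector on its own, since natural image content (edges, texture) also carries high-frequency energy, which is exactly why the summary stresses suppressing semantic content. The function name and the `cutoff` default are hypothetical choices.

```python
import numpy as np


def high_frequency_energy_ratio(image: np.ndarray, cutoff: float = 0.25) -> float:
    """Fraction of a grayscale image's spectral power above a radial cutoff.

    `cutoff` is in cycles/pixel; the largest radial frequency is about 0.707.
    """
    # Remove the DC component so a constant offset does not dominate.
    spectrum = np.fft.fftshift(np.fft.fft2(image - image.mean()))
    power = np.abs(spectrum) ** 2
    h, w = image.shape
    fy = np.fft.fftshift(np.fft.fftfreq(h))
    fx = np.fft.fftshift(np.fft.fftfreq(w))
    radius = np.sqrt(fy[:, None] ** 2 + fx[None, :] ** 2)
    total = power.sum()
    if total == 0:
        return 0.0
    return float(power[radius > cutoff].sum() / total)
```

A faint, high-frequency pattern added to a smooth image raises this ratio, but so would a sharp edge in the content itself, so a learned model is needed to separate the two.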

HERO: Hierarchical Embedding-Refinement for Open-Vocabulary Temporal Sentence Grounding in Videos

This paper introduces the Open-Vocabulary Temporal Sentence Grounding (OV-TSGV) task with new benchmarks (Charades-OV and ActivityNet-OV) and proposes HERO, a hierarchical embedding-refinement framework that achieves state-of-the-art performance by effectively generalizing to novel linguistic expressions through multi-level semantic modeling and cross-modal refinement.

Tingting Han, Xinsong Tao, Yufei Yin, Min Tan, Sicheng Zhao, Zhou Yu · 2026-03-10 · 💻 cs

Vessel-Aware Deep Learning for OCTA-Based Detection of AMD

This paper proposes a vessel-aware deep learning framework for detecting age-related macular degeneration (AMD) in OCTA images by integrating external multiplicative attention with clinically meaningful vascular biomarkers, specifically tortuosity and dropout maps, to guide the model toward physiologically relevant regions and improve interpretability.

Margalit G. Mitzner, Moinak Bhattacharya, Zhilin Zou, Chao Chen, Prateek Prasanna · 2026-03-10 · 💻 cs
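The summary names tortuosity maps as one of the vascular biomarkers but does not define how they are computed. A standard per-segment measure, assumed here for illustration, is the distance (arc-chord) tortuosity: the length of a vessel centerline divided by the straight-line distance between its endpoints. Extracting centerlines from OCTA images is out of scope, and the function name is hypothetical.

```python
import numpy as np


def distance_tortuosity(centerline: np.ndarray) -> float:
    """Arc-chord ratio of a vessel centerline given as an (N, 2) array of points.

    1.0 means a perfectly straight segment; larger values mean a more tortuous vessel.
    """
    if len(centerline) < 2:
        raise ValueError("need at least two points")
    # Arc length: sum of distances between consecutive centerline points.
    steps = np.diff(centerline, axis=0)
    arc_length = np.linalg.norm(steps, axis=1).sum()
    # Chord length: straight-line distance between the two endpoints.
    chord_length = np.linalg.norm(centerline[-1] - centerline[0])
    if chord_length == 0:
        raise ValueError("endpoints coincide; chord length is zero")
    return float(arc_length / chord_length)
```

A tortuosity map could then be formed by writing each segment's value back onto its pixels, though the paper may use a different definition.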

HiDE: Hierarchical Dictionary-Based Entropy Modeling for Learned Image Compression

This paper proposes HiDE, a hierarchical dictionary-based entropy modeling framework for learned image compression. It improves coding efficiency by decomposing external priors into global and local dictionaries with cascaded retrieval and by using a context-aware parameter estimator, achieving significant BD-rate savings over state-of-the-art methods.

Haoxuan Xiong, Yuanyuan Xu, Kun Zhu, Yiming Wang, Baoliu Ye · 2026-03-10 · 💻 cs
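BD-rate summarizes the average bitrate difference between two rate-distortion curves at equal quality. The sketch below follows the original Bjøntegaard formulation (a cubic fit of log-rate as a function of PSNR, integrated over the overlapping PSNR range). Some evaluation tools use piecewise-cubic interpolation instead, and the summary does not state which variant the paper uses. The function name is hypothetical.

```python
import numpy as np


def bd_rate(rate_anchor, psnr_anchor, rate_test, psnr_test) -> float:
    """Bjøntegaard delta rate in percent (negative means the test codec saves bits).

    Each codec needs at least four (rate, PSNR) points for the cubic fit.
    """
    log_r_anchor = np.log(np.asarray(rate_anchor, dtype=float))
    log_r_test = np.log(np.asarray(rate_test, dtype=float))
    psnr_anchor = np.asarray(psnr_anchor, dtype=float)
    psnr_test = np.asarray(psnr_test, dtype=float)

    # Fit log(rate) = f(PSNR) with a cubic polynomial for each codec.
    fit_anchor = np.polyfit(psnr_anchor, log_r_anchor, 3)
    fit_test = np.polyfit(psnr_test, log_r_test, 3)

    # Integrate only over the PSNR interval covered by both curves.
    lo = max(psnr_anchor.min(), psnr_test.min())
    hi = min(psnr_anchor.max(), psnr_test.max())
    if hi <= lo:
        raise ValueError("RD curves do not overlap in PSNR")

    anti_anchor = np.polyint(fit_anchor)
    anti_test = np.polyint(fit_test)
    int_anchor = np.polyval(anti_anchor, hi) - np.polyval(anti_anchor, lo)
    int_test = np.polyval(anti_test, hi) - np.polyval(anti_test, lo)

    # Average log-rate difference, converted back to a relative rate change.
    avg_log_diff = (int_test - int_anchor) / (hi - lo)
    return float((np.exp(avg_log_diff) - 1.0) * 100.0)
```

For example, a codec that reaches the same PSNR values at 90% of the anchor's bitrate everywhere yields a BD-rate of about -10%.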

Step-Level Visual Grounding Faithfulness Predicts Out-of-Distribution Generalization in Long-Horizon Vision-Language Models

This paper establishes that the quality of a model's step-level visual grounding, quantified by the Step Grounding Rate (SGR), serves as a robust and independent predictor of out-of-distribution generalization in long-horizon vision-language models, outperforming traditional final-answer accuracy metrics.

Md Ashikur Rahman, Md Arifur Rahman, Niamul Hassan Samin, Abdullah Ibne Hanif Arean, Juena Ahmed Noshin · 2026-03-10 · 💻 cs

MotionBits: Video Segmentation through Motion-Level Analysis of Rigid Bodies

This paper introduces MotionBits, a new concept paired with a learning-free segmentation method that identifies the smallest manipulable rigid bodies through kinematic spatial-twist equivalence. The method outperforms state-of-the-art embodied perception models on the new MoRiBo benchmark and enables more effective downstream robotic manipulation and reasoning.

Howard H. Qian, Kejia Ren, Yu Xiang, Vicente Ordonez, Kaiyu Hang · 2026-03-10 · 💻 cs
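The core kinematic fact behind twist equivalence is that all points of one rigid body share a single spatial twist (v, ω), so each point's velocity satisfies ṗ = v + ω × p. The summary does not describe how MotionBits estimates or groups twists; the sketch below only shows a generic consistency check, assuming 3D point positions and velocities are already available (for example from scene flow). The helper names are hypothetical.

```python
import numpy as np


def skew(p):
    """Matrix [p]x such that skew(p) @ w == np.cross(p, w)."""
    x, y, z = p
    return np.array([[0.0, -z, y], [z, 0.0, -x], [-y, x, 0.0]])


def fit_twist(points, velocities):
    """Least-squares spatial twist (v, w) with velocity_i ~= v + w x p_i.

    Since w x p = -[p]x w, each point contributes rows [I, -[p]x] @ [v; w].
    Returns (v, w, rms_residual_per_point).
    """
    rows = [np.hstack([np.eye(3), -skew(p)]) for p in points]
    A = np.vstack(rows)
    b = np.asarray(velocities, dtype=float).reshape(-1)
    x, *_ = np.linalg.lstsq(A, b, rcond=None)
    residual = np.linalg.norm(A @ x - b) / np.sqrt(len(points))
    return x[:3], x[3:], float(residual)


def same_rigid_body(points, velocities, tol=1e-6):
    """True if a single twist explains all point velocities within `tol`."""
    _, _, residual = fit_twist(points, velocities)
    return residual < tol
```

With noisy real-world velocities, `tol` would have to be set from the noise level rather than near machine precision.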

Active View Selection with Perturbed Gaussian Ensemble for Tomographic Reconstruction

This paper introduces Perturbed Gaussian Ensemble, an active view selection framework for sparse-view CT that leverages stochastic density scaling of uncertain Gaussian primitives to identify high-variance projections, thereby significantly improving reconstruction fidelity and reducing geometric artifacts compared to existing methods.

Yulun Wu, Ruyi Zha, Wei Cao, Yingying Li, Yuanhao Cai, Yaoyao Liu · 2026-03-10 · 💻 cs
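Per the summary, the ensemble is built by stochastically scaling the density of uncertain Gaussian primitives, and the next view is the projection with the highest variance. The sketch below covers only that generic selection step. It assumes the M perturbed ensemble members have already been rendered at V candidate angles. The array layout and function name are hypothetical, and averaging per-pixel variance into one score per view is one plausible choice among several.

```python
from collections.abc import Iterable

import numpy as np


def select_next_view(ensemble_projections: np.ndarray, acquired: Iterable[int] = ()) -> int:
    """Return the index of the candidate view where ensemble members disagree most.

    ensemble_projections: shape (M, V, H, W), i.e. M ensemble members'
    rendered projections at V candidate view angles.
    """
    # Per-pixel variance across ensemble members, averaged over each projection.
    per_view_uncertainty = ensemble_projections.var(axis=0).mean(axis=(1, 2))
    scores = per_view_uncertainty.copy()
    for v in acquired:
        scores[v] = -np.inf  # never re-select a view that was already acquired
    return int(np.argmax(scores))
```

In a full loop, the selected projection would be acquired, the Gaussians refit, the ensemble regenerated, and the selection repeated until the view budget is spent.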