cs.CV papers | Gist.Science

Thinking with Gaze: Sequential Eye-Tracking as Visual Reasoning Supervision for Medical VLMs

This paper introduces a method that enhances medical Vision-Language Models by using sequential eye-tracking data as supervision to train dedicated gaze tokens, enabling the models to mimic radiologists' visual search patterns and achieve state-of-the-art performance in both in-domain and out-of-domain medical reasoning tasks.

Yiwei Li, Zihao Wu, Yanjun Lv, Hanqi Jiang, Weihang You, Zhengliang Liu, Dajiang Zhu, Xiang Li, Quanzheng Li, Tianming Liu, Lin Zhao | 2026-03-10 | 💻 cs

Asymmetric Distillation and Information Retention in Capacity-Constrained Cross-Modal Transfer

This paper investigates the severe dimensional collapse and resulting robustness fragility that occur when distilling a large Vision Transformer into capacity-constrained CNNs, revealing that larger student models pack information densely but lose noise immunity, whereas extremely small models act as robust low-pass filters due to fundamental geometric limitations in asymmetric cross-modal transfer.

Kabir Thayani | 2026-03-10 | 💻 cs

Multi-label Instance-level Generalised Visual Grounding in Agriculture

This paper introduces gRef-CW, the first benchmark dataset for generalised visual grounding in agriculture that includes negative expressions, and proposes Weed-VG, a modular framework designed to overcome the domain gap and effectively localise crop and weed instances under challenging field conditions.

Mohammadreza Haghighat, Alzayat Saleh, Mostafa Rahimi Azghadi | 2026-03-10 | 💻 cs

SIQA: Toward Reliable Scientific Image Quality Assessment

This paper introduces the SIQA framework, which redefines scientific image quality assessment by distinguishing between perceptual alignment and scientific correctness, and demonstrates through a new benchmark that current multimodal models often achieve high scoring consistency with experts while lacking genuine scientific understanding.

Wenzhe Li, Liang Chen, Junying Wang, Yijing Guo, Ye Shen, Farong Wen, Chunyi Li, Zicheng Zhang, Guangtao Zhai | 2026-03-10 | 💻 cs

On the Generalization Capacities of MLLMs for Spatial Intelligence

This paper argues that RGB-only Multimodal Large Language Models fail to generalize across different cameras due to entangled perspective and object properties, and proposes a Camera-Aware MLLM framework that integrates camera intrinsics, augmented data, and 3D geometric priors to achieve robust, generalizable spatial intelligence.

Gongjie Zhang, Wenhao Li, Quanhao Qian, Jiuniu Wang, Deli Zhao, Shijian Lu, Ran Xu | 2026-03-10 | 🤖 cs.LG

Uncertainty-Aware Solar Flare Regression

This paper enhances the reliability of solar flare regression by applying conformal prediction to deep learning models, demonstrating that conformalized quantile regression outperforms alternative methods in achieving valid coverage rates and favorable interval lengths for space weather forecasting.

Jinsu Hong, Chetraj Pandey, Berkay Aydin | 2026-03-10 | 🔭 astro-ph

UWPD: A General Paradigm for Invisible Watermark Detection Agnostic to Embedding Algorithms

This paper introduces Universal Watermark Presence Detection (UWPD), a novel task for identifying invisible watermarks without prior algorithm knowledge, supported by the UniFreq-100K dataset and the Frequency Shield Network (FSNet) model that achieves superior zero-shot detection by dynamically amplifying high-frequency watermark signals while suppressing semantic content.

Xiang Ao, Yiling Du, Zidan Wang, Mengru Chen | 2026-03-10 | 💻 cs

HERO: Hierarchical Embedding-Refinement for Open-Vocabulary Temporal Sentence Grounding in Videos

This paper introduces the Open-Vocabulary Temporal Sentence Grounding (OV-TSGV) task with new benchmarks (Charades-OV and ActivityNet-OV) and proposes HERO, a hierarchical embedding-refinement framework that achieves state-of-the-art performance by effectively generalizing to novel linguistic expressions through multi-level semantic modeling and cross-modal refinement.

Tingting Han, Xinsong Tao, Yufei Yin, Min Tan, Sicheng Zhao, Zhou Yu | 2026-03-10 | 💻 cs

Vessel-Aware Deep Learning for OCTA-Based Detection of AMD

This paper proposes a vessel-aware deep learning framework for detecting age-related macular degeneration (AMD) in OCTA images by integrating external multiplicative attention with clinically meaningful vascular biomarkers, specifically tortuosity and dropout maps, to guide the model toward physiologically relevant regions and improve interpretability.

Margalit G. Mitzner, Moinak Bhattacharya, Zhilin Zou, Chao Chen, Prateek Prasanna | 2026-03-10 | 💻 cs

Heterogeneous Decentralized Diffusion Models

This paper introduces an efficient framework for heterogeneous decentralized diffusion models that enables experts to train with mixed objectives (DDPM and Flow Matching) and reduced resource requirements, achieving a 16x decrease in compute and 14x reduction in data compared to prior approaches while improving image quality and diversity.

Zhiying Jiang, Raihan Seraj, Marcos Villagra, Bidhan Roy | 2026-03-10 | 🤖 cs.LG

ButterflyViT: 354× Expert Compression for Edge Vision Transformers

ButterflyViT introduces a geometric parameterization method that treats Mixture of Experts as rotations of a shared quantized substrate, achieving a 354× memory reduction for Vision Transformers on edge devices while maintaining accuracy through spatial smoothness regularization.

Aryan Karmore | 2026-03-10 | 💻 cs

XMACNet: An Explainable Lightweight Attention based CNN with Multi Modal Fusion for Chili Disease Classification

This paper introduces XMACNet, an explainable, lightweight CNN that combines self-attention mechanisms with multi-modal fusion of RGB images and vegetation indices to achieve high-accuracy chili disease classification suitable for edge deployment.

Tapon Kumer Ray, Rajkumar Y, Shalini R, Srigayathri K, Jayashree S, Lokeswari P | 2026-03-10 | 💻 cs

EarthBridge: A Solution for 4th Multi-modal Aerial View Image Challenge Translation Track

This paper introduces EarthBridge, a high-fidelity cross-modal translation framework combining Diffusion Bridge Implicit Models and Contrastive Unpaired Translation to achieve second place in the 4th Multi-modal Aerial View Image Challenge by effectively translating between SAR, EO, and IR aerial imagery.

Zhenyuan Chen, Guanyuan Shen, Feng Zhang | 2026-03-10 | 💻 cs

HiDE: Hierarchical Dictionary-Based Entropy Modeling for Learned Image Compression

The paper proposes HiDE, a hierarchical dictionary-based entropy modeling framework for learned image compression that enhances coding efficiency by decomposing external priors into global and local dictionaries with cascaded retrieval and employing a context-aware parameter estimator to achieve significant BD-rate savings over state-of-the-art methods.

Haoxuan Xiong, Yuanyuan Xu, Kun Zhu, Yiming Wang, Baoliu Ye | 2026-03-10 | 💻 cs

A Hybrid Machine Learning Model for Cerebral Palsy Detection

This paper presents a hybrid machine learning model that combines VGG19, EfficientNet, and ResNet50 for feature extraction with a Bi-LSTM classifier to achieve 98.83% accuracy in the early detection of Cerebral Palsy from MRI images, outperforming several individual pre-trained models.

Karan Kumar Singh, Nikita Gajbhiye, Gouri Sankar Mishra | 2026-03-10 | 💻 cs

Step-Level Visual Grounding Faithfulness Predicts Out-of-Distribution Generalization in Long-Horizon Vision-Language Models

This paper establishes that the quality of a model's step-level visual grounding, quantified by the Step Grounding Rate (SGR), serves as a robust and independent predictor of out-of-distribution generalization in long-horizon vision-language models, outperforming traditional final-answer accuracy metrics.

Md Ashikur Rahman, Md Arifur Rahman, Niamul Hassan Samin, Abdullah Ibne Hanif Arean, Juena Ahmed Noshin | 2026-03-10 | 💻 cs

MotionBits: Video Segmentation through Motion-Level Analysis of Rigid Bodies

This paper introduces MotionBits, a novel concept and learning-free segmentation method that identifies the smallest manipulable rigid bodies through kinematic spatial twist equivalence, outperforming state-of-the-art embodied perception models on the new MoRiBo benchmark and enabling more effective downstream robotic manipulation and reasoning tasks.

Howard H. Qian, Kejia Ren, Yu Xiang, Vicente Ordonez, Kaiyu Hang | 2026-03-10 | 💻 cs

Active View Selection with Perturbed Gaussian Ensemble for Tomographic Reconstruction

This paper introduces Perturbed Gaussian Ensemble, an active view selection framework for sparse-view CT that leverages stochastic density scaling of uncertain Gaussian primitives to identify high-variance projections, thereby significantly improving reconstruction fidelity and reducing geometric artifacts compared to existing methods.

Yulun Wu, Ruyi Zha, Wei Cao, Yingying Li, Yuanhao Cai, Yaoyao Liu | 2026-03-10 | 💻 cs

An Extended Topological Model For High-Contrast Optical Flow

This paper introduces an extended 3-manifold topological model for high-contrast optical flow that resolves the limitations of previous torus-based approaches by identifying that the most significant motion patches are concentrated near binary step-edge circles rather than the torus, thereby offering new insights into the topological and geometric structures underlying visual data inference.

Brad Turow, Jose A. Perea | 2026-03-10 | 🔢 math

ColonSplat: Reconstruction of Peristaltic Motion in Colonoscopy with Dynamic Gaussian Splatting

This paper introduces ColonSplat, a dynamic Gaussian Splatting framework that achieves superior 3D reconstruction of peristaltic colon motion by preserving global geometric consistency, supported by a new synthetic benchmark dataset called DynamicColon and a critical analysis of existing methods' limitations.

Weronika Smolak-Dyżewska, Joanna Kaleta, Diego Dall'Alba, Przemysław Spurek | 2026-03-10 | 💻 cs
