cs.CV papers | Gist.Science

SIQA: Toward Reliable Scientific Image Quality Assessment

This paper introduces the SIQA framework, which redefines scientific image quality assessment by distinguishing between perceptual alignment and scientific correctness, and demonstrates through a new benchmark that current multimodal models often achieve high scoring consistency with experts while lacking genuine scientific understanding.

Wenzhe Li, Liang Chen, Junying Wang, Yijing Guo, Ye Shen, Farong Wen, Chunyi Li, Zicheng Zhang, Guangtao Zhai | 2026-03-10 | 💻 cs

On the Generalization Capacities of MLLMs for Spatial Intelligence

This paper argues that RGB-only Multimodal Large Language Models fail to generalize across different cameras due to entangled perspective and object properties, and proposes a Camera-Aware MLLM framework that integrates camera intrinsics, augmented data, and 3D geometric priors to achieve robust, generalizable spatial intelligence.

Gongjie Zhang, Wenhao Li, Quanhao Qian, Jiuniu Wang, Deli Zhao, Shijian Lu, Ran Xu | 2026-03-10 | 🤖 cs.LG

Uncertainty-Aware Solar Flare Regression

This paper enhances the reliability of solar flare regression by applying conformal prediction to deep learning models, demonstrating that conformalized quantile regression outperforms alternative methods in achieving valid coverage rates and favorable interval lengths for space weather forecasting.

Jinsu Hong, Chetraj Pandey, Berkay Aydin | 2026-03-10 | 🔭 astro-ph
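
For readers unfamiliar with the technique named in the summary above, the following is a minimal sketch of the calibration step in split conformalized quantile regression. It is not the authors' implementation: the function names `cqr_margin` and `conformalize`, the toy numbers, and the assumption that lower and upper quantile predictions already exist for a held-out calibration set are all illustrative.

```python
import math


def cqr_margin(lo_preds, hi_preds, targets, alpha=0.1):
    """Return the conformal margin computed on a calibration set.

    lo_preds / hi_preds are hypothetical lower and upper quantile
    predictions from any quantile regressor; targets are the true values.
    """
    # Conformity score: signed distance of each target outside its
    # predicted interval (negative when the target lies inside).
    scores = sorted(
        max(lo - y, y - hi) for lo, hi, y in zip(lo_preds, hi_preds, targets)
    )
    n = len(scores)
    # Finite-sample corrected rank: ceil((n + 1) * (1 - alpha)).
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        # Too few calibration points for this alpha: the interval is unbounded.
        return math.inf
    return scores[rank - 1]


def conformalize(lo, hi, margin):
    """Widen (or shrink, if margin is negative) a predicted interval."""
    return lo - margin, hi + margin


# Toy calibration data (made up for illustration only).
margin = cqr_margin(
    lo_preds=[0.0, 0.0, 0.0, 0.0],
    hi_preds=[1.0, 1.0, 1.0, 1.0],
    targets=[0.5, 1.25, -0.125, 2.0],
    alpha=0.5,
)
interval = conformalize(0.0, 1.0, margin)
```

Under the usual exchangeability assumption, intervals adjusted this way cover the true value with probability at least 1 - alpha on new data, which is the "valid coverage" property the summary refers to.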

UWPD: A General Paradigm for Invisible Watermark Detection Agnostic to Embedding Algorithms

This paper introduces Universal Watermark Presence Detection (UWPD), a novel task for identifying invisible watermarks without prior algorithm knowledge, supported by the UniFreq-100K dataset and the Frequency Shield Network (FSNet) model that achieves superior zero-shot detection by dynamically amplifying high-frequency watermark signals while suppressing semantic content.

Xiang Ao, Yiling Du, Zidan Wang, Mengru Chen | 2026-03-10 | 💻 cs

HERO: Hierarchical Embedding-Refinement for Open-Vocabulary Temporal Sentence Grounding in Videos

This paper introduces the Open-Vocabulary Temporal Sentence Grounding (OV-TSGV) task with new benchmarks (Charades-OV and ActivityNet-OV) and proposes HERO, a hierarchical embedding-refinement framework that achieves state-of-the-art performance by effectively generalizing to novel linguistic expressions through multi-level semantic modeling and cross-modal refinement.

Tingting Han, Xinsong Tao, Yufei Yin, Min Tan, Sicheng Zhao, Zhou Yu | 2026-03-10 | 💻 cs

Vessel-Aware Deep Learning for OCTA-Based Detection of AMD

This paper proposes a vessel-aware deep learning framework for detecting age-related macular degeneration (AMD) in OCTA images by integrating external multiplicative attention with clinically meaningful vascular biomarkers, specifically tortuosity and dropout maps, to guide the model toward physiologically relevant regions and improve interpretability.

Margalit G. Mitzner, Moinak Bhattacharya, Zhilin Zou, Chao Chen, Prateek Prasanna | 2026-03-10 | 💻 cs

Heterogeneous Decentralized Diffusion Models

This paper introduces an efficient framework for heterogeneous decentralized diffusion models that enables experts to train with mixed objectives (DDPM and Flow Matching) and reduced resource requirements, achieving a 16x decrease in compute and 14x reduction in data compared to prior approaches while improving image quality and diversity.

Zhiying Jiang, Raihan Seraj, Marcos Villagra, Bidhan Roy | 2026-03-10 | 🤖 cs.LG

ButterflyViT: 354× Expert Compression for Edge Vision Transformers

ButterflyViT introduces a geometric parameterization method that treats Mixture of Experts as rotations of a shared quantized substrate, achieving a 354× memory reduction for Vision Transformers on edge devices while maintaining accuracy through spatial smoothness regularization.

Aryan Karmore | 2026-03-10 | 💻 cs

XMACNet: An Explainable Lightweight Attention based CNN with Multi Modal Fusion for Chili Disease Classification

This paper introduces XMACNet, an explainable, lightweight CNN that combines self-attention mechanisms with multi-modal fusion of RGB images and vegetation indices to achieve high-accuracy chili disease classification suitable for edge deployment.

Tapon Kumer Ray, Rajkumar Y, Shalini R, Srigayathri K, Jayashree S, Lokeswari P | 2026-03-10 | 💻 cs

EarthBridge: A Solution for 4th Multi-modal Aerial View Image Challenge Translation Track

This paper introduces EarthBridge, a high-fidelity cross-modal translation framework combining Diffusion Bridge Implicit Models and Contrastive Unpaired Translation to achieve second place in the 4th Multi-modal Aerial View Image Challenge by effectively translating between SAR, EO, and IR aerial imagery.

Zhenyuan Chen, Guanyuan Shen, Feng Zhang | 2026-03-10 | 💻 cs

HiDE: Hierarchical Dictionary-Based Entropy Modeling for Learned Image Compression

The paper proposes HiDE, a hierarchical dictionary-based entropy modeling framework for learned image compression that enhances coding efficiency by decomposing external priors into global and local dictionaries with cascaded retrieval and employing a context-aware parameter estimator to achieve significant BD-rate savings over state-of-the-art methods.

Haoxuan Xiong, Yuanyuan Xu, Kun Zhu, Yiming Wang, Baoliu Ye | 2026-03-10 | 💻 cs

A Hybrid Machine Learning Model for Cerebral Palsy Detection

This paper presents a hybrid machine learning model that combines VGG19, EfficientNet, and ResNet50 for feature extraction with a Bi-LSTM classifier to achieve 98.83% accuracy in the early detection of Cerebral Palsy from MRI images, outperforming several individual pre-trained models.

Karan Kumar Singh, Nikita Gajbhiye, Gouri Sankar Mishra | 2026-03-10 | 💻 cs

Step-Level Visual Grounding Faithfulness Predicts Out-of-Distribution Generalization in Long-Horizon Vision-Language Models

This paper establishes that the quality of a model's step-level visual grounding, quantified by the Step Grounding Rate (SGR), serves as a robust and independent predictor of out-of-distribution generalization in long-horizon vision-language models, outperforming traditional final-answer accuracy metrics.

Md Ashikur Rahman, Md Arifur Rahman, Niamul Hassan Samin, Abdullah Ibne Hanif Arean, Juena Ahmed Noshin | 2026-03-10 | 💻 cs

MotionBits: Video Segmentation through Motion-Level Analysis of Rigid Bodies

This paper introduces MotionBits, a novel concept and learning-free segmentation method that identifies the smallest manipulable rigid bodies through kinematic spatial twist equivalence, outperforming state-of-the-art embodied perception models on the new MoRiBo benchmark and enabling more effective downstream robotic manipulation and reasoning tasks.

Howard H. Qian, Kejia Ren, Yu Xiang, Vicente Ordonez, Kaiyu Hang | 2026-03-10 | 💻 cs

Active View Selection with Perturbed Gaussian Ensemble for Tomographic Reconstruction

This paper introduces Perturbed Gaussian Ensemble, an active view selection framework for sparse-view CT that leverages stochastic density scaling of uncertain Gaussian primitives to identify high-variance projections, thereby significantly improving reconstruction fidelity and reducing geometric artifacts compared to existing methods.

Yulun Wu, Ruyi Zha, Wei Cao, Yingying Li, Yuanhao Cai, Yaoyao Liu | 2026-03-10 | 💻 cs

An Extended Topological Model For High-Contrast Optical Flow

This paper introduces an extended 3-manifold topological model for high-contrast optical flow that resolves the limitations of previous torus-based approaches by identifying that the most significant motion patches are concentrated near binary step-edge circles rather than the torus, thereby offering new insights into the topological and geometric structures underlying visual data inference.

Brad Turow, Jose A. Perea | 2026-03-10 | 🔢 math

ColonSplat: Reconstruction of Peristaltic Motion in Colonoscopy with Dynamic Gaussian Splatting

This paper introduces ColonSplat, a dynamic Gaussian Splatting framework that achieves superior 3D reconstruction of peristaltic colon motion by preserving global geometric consistency, supported by a new synthetic benchmark dataset called DynamicColon and a critical analysis of existing methods' limitations.

Weronika Smolak-Dyżewska, Joanna Kaleta, Diego Dall'Alba, Przemysław Spurek | 2026-03-10 | 💻 cs

IGLU: The Integrated Gaussian Linear Unit Activation Function

This paper introduces IGLU, a novel parametric activation function derived from a scale mixture of GELU gates that utilizes a Cauchy CDF to provide heavy-tailed gradient properties and robustness against vanishing gradients, alongside a computationally efficient rational approximation (IGLU-Approx) that achieves competitive or superior performance across vision and language tasks compared to standard baselines like ReLU and GELU.

Mingi Kang, Zai Yang, Jeova Farias Sales Rocha Neto | 2026-03-10 | 🤖 cs.LG

A prior information informed learning architecture for flying trajectory prediction

This paper proposes a hardware-efficient trajectory prediction framework that integrates environmental priors with a Dual-Transformer-Cascaded (DTC) architecture to accurately predict the landing points of flying objects, such as tennis balls, and that outperforms existing methods in complex real-world scenarios.

Xianda Huang, Zidong Han, Ruibo Jin, Zhenyu Wang, Wenyu Li, Xiaoyang Li, Yi Gong | 2026-03-10 | 💻 cs

PICS: Pairwise Image Compositing with Spatial Interactions

The paper introduces PICS, a self-supervised framework that improves pairwise image compositing by employing an Interaction Transformer with mask-guided Mixture-of-Experts and adaptive blending to explicitly model spatial interactions and preserve physical consistency between objects and backgrounds.

Hang Zhou, Xinxin Zuo, Sen Wang, Li Cheng | 2026-03-10 | 💻 cs
