cs.CV papers | Gist.Science

SIQA: Toward Reliable Scientific Image Quality Assessment

This paper introduces the SIQA framework, which redefines scientific image quality assessment by distinguishing between perceptual alignment and scientific correctness, and demonstrates through a new benchmark that current multimodal models often achieve high scoring consistency with experts while lacking genuine scientific understanding.

Wenzhe Li, Liang Chen, Junying Wang, Yijing Guo, Ye Shen, Farong Wen, Chunyi Li, Zicheng Zhang, Guangtao Zhai | Tue, 10 Ma | 💻 cs

On the Generalization Capacities of MLLMs for Spatial Intelligence

This paper argues that RGB-only Multimodal Large Language Models fail to generalize across different cameras due to entangled perspective and object properties, and proposes a Camera-Aware MLLM framework that integrates camera intrinsics, augmented data, and 3D geometric priors to achieve robust, generalizable spatial intelligence.

Gongjie Zhang, Wenhao Li, Quanhao Qian, Jiuniu Wang, Deli Zhao, Shijian Lu, Ran Xu | Tue, 10 Ma | 🤖 cs.LG

Uncertainty-Aware Solar Flare Regression

This paper enhances the reliability of solar flare regression by applying conformal prediction to deep learning models, demonstrating that conformalized quantile regression outperforms alternative methods in achieving valid coverage rates and favorable interval lengths for space weather forecasting.

Jinsu Hong, Chetraj Pandey, Berkay Aydin | Tue, 10 Ma | 🔭 astro-ph

UWPD: A General Paradigm for Invisible Watermark Detection Agnostic to Embedding Algorithms

This paper introduces Universal Watermark Presence Detection (UWPD), a novel task for identifying invisible watermarks without prior algorithm knowledge, supported by the UniFreq-100K dataset and the Frequency Shield Network (FSNet) model that achieves superior zero-shot detection by dynamically amplifying high-frequency watermark signals while suppressing semantic content.

Xiang Ao, Yiling Du, Zidan Wang, Mengru Chen | Tue, 10 Ma | 💻 cs

HERO: Hierarchical Embedding-Refinement for Open-Vocabulary Temporal Sentence Grounding in Videos

This paper introduces the Open-Vocabulary Temporal Sentence Grounding (OV-TSGV) task with new benchmarks (Charades-OV and ActivityNet-OV) and proposes HERO, a hierarchical embedding-refinement framework that achieves state-of-the-art performance by effectively generalizing to novel linguistic expressions through multi-level semantic modeling and cross-modal refinement.

Tingting Han, Xinsong Tao, Yufei Yin, Min Tan, Sicheng Zhao, Zhou Yu | Tue, 10 Ma | 💻 cs

Vessel-Aware Deep Learning for OCTA-Based Detection of AMD

This paper proposes a vessel-aware deep learning framework for detecting age-related macular degeneration (AMD) in OCTA images by integrating external multiplicative attention with clinically meaningful vascular biomarkers, specifically tortuosity and dropout maps, to guide the model toward physiologically relevant regions and improve interpretability.

Margalit G. Mitzner, Moinak Bhattacharya, Zhilin Zou, Chao Chen, Prateek Prasanna | Tue, 10 Ma | 💻 cs

Heterogeneous Decentralized Diffusion Models

This paper introduces an efficient framework for heterogeneous decentralized diffusion models that enables experts to train with mixed objectives (DDPM and Flow Matching) and reduced resource requirements, achieving a 16x decrease in compute and 14x reduction in data compared to prior approaches while improving image quality and diversity.

Zhiying Jiang, Raihan Seraj, Marcos Villagra, Bidhan Roy | Tue, 10 Ma | 🤖 cs.LG

ButterflyViT: 354× Expert Compression for Edge Vision Transformers

ButterflyViT introduces a geometric parameterization method that treats Mixture of Experts as rotations of a shared quantized substrate, achieving a 354× memory reduction for Vision Transformers on edge devices while maintaining accuracy through spatial smoothness regularization.

Aryan Karmore | Tue, 10 Ma | 💻 cs

XMACNet: An Explainable Lightweight Attention based CNN with Multi Modal Fusion for Chili Disease Classification

This paper introduces XMACNet, an explainable, lightweight CNN that combines self-attention mechanisms with multi-modal fusion of RGB images and vegetation indices to achieve high-accuracy chili disease classification suitable for edge deployment.

Tapon Kumer Ray, Rajkumar Y, Shalini R, Srigayathri K, Jayashree S, Lokeswari P | Tue, 10 Ma | 💻 cs

EarthBridge: A Solution for 4th Multi-modal Aerial View Image Challenge Translation Track

This paper introduces EarthBridge, a high-fidelity cross-modal translation framework combining Diffusion Bridge Implicit Models and Contrastive Unpaired Translation to achieve second place in the 4th Multi-modal Aerial View Image Challenge by effectively translating between SAR, EO, and IR aerial imagery.

Zhenyuan Chen, Guanyuan Shen, Feng Zhang | Tue, 10 Ma | 💻 cs

A Hybrid Machine Learning Model for Cerebral Palsy Detection

This paper presents a hybrid machine learning model that combines VGG19, EfficientNet, and ResNet50 for feature extraction with a Bi-LSTM classifier, achieving 98.83% accuracy in the early detection of Cerebral Palsy from MRI images and outperforming several individual pre-trained models.

Karan Kumar Singh, Nikita Gajbhiye, Gouri Sankar Mishra | Tue, 10 Ma | 💻 cs

Step-Level Visual Grounding Faithfulness Predicts Out-of-Distribution Generalization in Long-Horizon Vision-Language Models

This paper establishes that the quality of a model's step-level visual grounding, quantified by the Step Grounding Rate (SGR), serves as a robust and independent predictor of out-of-distribution generalization in long-horizon vision-language models, outperforming traditional final-answer accuracy metrics.

Md Ashikur Rahman, Md Arifur Rahman, Niamul Hassan Samin, Abdullah Ibne Hanif Arean, Juena Ahmed Noshin | Tue, 10 Ma | 💻 cs

MotionBits: Video Segmentation through Motion-Level Analysis of Rigid Bodies

This paper introduces MotionBits, a novel concept and learning-free segmentation method that identifies the smallest manipulable rigid bodies through kinematic spatial twist equivalence, outperforming state-of-the-art embodied perception models on the new MoRiBo benchmark and enabling more effective downstream robotic manipulation and reasoning tasks.

Howard H. Qian, Kejia Ren, Yu Xiang, Vicente Ordonez, Kaiyu Hang | Tue, 10 Ma | 💻 cs

Active View Selection with Perturbed Gaussian Ensemble for Tomographic Reconstruction

This paper introduces Perturbed Gaussian Ensemble, an active view selection framework for sparse-view CT that leverages stochastic density scaling of uncertain Gaussian primitives to identify high-variance projections, thereby significantly improving reconstruction fidelity and reducing geometric artifacts compared to existing methods.

Yulun Wu, Ruyi Zha, Wei Cao, Yingying Li, Yuanhao Cai, Yaoyao Liu | Tue, 10 Ma | 💻 cs

An Extended Topological Model For High-Contrast Optical Flow

This paper introduces an extended 3-manifold topological model for high-contrast optical flow that addresses the limitations of previous torus-based approaches by showing that the most significant motion patches are concentrated near binary step-edge circles rather than on the torus, offering new insights into the topological and geometric structures underlying visual data inference.

Brad Turow, Jose A. Perea | Tue, 10 Ma | 🔢 math

ColonSplat: Reconstruction of Peristaltic Motion in Colonoscopy with Dynamic Gaussian Splatting

This paper introduces ColonSplat, a dynamic Gaussian Splatting framework that achieves superior 3D reconstruction of peristaltic colon motion by preserving global geometric consistency, supported by a new synthetic benchmark dataset called DynamicColon and a critical analysis of existing methods' limitations.

Weronika Smolak-Dyżewska, Joanna Kaleta, Diego Dall'Alba, Przemysław Spurek | Tue, 10 Ma | 💻 cs

IGLU: The Integrated Gaussian Linear Unit Activation Function

This paper introduces IGLU, a novel parametric activation function derived from a scale mixture of GELU gates that utilizes a Cauchy CDF to provide heavy-tailed gradient properties and robustness against vanishing gradients, alongside a computationally efficient rational approximation (IGLU-Approx) that achieves competitive or superior performance across vision and language tasks compared to standard baselines like ReLU and GELU.

Mingi Kang, Zai Yang, Jeova Farias Sales Rocha Neto | Tue, 10 Ma | 🤖 cs.LG

A prior information informed learning architecture for flying trajectory prediction

This paper proposes a hardware-efficient trajectory prediction framework that integrates environmental priors with a Dual-Transformer-Cascaded (DTC) architecture to accurately predict the landing points of flying objects, such as tennis balls, and that outperforms existing methods in complex real-world scenarios.

Xianda Huang, Zidong Han, Ruibo Jin, Zhenyu Wang, Wenyu Li, Xiaoyang Li, Yi Gong | Tue, 10 Ma | 💻 cs

PICS: Pairwise Image Compositing with Spatial Interactions

This paper introduces PICS, a self-supervised framework that improves pairwise image compositing by employing an Interaction Transformer with mask-guided Mixture-of-Experts and adaptive blending to explicitly model spatial interactions and preserve physical consistency between objects and backgrounds.

Hang Zhou, Xinxin Zuo, Sen Wang, Li Cheng | Tue, 10 Ma | 💻 cs

OPTED: Open Preprocessed Trachoma Eye Dataset Using Zero-Shot SAM 3 Segmentation

This paper introduces OPTED, an open-source preprocessed trachoma eye dataset derived from 2,832 images using a zero-shot SAM 3 pipeline to automatically extract and standardize regions of interest, thereby addressing the scarcity of high-quality data for automated trachoma classification in Sub-Saharan Africa.

Kibrom Gebremedhin, Hadush Hailu, Bruk Gebregziabher | Tue, 10 Ma | 💻 cs
