E2E-GNet: An End-to-End Skeleton-based Geometric Deep Neural Network for Human Motion Recognition

The paper proposes E2E-GNet, an end-to-end geometric deep neural network that utilizes a geometric transformation layer and a distortion-aware optimization layer to effectively project skeleton motion sequences from non-Euclidean to linear space, thereby achieving superior human motion recognition performance with lower computational cost across multiple datasets.

Mubarak Olaoluwa, Hassen Drira · 2026-03-04 · 💻 cs

MUSE: A Run-Centric Platform for Multimodal Unified Safety Evaluation of Large Language Models

MUSE is an open-source, run-centric platform that addresses the gap in multimodal safety evaluation by integrating automatic cross-modal payload generation, multi-turn attack algorithms with inter-turn modality switching, and a dual-metric framework to demonstrate that alignment often fails to generalize across audio, image, and video inputs, revealing significantly higher attack success rates than single-turn text-based evaluations suggest.

Zhongxi Wang, Yueqian Lin, Jingyang Zhang + 2 more · 2026-03-04 · ⚡ eess

Biomechanically Accurate Gait Analysis: A 3D Human Reconstruction Framework for Markerless Estimation of Gait Parameters

This paper introduces a scalable, markerless 3D human reconstruction framework that extracts biomechanically meaningful markers from video to accurately estimate gait parameters, demonstrating strong agreement with reference marker-based data and outperforming conventional pose-estimation methods for clinical and real-world applications.

Akila Pemasiri, Ethan Goan, Glen Lichtwark + 3 more · 2026-03-04 · ⚡ eess

SGMA: Semantic-Guided Modality-Aware Segmentation for Remote Sensing with Incomplete Multimodal Data

This paper proposes the Semantic-Guided Modality-Aware (SGMA) framework, a novel approach for incomplete multimodal semantic segmentation in remote sensing that utilizes Semantic-Guided Fusion and Modality-Aware Sampling modules to effectively address multimodal imbalance, intra-class variation, and cross-modal heterogeneity, thereby outperforming state-of-the-art methods.

Lekang Wen, Liang Liao, Jing Xiao + 1 more · 2026-03-04 · 💻 cs

Beyond Anatomy: Explainable ASD Classification from rs-fMRI via Functional Parcellation and Graph Attention Networks

This paper demonstrates that replacing rigid anatomical parcellations with functionally derived regions of interest within a Graph Attention Network ensemble significantly enhances explainable Autism Spectrum Disorder classification accuracy on rs-fMRI data, achieving state-of-the-art performance while identifying biologically relevant Default Mode Network hubs.

Syeda Hareem Madani, Noureen Bibi, Adam Rafiq Jeraj + 3 more · 2026-03-04 · 💻 cs

NeighborMAE: Exploiting Spatial Dependencies between Neighboring Earth Observation Images in Masked Autoencoders Pretraining

NeighborMAE is a self-supervised learning framework that enhances Earth Observation image representation by leveraging the spatial dependencies between neighboring images through joint reconstruction and a dynamic heuristic strategy for mask ratios and loss weighting, resulting in superior performance across various downstream tasks compared to existing baselines.

Liang Zeng, Valerio Marsocci, Wufan Zhao + 2 more · 2026-03-04 · 💻 cs
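As background for the masked-autoencoder pretraining that NeighborMAE builds on, the core mechanic can be sketched generically: hide a fixed ratio of image patches from the encoder and score reconstruction only on the hidden ones. This is a minimal illustrative sketch of standard MAE masking and loss, not NeighborMAE's own neighbor-joint reconstruction or its dynamic mask-ratio heuristic; all function names here are hypothetical.

```python
import numpy as np

def random_mask(num_patches, mask_ratio, rng):
    """Randomly split patch indices into (masked, visible) sets."""
    n_masked = int(num_patches * mask_ratio)
    perm = rng.permutation(num_patches)
    return perm[:n_masked], perm[n_masked:]

def mae_reconstruction_loss(pred, target, masked_idx):
    """Mean squared error computed only on the masked patches,
    as in standard masked-autoencoder pretraining."""
    diff = pred[masked_idx] - target[masked_idx]
    return float(np.mean(diff ** 2))

rng = np.random.default_rng(0)
patches = rng.normal(size=(16, 8))       # 16 patches, 8-dim embeddings
masked, visible = random_mask(16, 0.75, rng)
pred = np.zeros_like(patches)            # stand-in for a decoder's output
loss = mae_reconstruction_loss(pred, patches, masked)
```

With a 0.75 mask ratio, 12 of the 16 patches are hidden; NeighborMAE's contribution, per the summary above, is to extend this objective across spatially adjacent Earth Observation tiles rather than a single image.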

On Discriminative vs. Generative Classifiers: Rethinking MLLMs for Action Understanding

This paper proposes the Generation-Assisted Discriminative (GAD) classifier, a fine-tuning strategy that leverages the efficiency of discriminative classification while utilizing generative modeling to enhance performance, achieving state-of-the-art accuracy and significantly faster inference for closed-set action understanding in Multimodal Large Language Models.

Zhanzhong Pang, Dibyadip Chatterjee, Fadime Sener + 1 more · 2026-03-04 · 💻 cs

Generalizable Knowledge Distillation from Vision Foundation Models for Semantic Segmentation

This paper proposes Generalizable Knowledge Distillation (GKD), a multi-stage framework that decouples representation learning from task adaptation and employs a query-based soft distillation mechanism to effectively transfer robust, domain-agnostic knowledge from vision foundation models to semantic segmentation tasks, significantly improving out-of-domain generalization compared to conventional methods.

Chonghua Lv, Dong Zhao, Shuang Wang + 4 more · 2026-03-04 · 💻 cs

CAWM-Mamba: A unified model for infrared-visible image fusion and compound adverse weather restoration

The paper proposes CAWM-Mamba, a unified end-to-end framework that jointly performs infrared-visible image fusion and compound adverse weather restoration using a Weather-Aware Preprocess Module, a Cross-modal Feature Interaction Module, and a Wavelet State Space Block, outperforming existing methods in handling multiple simultaneous degradations while enhancing downstream perception tasks.

Huichun Liu, Xiaosong Li, Zhuangfan Huang + 3 more · 2026-03-04 · 💻 cs