cs.CV papers | Gist.Science

NeighborMAE: Exploiting Spatial Dependencies between Neighboring Earth Observation Images in Masked Autoencoders Pretraining

NeighborMAE is a self-supervised learning framework that enhances Earth Observation image representation by leveraging the spatial dependencies between neighboring images through joint reconstruction and a dynamic heuristic strategy for mask ratios and loss weighting, resulting in superior performance across various downstream tasks compared to existing baselines.

Liang Zeng, Valerio Marsocci, Wufan Zhao + 2 more | 2026-03-04 | 💻 cs

EIMC: Efficient Instance-aware Multi-modal Collaborative Perception

EIMC is an efficient instance-aware multi-modal collaborative perception framework that adopts an early collaborative paradigm and a heatmap-driven consensus protocol to selectively transmit only critical instance vectors, thereby significantly reducing bandwidth usage while enhancing detection accuracy for occluded objects in autonomous driving.

Kang Yang, Peng Wang, Lantao Li + 4 more | 2026-03-04 | 💻 cs

Functional Properties of the Focal-Entropy

This paper provides a systematic information-theoretic analysis of the focal-entropy, establishing its mathematical properties and demonstrating how the focal-loss fundamentally alters probability distributions by amplifying mid-range probabilities while suppressing both high-probability and extremely low-probability outcomes in class-imbalanced learning.

Jaimin Shah, Martina Cardone, Alex Dytso | 2026-03-04 | 📊 stat
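
For orientation, the focal loss discussed in this entry is commonly written as FL(p) = -(1 - p)^gamma * log(p), where p is the probability assigned to the true class. The summary does not give the paper's focal-entropy definition, so the following is only a minimal sketch of the standard loss, assuming the usual form and an illustrative gamma = 2; the function names are hypothetical.

```python
import math


def cross_entropy(p: float) -> float:
    """Standard cross-entropy term for true-class probability p."""
    return -math.log(p)


def focal_loss(p: float, gamma: float = 2.0) -> float:
    """Commonly used focal loss: cross-entropy scaled by (1 - p) ** gamma.

    gamma = 2.0 is an illustrative default, not a value taken from the paper.
    """
    return (1.0 - p) ** gamma * cross_entropy(p)


if __name__ == "__main__":
    # Compare the two losses for a hard, a borderline, and an easy example.
    for p in (0.1, 0.5, 0.9):
        print(f"p={p:.1f}  CE={cross_entropy(p):.4f}  FL={focal_loss(p):.4f}")
```

With gamma = 0 the modulating factor equals 1 and the focal loss reduces to cross-entropy; larger gamma values shrink the contribution of confidently classified examples.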

ForestPersons: A Large-Scale Dataset for Under-Canopy Missing Person Detection

This paper introduces ForestPersons, a large-scale dataset of 96,482 under-canopy images with over 200,000 annotations designed to address the limitations of aerial UAV imagery in detecting missing persons during forest Search and Rescue missions.

Deokyun Kim, Jeongjun Lee, Jungwon Choi + 6 more | 2026-03-04 | 💻 cs

On Discriminative vs. Generative classifiers: Rethinking MLLMs for Action Understanding

This paper proposes the Generation-Assisted Discriminative (GAD) classifier, a fine-tuning strategy that leverages the efficiency of discriminative classification while utilizing generative modeling to enhance performance, achieving state-of-the-art accuracy and significantly faster inference for closed-set action understanding in Multimodal Large Language Models.

Zhanzhong Pang, Dibyadip Chatterjee, Fadime Sener + 1 more | 2026-03-04 | 💻 cs

SemGS: Feed-Forward Semantic 3D Gaussian Splatting from Sparse Views for Generalizable Scene Understanding

SemGS is a feed-forward framework that reconstructs generalizable semantic 3D fields from sparse views using a dual-branch architecture with shared CNN layers and camera-aware attention, enabling rapid, state-of-the-art semantic scene understanding and novel view synthesis without scene-specific optimization.

Sheng Ye, Zhen-Hui Dong, Ruoyu Fan + 2 more | 2026-03-04 | 💻 cs

Give me scissors: Collision-Free Dual-Arm Surgical Assistive Robot for Instrument Delivery

This paper presents a collision-free dual-arm surgical assistive robot that leverages a vision-language model for zero-shot instruction interpretation and a real-time quadratic programming framework to ensure safe, reactive obstacle avoidance while autonomously delivering instruments with an 83.33% success rate.

Xuejin Luo, Shiquan Sun, Runshi Zhang + 2 more | 2026-03-04 | 🤖 cs.LG

Generalizable Knowledge Distillation from Vision Foundation Models for Semantic Segmentation

This paper proposes Generalizable Knowledge Distillation (GKD), a multi-stage framework that decouples representation learning from task adaptation and employs a query-based soft distillation mechanism to effectively transfer robust, domain-agnostic knowledge from vision foundation models to semantic segmentation tasks, significantly improving out-of-domain generalization compared to conventional methods.

Chonghua Lv, Dong Zhao, Shuang Wang + 4 more | 2026-03-04 | 💻 cs

Through the Lens of Contrast: Self-Improving Visual Reasoning in VLMs

This paper proposes VC-STaR, a novel self-improving framework that leverages visual contrastive pairs to mitigate hallucinations in model-generated rationales, resulting in the VisCoR-55K dataset that significantly enhances the visual reasoning capabilities of Vision Language Models.

Zhiyu Pan, Yizheng Wu, Jiashen Hua + 5 more | 2026-03-04 | 💬 cs.CL

CAPT: Confusion-Aware Prompt Tuning for Reducing Vision-Language Misalignment

This paper proposes CAPT, a Confusion-Aware Prompt Tuning framework that mitigates vision-language misalignment by explicitly modeling persistent category confusion through a Confusion Bank and integrating semantic and sample-level cues via specialized miners and a multi-granularity expert to significantly reduce classification errors.

Maoyuan Shao, Yutong Gao, Xinyang Huang + 3 more | 2026-03-04 | 🤖 cs.AI

CAWM-Mamba: A unified model for infrared-visible image fusion and compound adverse weather restoration

The paper proposes CAWM-Mamba, a unified end-to-end framework that jointly performs infrared-visible image fusion and compound adverse weather restoration using a Weather-Aware Preprocess Module, Cross-modal Feature Interaction Module, and Wavelet Space State Block to outperform existing methods in handling multiple simultaneous degradations while enhancing downstream perception tasks.

Huichun Liu, Xiaosong Li, Zhuangfan Huang + 3 more | 2026-03-04 | 💻 cs

SOLAR: SVD-Optimized Lifelong Attention for Recommendation

The paper introduces SOLAR, a recommendation framework that employs SVD-Optimized Attention to achieve theoretically lossless, low-rank sequence modeling with reduced computational complexity, enabling efficient processing of ultra-long user behavior sequences and delivering significant performance gains in Kuaishou's online recommendation system.

Chenghao Zhang, Chao Feng, Yuanhao Pu + 8 more | 2026-03-04 | 🤖 cs.LG

ATD: Improved Transformer with Adaptive Token Dictionary for Image Restoration

This paper proposes ATD, a novel transformer-based architecture for image restoration that utilizes a learnable token dictionary and a token dictionary cross-attention mechanism to achieve global dependency modeling with linear complexity, thereby overcoming the performance and efficiency limitations of existing window-based methods.

Leheng Zhang, Wei Long, Yawei Li + 3 more | 2026-03-04 | 💻 cs

Neural Electromagnetic Fields for High-Resolution Material Parameter Reconstruction

This paper introduces NEMF, a novel framework that leverages high-fidelity geometry and ambient RF signals to solve the ill-posed physical inversion problem, enabling the non-invasive reconstruction of dense material parameters for creating functional, simulatable Digital Twins.

Zhe Chen, Peilin Zheng, Wenshuo Chen + 3 more | 2026-03-04 | ⚡ eess

Maximizing Generalization: The Effect of Different Augmentation Techniques on Lightweight Vision Transformer for Bengali Character Classification

This study demonstrates that combining Random Affine and Color Jitter augmentation techniques significantly enhances the generalization and accuracy of the lightweight EfficientViT model for Bengali handwritten character recognition on the Ekush and AIBangla datasets, achieving peak accuracies of 97.48% and 97.57% respectively.

Rafi Hassan Chowdhury, Naimul Haque, Kaniz Fatiha | 2026-03-04 | 💻 cs

Synthetic-Child: An AIGC-Based Synthetic Data Pipeline for Privacy-Preserving Child Posture Estimation

This paper introduces Synthetic-Child, an AIGC-based pipeline that generates 12,000 privacy-preserving synthetic images of children using 3D modeling and FLUX-1 diffusion to train a quantized RTMPose-M model, achieving 71.2 AP on real-world data and outperforming both adult-data baselines and commercial posture correctors in accuracy and speed for edge deployment.

Taowen Zeng | 2026-03-04 | 💻 cs

VLMFusionOcc3D: VLM Assisted Multi-Modal 3D Semantic Occupancy Prediction

VLMFusionOcc3D is a robust multimodal framework for autonomous driving that leverages Vision-Language Models to resolve semantic ambiguities and employs a weather-aware adaptive fusion mechanism to significantly improve 3D semantic occupancy prediction accuracy, particularly under adverse weather conditions.

A. Enes Doruk, Hasan F. Ates | 2026-03-04 | 💻 cs

Direct Reward Fine-Tuning on Poses for Single Image to 3D Human in the Wild

This paper introduces DrPose, a direct reward fine-tuning algorithm that leverages a novel dataset of pose-image pairs to optimize multi-view diffusion models for generating 3D humans with more natural and diverse poses from single images, eliminating the need for expensive 3D assets.

Seunguk Do, Minwoo Huh, Joonghyuk Shin + 1 more | 2026-03-04 | 💻 cs

Towards an Incremental Unified Multimodal Anomaly Detection: Augmenting Multimodal Denoising From an Information Bottleneck Perspective

This paper proposes IB-IUMAD, a novel incremental unified multimodal anomaly detection framework that mitigates catastrophic forgetting by leveraging a Mamba decoder to disentangle inter-object feature coupling and an information bottleneck module to filter redundant features, thereby preserving discriminative information across evolving categories.

Kaifang Long, Lianbo Ma, Jiaqi Liu + 2 more | 2026-03-04 | 💻 cs

SEP-YOLO: Fourier-Domain Feature Representation for Transparent Object Instance Segmentation

This paper introduces SEP-YOLO, a novel framework that combines frequency-domain detail enhancement and multi-scale spatial refinement to achieve state-of-the-art transparent object instance segmentation, while also providing high-quality annotations for the Trans10K dataset.

Fengming Zhang, Tao Yan, Jianchao Huang | 2026-03-04 | 💻 cs
