NeighborMAE: Exploiting Spatial Dependencies between Neighboring Earth Observation Images in Masked Autoencoders Pretraining

NeighborMAE is a self-supervised learning framework that improves Earth Observation image representations by exploiting the spatial dependencies between neighboring images. It jointly reconstructs neighboring images and uses a dynamic heuristic strategy to set mask ratios and loss weighting, and it outperforms existing baselines across various downstream tasks.

Liang Zeng, Valerio Marsocci, Wufan Zhao + 2 more · 2026-03-04 · 💻 cs

On Discriminative vs. Generative classifiers: Rethinking MLLMs for Action Understanding

This paper proposes the Generation-Assisted Discriminative (GAD) classifier, a fine-tuning strategy that leverages the efficiency of discriminative classification while utilizing generative modeling to enhance performance, achieving state-of-the-art accuracy and significantly faster inference for closed-set action understanding in Multimodal Large Language Models.

Zhanzhong Pang, Dibyadip Chatterjee, Fadime Sener + 1 more · 2026-03-04 · 💻 cs

Generalizable Knowledge Distillation from Vision Foundation Models for Semantic Segmentation

This paper proposes Generalizable Knowledge Distillation (GKD), a multi-stage framework that decouples representation learning from task adaptation and employs a query-based soft distillation mechanism to effectively transfer robust, domain-agnostic knowledge from vision foundation models to semantic segmentation tasks, significantly improving out-of-domain generalization compared to conventional methods.

Chonghua Lv, Dong Zhao, Shuang Wang + 4 more · 2026-03-04 · 💻 cs

CAWM-Mamba: A unified model for infrared-visible image fusion and compound adverse weather restoration

The paper proposes CAWM-Mamba, a unified end-to-end framework that jointly performs infrared-visible image fusion and compound adverse weather restoration. It combines a Weather-Aware Preprocess Module, a Cross-modal Feature Interaction Module, and a Wavelet Space State Block, and it outperforms existing methods at handling multiple simultaneous degradations while improving downstream perception tasks.

Huichun Liu, Xiaosong Li, Zhuangfan Huang + 3 more · 2026-03-04 · 💻 cs

Maximizing Generalization: The Effect of Different Augmentation Techniques on Lightweight Vision Transformer for Bengali Character Classification

This study demonstrates that combining Random Affine and Color Jitter augmentation significantly improves the generalization and accuracy of the lightweight EfficientViT model for Bengali handwritten character recognition, achieving peak accuracies of 97.48% on the Ekush dataset and 97.57% on the AIBangla dataset.

Rafi Hassan Chowdhury, Naimul Haque, Kaniz Fatiha · 2026-03-04 · 💻 cs

Towards an Incremental Unified Multimodal Anomaly Detection: Augmenting Multimodal Denoising From an Information Bottleneck Perspective

This paper proposes IB-IUMAD, a novel incremental unified multimodal anomaly detection framework that mitigates catastrophic forgetting by leveraging a Mamba decoder to disentangle inter-object feature coupling and an information bottleneck module to filter redundant features, thereby preserving discriminative information across evolving categories.

Kaifang Long, Lianbo Ma, Jiaqi Liu + 2 more · 2026-03-04 · 💻 cs