Towards an Incremental Unified Multimodal Anomaly Detection: Augmenting Multimodal Denoising From an Information Bottleneck Perspective

This paper proposes IB-IUMAD, a novel incremental unified multimodal anomaly detection framework that mitigates catastrophic forgetting by leveraging a Mamba decoder to disentangle inter-object feature coupling and an information bottleneck module to filter redundant features, thereby preserving discriminative information across evolving categories.

Kaifang Long, Lianbo Ma, Jiaqi Liu + 2 more · 2026-03-04 · 💻 cs

Evaluating Cross-Modal Reasoning Ability and Problem Characteristics with Multimodal Item Response Theory

This paper introduces M3IRT, a multimodal item response theory framework that decomposes model ability and item difficulty into image-only, text-only, and cross-modal components to filter out shortcut questions, thereby enabling more reliable and cost-effective evaluation of genuine cross-modal reasoning in Multimodal Large Language Models.

Shunki Uebayashi, Kento Masui, Kyohei Atarashi + 5 more · 2026-03-04 · 💬 cs.CL

Cross-view geo-localization, Image retrieval, Multiscale geometric modeling, Frequency domain enhancement

This paper proposes the Spatial and Frequency Domain Enhancement Network (SFDE), a lightweight three-branch architecture that leverages complementary spatial and frequency domain representations to effectively address geometric asymmetry and texture inconsistencies in cross-view geo-localization, achieving state-of-the-art performance through multiscale structural modeling and frequency invariance.

Hongying Zhang, ShuaiShuai Ma · 2026-03-04 · 💻 cs

Designing UNICORN: a Unified Benchmark for Imaging in Computational Pathology, Radiology, and Natural Language

The paper introduces UNICORN, a unified public benchmark featuring a standardized two-step evaluation framework and a novel aggregate metric to systematically assess the cross-modality and cross-task generalization of medical foundation models across diverse imaging and natural language data from multiple institutions.

Michelle Stegeman, Lena Philipp, Fennie van der Graaf + 19 more · 2026-03-04 · 💻 cs

Structure-Aware Text Recognition for Ancient Greek Critical Editions

This paper addresses the limitations of visual language models in recognizing the complex layouts of Ancient Greek critical editions by introducing a large-scale synthetic corpus and a real-world benchmark, demonstrating that while zero-shot performance lags behind traditional tools, fine-tuned models like Qwen3VL-8B can achieve state-of-the-art accuracy.

Nicolas Angleraud, Antonia Karamolegkou, Benoît Sagot + 1 more · 2026-03-04 · 💻 cs