AgilePruner: An Empirical Study of Attention and Diversity for Adaptive Visual Token Pruning in Large Vision-Language Models

This paper presents AgilePruner, an adaptive visual token pruning framework for Large Vision-Language Models that leverages empirical insights into the complementary strengths of attention-based and diversity-based methods to reduce computational overhead while mitigating hallucinations across varying image complexities.

Changwoo Baek, Jouwon Song, Sohyeon Kim + 1 more · 2026-03-03 · cs.LG
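The summary above names the two signals AgilePruner reportedly combines: attention (how much the language model looks at each visual token) and diversity (how redundant the kept tokens are). The paper's actual algorithm is not given here; the sketch below is only a generic illustration of blending the two signals, with greedy farthest-point-style selection as the diversity term. The function name, the linear blend, and `alpha` are all assumptions.

```python
import numpy as np

def prune_visual_tokens(tokens, attn_scores, keep, alpha=0.5):
    """Illustrative sketch (not the paper's method): greedily keep `keep`
    visual tokens, scoring each candidate by a blend of its attention
    weight and its diversity (distance to the nearest already-kept token)."""
    kept = [int(np.argmax(attn_scores))]  # seed with the most-attended token
    while len(kept) < keep:
        # diversity term: distance from each candidate to its nearest kept token
        dists = np.linalg.norm(tokens[:, None, :] - tokens[kept][None, :, :], axis=-1)
        diversity = dists.min(axis=1)
        score = alpha * attn_scores + (1 - alpha) * diversity / (diversity.max() + 1e-8)
        score[kept] = -np.inf  # never re-select a kept token
        kept.append(int(np.argmax(score)))
    return sorted(kept)
```

With `alpha=1.0` this degenerates to pure attention-based top-k; with `alpha=0.0` to pure diversity sampling, which matches the "complementary strengths" framing in the abstract.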

The MAMA-MIA Challenge: Advancing Generalizability and Fairness in Breast MRI Tumor Segmentation and Treatment Response Prediction

The MAMA-MIA Challenge establishes a large-scale, multi-institutional benchmark, trained on US data and tested on European data, to evaluate and improve the generalizability and fairness of AI models for breast MRI tumor segmentation and treatment response prediction across diverse demographic subgroups.

Lidia Garrucho, Smriti Joshi, Kaisar Kushibar + 43 more · 2026-03-03 · cs.AI

Certifiable Estimation with Factor Graphs

This paper presents a unified framework that synthesizes modular factor graph modeling with certifiable convex relaxation techniques by demonstrating that key mathematical transformations preserve factor graph structure, thereby enabling the use of existing, high-performance robotics libraries to implement globally optimal estimation without requiring specialized solver expertise.

Zhexin Xu, Nikolas R. Sanderson, Hanna Jiamei Zhang + 1 more · 2026-03-03 · cs

When Does RL Help Medical VLMs? Disentangling Vision, SFT, and RL Gains

This paper presents a controlled study demonstrating that reinforcement learning primarily sharpens output distributions and improves sampling efficiency in medical Vision-Language Models only after supervised fine-tuning has established non-trivial reasoning support, leading to a boundary-aware training recipe that achieves strong performance across medical benchmarks.

Ahmadreza Jeddi, Kimia Shaban, Negin Baghbanzadeh + 4 more · 2026-03-03 · cs
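The claim that RL "primarily sharpens output distributions and improves sampling efficiency" can be illustrated with a toy logit-scaling example: concentrating probability mass on answers the model already supports raises the chance of sampling them without adding new knowledge. This is only an illustration of the distributional effect described in the abstract, not the paper's training procedure; `softmax`, `entropy`, and the scaling factor `beta` are assumed names.

```python
import numpy as np

def softmax(logits):
    e = np.exp(logits - np.max(logits))
    return e / e.sum()

def entropy(p):
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

# Toy illustration: scaling logits by beta > 1 sharpens the distribution,
# boosting the probability of the already-preferred answer.
logits = np.array([2.0, 1.0, 0.5, 0.0])
base = softmax(logits)          # distribution after SFT (assumed)
sharp = softmax(4.0 * logits)   # sharpened distribution, beta = 4
```

The key point mirrored here is that sharpening only helps if the correct answer already carries non-trivial mass, matching the abstract's "non-trivial reasoning support" precondition.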

AG-VAS: Anchor-Guided Zero-Shot Visual Anomaly Segmentation with Large Multimodal Models

This paper presents AG-VAS, a novel zero-shot visual anomaly segmentation framework that leverages Large Multimodal Models enhanced with learnable semantic anchor tokens, a Semantic-Pixel Alignment Module, and a specialized instruction dataset to overcome limitations in abstract concept representation and achieve state-of-the-art localization performance across industrial and medical benchmarks.

Zhen Qu, Xian Tao, Xiaoyi Bao + 4 more · 2026-03-03 · cs.AI

Open-Vocabulary vs Supervised Learning Methods for Post-Disaster Visual Scene Understanding

This paper presents a comparative evaluation of supervised learning and open-vocabulary vision models for post-disaster scene understanding across multiple datasets, concluding that while foundation models offer flexibility, supervised training remains the most reliable approach for accurately detecting small objects and delineating boundaries in cluttered disaster scenes when annotations are available.

Anna Michailidou, Georgios Angelidis, Vasileios Argyriou + 2 more · 2026-03-03 · cs

MixerCSeg: An Efficient Mixer Architecture for Crack Segmentation via Decoupled Mamba Attention

MixerCSeg is an efficient, state-of-the-art architecture for crack segmentation that integrates CNN-like local texture analysis, Transformer-style global dependency modeling, and Mamba-inspired sequential context processing within a unified encoder, enhanced by specialized edge-aware and multi-scale fusion modules to achieve high performance with minimal computational cost.

Zilong Zhao, Zhengming Ding, Pei Niu + 2 more · 2026-03-03 · cs.AI

TIMI: Training-Free Image-to-3D Multi-Instance Generation with Spatial Fidelity

The paper proposes TIMI, a training-free framework that achieves high spatial fidelity in image-to-3D multi-instance generation by leveraging pre-trained model priors through an Instance-aware Separation Guidance module for disentanglement and a Spatial-stabilized Geometry-adaptive Update module for geometric preservation, outperforming existing methods without additional training overhead.

Xiao Cai, Lianli Gao, Pengpeng Zeng + 3 more · 2026-03-03 · cs

Unifying Language-Action Understanding and Generation for Autonomous Driving

This paper introduces LinkVLA, a novel Vision-Language-Action model for autonomous driving that unifies language and action tokens in a shared codebook, incorporates an auxiliary action understanding objective for bidirectional semantic alignment, and employs a coarse-to-fine generation strategy to significantly improve instruction following, driving performance, and inference efficiency.

Xinyang Wang, Qian Liu, Wenjie Ding + 7 more · 2026-03-03 · cs

Revisiting Global Token Mixing in Task-Dependent MRI Restoration: Insights from Minimal Gated CNN Baselines

This paper demonstrates that the effectiveness of global token mixing in MRI restoration is highly task-dependent, showing that while local gated CNNs suffice for reconstruction and super-resolution tasks constrained by physics or preserved low-frequency data, global models are superior for denoising tasks involving spatially heteroscedastic noise.

Xiangjian Hou, Chao Qin, Chang Ni + 3 more · 2026-03-03 · eess
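The local-vs-global token-mixing distinction at the heart of the abstract above can be shown in miniature: a local mixer lets each token see only a small neighbourhood, while a global mixer injects sequence-wide context everywhere. Both functions below are deliberate simplifications (a moving average with a sigmoid content gate, and a mean-pooled residual), assumed for illustration; they are not the paper's gated-CNN baseline or its global model.

```python
import numpy as np

def gated_local_mix(x, k=3):
    """Local mixing sketch: a k-wide moving average (CNN-like receptive
    field) modulated by a sigmoid content gate on the input."""
    pad = k // 2
    xp = np.pad(x, ((pad, pad), (0, 0)), mode="edge")
    local = np.stack([xp[i:i + x.shape[0]] for i in range(k)]).mean(axis=0)
    gate = 1.0 / (1.0 + np.exp(-x))  # sigmoid gate
    return local * gate

def global_mix(x):
    """Global mixing sketch: every token receives sequence-wide context,
    here via a mean-pooled residual."""
    return x + x.mean(axis=0, keepdims=True)
```

The abstract's finding maps onto this contrast: tasks whose structure is locally constrained are served by the first style of operator, while spatially heteroscedastic noise benefits from the second's whole-image statistics.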