Open-Vocabulary vs Supervised Learning Methods for Post-Disaster Visual Scene Understanding

This paper presents a comparative evaluation of supervised learning and open-vocabulary vision models for post-disaster scene understanding across multiple datasets, concluding that while foundation models offer flexibility, supervised training remains the most reliable approach for accurately detecting small objects and delineating boundaries in cluttered disaster scenes when annotations are available.

Anna Michailidou, Georgios Angelidis, Vasileios Argyriou + 2 more | 2026-03-03 | 💻 cs

MixerCSeg: An Efficient Mixer Architecture for Crack Segmentation via Decoupled Mamba Attention

MixerCSeg is an efficient architecture for crack segmentation that integrates CNN-like local texture analysis, Transformer-style global dependency modeling, and Mamba-inspired sequential context processing within a unified encoder. Specialized edge-aware and multi-scale fusion modules enhance the encoder, achieving state-of-the-art performance with minimal computational cost.

Zilong Zhao, Zhengming Ding, Pei Niu + 2 more | 2026-03-03 | 🤖 cs.AI

TIMI: Training-Free Image-to-3D Multi-Instance Generation with Spatial Fidelity

The paper proposes TIMI, a training-free framework that achieves high spatial fidelity in image-to-3D multi-instance generation by leveraging pre-trained model priors through an Instance-aware Separation Guidance module for disentanglement and a Spatial-stabilized Geometry-adaptive Update module for geometric preservation, outperforming existing methods without additional training overhead.

Xiao Cai, Lianli Gao, Pengpeng Zeng + 3 more | 2026-03-03 | 💻 cs

Unifying Language-Action Understanding and Generation for Autonomous Driving

This paper introduces LinkVLA, a novel Vision-Language-Action model for autonomous driving that unifies language and action tokens in a shared codebook, incorporates an auxiliary action understanding objective for bidirectional semantic alignment, and employs a coarse-to-fine generation strategy to significantly improve instruction following, driving performance, and inference efficiency.

Xinyang Wang, Qian Liu, Wenjie Ding + 7 more | 2026-03-03 | 💻 cs

Revisiting Global Token Mixing in Task-Dependent MRI Restoration: Insights from Minimal Gated CNN Baselines

This paper demonstrates that the effectiveness of global token mixing in MRI restoration is highly task-dependent, showing that while local gated CNNs suffice for reconstruction and super-resolution tasks constrained by physics or preserved low-frequency data, global models are superior for denoising tasks involving spatially heteroscedastic noise.

Xiangjian Hou, Chao Qin, Chang Ni + 3 more | 2026-03-03 | ⚡ eess

From Verbatim to Gist: Distilling Pyramidal Multimodal Memory via Semantic Information Bottleneck for Long-Horizon Video Agents

This paper introduces MM-Mem, a cognition-inspired pyramidal multimodal memory architecture that leverages Fuzzy-Trace Theory and a Semantic Information Bottleneck to progressively distill verbatim visual details into abstract semantic schemas, thereby enabling efficient long-horizon video understanding through hierarchical storage and entropy-driven retrieval.

Niu Lian, Yuting Wang, Hanshu Yao + 5 more | 2026-03-03 | 💬 cs.CL

UltraStar: Semantic-Aware Star Graph Modeling for Echocardiography Navigation

To address the limitations of existing sequential models in handling noisy echocardiography probe trajectories, this paper proposes UltraStar, a semantic-aware star graph framework that reformulates navigation as anchor-based global localization by connecting the current view directly to representative historical keyframes, thereby achieving robust performance and better scalability on large-scale datasets.

Teng Wang, Haojun Jiang, Chenxi Li + 6 more | 2026-03-03 | 💻 cs

SCATR: Mitigating New Instance Suppression in LiDAR-based Tracking-by-Attention via Second Chance Assignment and Track Query Dropout

This paper presents SCATR, a novel LiDAR-based tracking-by-attention framework that mitigates new instance suppression and bridges the performance gap with detection-based methods through two architecture-agnostic training strategies: Second Chance Assignment and Track Query Dropout, achieving state-of-the-art results on the nuScenes benchmark.

Brian Cheong, Letian Wang, Sandro Papais + 1 more | 2026-03-03 | 💻 cs

ATA: Bridging Implicit Reasoning with Attention-Guided and Action-Guided Inference for Vision-Language Action Models

The paper proposes ATA, a novel training-free, plug-and-play framework that enhances Vision-Language-Action models by introducing implicit reasoning through complementary attention-guided and action-guided strategies, thereby improving task success and robustness without the need for additional annotations or retraining.

Cheng Yang, Jianhao Jiao, Lingyi Huang + 8 more | 2026-03-03 | 🤖 cs.AI

Rate-Distortion Signatures of Generalization and Information Trade-offs

This paper introduces a rate-distortion-theoretic framework that characterizes the generalization trade-offs of human and machine vision systems using geometric signatures of slope and curvature. It reveals that while both follow a common lossy-compression principle, humans exhibit smoother and more flexible trade-offs than the steeper, more brittle regimes of modern deep networks.

Leyla Roksan Caglar, Pedro A. M. Mediano, Baihan Lin | 2026-03-03 | 🧬 q-bio