Can Unified Generation and Understanding Models Maintain Semantic Equivalence Across Different Output Modalities?

This paper introduces VGUBench to demonstrate that while Unified Multimodal Large Language Models exhibit strong textual reasoning and visual rendering capabilities individually, they fail to maintain semantic equivalence when required to generate visual answers, revealing a critical breakdown in cross-modal semantic alignment rather than a lack of generation fidelity.

Hongbo Jiang, Jie Li, Yunhang Shen + 4 more · 2026-03-02 · 💻 cs

StemVLA: An Open-Source Vision-Language-Action Model with Future 3D Spatial Geometry Knowledge and 4D Historical Representation

StemVLA is an open-source Vision-Language-Action model that improves robot manipulation on long-horizon tasks by explicitly integrating predicted future 3D spatial geometry with aggregated 4D historical spatiotemporal representations, strengthening spatial reasoning and decision-making in dynamic environments.
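The summary names the architecture only at a high level; as a loose illustration of conditioning an action decoder on auxiliary future-geometry and history token streams (not StemVLA's actual design; every class name and shape below is an assumption), a minimal PyTorch sketch:

```python
import torch
import torch.nn as nn

class ActionHeadWithAuxTokens(nn.Module):
    """Illustrative only: fuse vision-language tokens with predicted
    future-geometry tokens and historical tokens via shared
    self-attention, then decode an action chunk."""
    def __init__(self, dim: int = 512, action_dim: int = 7, horizon: int = 8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.fusion = nn.TransformerEncoder(layer, num_layers=2)
        self.to_actions = nn.Linear(dim, action_dim * horizon)
        self.horizon, self.action_dim = horizon, action_dim

    def forward(self, vl_tokens, future_geom_tokens, history_tokens):
        # One shared sequence lets attention mix the current observation,
        # predicted future 3D structure, and past spatiotemporal context.
        fused = self.fusion(torch.cat(
            [vl_tokens, future_geom_tokens, history_tokens], dim=1))
        pooled = fused.mean(dim=1)  # (B, dim)
        return self.to_actions(pooled).view(-1, self.horizon, self.action_dim)
```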

Jiasong Xiao, Yutao She, Kai Li + 3 more · 2026-03-02 · 💻 cs

VideoPulse: Neonatal heart rate and peripheral capillary oxygen saturation (SpO2) estimation from contact-free video

The paper introduces VideoPulse, a comprehensive dataset and end-to-end deep learning pipeline that together enable accurate, contact-free estimation of neonatal heart rate and SpO2 from facial video, offering a low-cost, non-invasive alternative to traditional adhesive monitoring methods in intensive care settings.
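The paper's pipeline is learned end to end; for intuition only, a classical remote-photoplethysmography (rPPG) baseline for the heart-rate half of the task band-passes the mean green-channel intensity of a face region and reads off the dominant spectral peak. A minimal sketch, not the paper's method (`green_means` is assumed to be a per-frame mean already extracted from the video):

```python
import numpy as np
from scipy.signal import butter, filtfilt

def estimate_hr_bpm(green_means: np.ndarray, fps: float) -> float:
    """Classical rPPG baseline: band-pass the per-frame mean green-channel
    signal and take the dominant FFT frequency as heart rate."""
    # Detrend, then keep a plausible neonatal HR band (1.5-4 Hz, i.e. 90-240 bpm).
    x = green_means - green_means.mean()
    b, a = butter(3, [1.5 / (fps / 2), 4.0 / (fps / 2)], btype="band")
    x = filtfilt(b, a, x)
    # Dominant spectral peak within the band, converted to beats per minute.
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fps)
    power = np.abs(np.fft.rfft(x)) ** 2
    band = (freqs >= 1.5) & (freqs <= 4.0)
    return float(freqs[band][np.argmax(power[band])] * 60.0)
```

SpO2 has no comparably simple single-channel baseline (classical pulse oximetry needs absorption ratios at two wavelengths), which is part of what motivates a learned pipeline.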

Deependra Dewagiri, Kamesh Anuradha, Pabadhi Liyanage + 6 more · 2026-03-02 · ⚡ eess

Breaking the Data Barrier: Robust Few-Shot 3D Vessel Segmentation Using Foundation Models

This paper proposes a novel few-shot 3D vessel segmentation framework that adapts the pre-trained DINOv3 foundation model with specialized 3D components to achieve superior performance and robustness in data-scarce and out-of-distribution clinical scenarios, significantly outperforming state-of-the-art methods such as nnU-Net while using only five training samples.
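The paper's specialized 3D components are not reproduced here; a minimal sketch of the general adaptation pattern (freeze a 2D foundation encoder standing in for DINOv3, extract per-slice features, train only a small 3D decoder), with all names and shapes assumed:

```python
import torch
import torch.nn as nn

class SliceWise3DSegHead(nn.Module):
    """Hypothetical sketch: reuse a frozen 2D foundation encoder for 3D
    volumes by stacking per-slice features along depth and training only
    a small 3D convolutional decoder. `encoder2d` stands in for a frozen
    DINOv3-style backbone mapping (B, 3, H, W) to (B, C, H', W')."""
    def __init__(self, encoder2d: nn.Module, feat_dim: int):
        super().__init__()
        self.encoder2d = encoder2d.eval()
        for p in self.encoder2d.parameters():
            p.requires_grad = False            # only the 3D head is trained
        self.head3d = nn.Sequential(
            nn.Conv3d(feat_dim, 64, 3, padding=1), nn.ReLU(),
            nn.Conv3d(64, 1, 1),               # binary vessel logit per voxel
        )

    def forward(self, volume):                 # volume: (B, D, 3, H, W)
        B, D = volume.shape[:2]
        with torch.no_grad():
            feats = self.encoder2d(volume.flatten(0, 1))       # (B*D, C, H', W')
        feats = feats.unflatten(0, (B, D)).permute(0, 2, 1, 3, 4)  # (B, C, D, H', W')
        return self.head3d(feats)
```

Freezing the backbone is what makes five training samples plausible: only the small 3D head's parameters are fit.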

Kirato Yoshihara, Yohei Sugawara, Yuta Tokuoka + 1 more · 2026-03-02 · ⚡ eess

See, Act, Adapt: Active Perception for Unsupervised Cross-Domain Visual Adaptation via Personalized VLM-Guided Agent

The paper proposes Sea², an unsupervised cross-domain adaptation framework that employs a VLM-guided agent to actively navigate and select optimal viewpoints for frozen perception models, significantly improving performance on tasks such as visual grounding, segmentation, and 3D box estimation without requiring downstream labels or model retraining.
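Conceptually, the active-perception loop amounts to scoring candidate viewpoints with a VLM and passing the best view to a frozen perception model; the sketch below is a hypothetical outline, with `vlm_score` standing in for any image-plus-text scoring call (not Sea²'s actual interface):

```python
def perceive_actively(candidate_views, vlm_score, perception_model, query):
    """Hypothetical sketch: ask a VLM how well each candidate viewpoint
    exposes the queried object, then run the frozen perception model on
    the best one. No downstream labels or retraining are involved."""
    best_view = max(candidate_views, key=lambda v: vlm_score(v.image, query))
    return perception_model(best_view.image)
```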

Tianci Tang, Tielong Cai, Hongwei Wang + 1 more · 2026-03-02 · 🤖 cs.AI

Revisiting Integration of Image and Metadata for DICOM Series Classification: Cross-Attention and Dictionary Learning

This paper proposes a robust end-to-end multimodal framework for DICOM series classification that combines bi-directional cross-attention with a sparse, missingness-aware dictionary-learning encoder. The design handles heterogeneous image content, variable series lengths, and incomplete metadata without requiring imputation, and outperforms existing baselines in both in-domain and out-of-domain settings.
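As an illustrative sketch of the bi-directional cross-attention idea (not the paper's exact encoder; dimensions and names are assumptions), image tokens can attend over metadata tokens and vice versa, with a key-padding mask letting missing metadata entries be ignored rather than imputed:

```python
import torch
import torch.nn as nn

class BiDirectionalCrossAttention(nn.Module):
    """Minimal sketch: image tokens query metadata tokens and vice
    versa; the two fused streams are pooled and concatenated."""
    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.img_to_meta = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.meta_to_img = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, img_tokens, meta_tokens, meta_missing_mask=None):
        # key_padding_mask (True = ignore) lets missing metadata fields be
        # skipped by attention instead of being imputed beforehand.
        img_fused, _ = self.img_to_meta(img_tokens, meta_tokens, meta_tokens,
                                        key_padding_mask=meta_missing_mask)
        meta_fused, _ = self.meta_to_img(meta_tokens, img_tokens, img_tokens)
        # Mean-pool each stream and concatenate into one joint embedding.
        return torch.cat([img_fused.mean(1), meta_fused.mean(1)], dim=-1)
```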

Tuan Truong, Melanie Dohmen, Sara Lorio + 1 more · 2026-03-02 · ⚡ eess