Culture in Action: Evaluating Text-to-Image Models through Social Activities

This paper introduces CULTIVate, a benchmark and evaluation framework for assessing the cultural faithfulness of text-to-image models in depicting social activities across 16 countries. It reveals significant performance disparities between the Global North and the Global South, and shows that its proposed metrics align more closely with human judgment than existing metrics.

Sina Malakouti, Boqing Gong, Adriana Kovashka · 2026-03-09 · cs

ExpReS-VLA: Specializing Vision-Language-Action Models Through Experience Replay and Retrieval

ExpReS-VLA is a specialized Vision-Language-Action model that enables rapid, memory-efficient on-device adaptation to specific robotic tasks by combining compressed experience replay, retrieval-augmented generation, and a novel contrastive loss to prevent catastrophic forgetting while significantly improving performance on both spatial and long-horizon benchmarks.

Shahram Najam Syed, Yatharth Ahuja, Arthur Jakobsson, Jeff Ichnowski · 2026-03-09 · cs
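The summary above combines compressed experience replay with retrieval. A minimal sketch of that idea, assuming a buffer of low-dimensional experience embeddings queried by cosine similarity (all class and method names here are illustrative assumptions, not the paper's actual code):

```python
import numpy as np

class CompressedReplayBuffer:
    """Illustrative sketch: store compressed experience embeddings and
    retrieve the most similar past experiences for a new observation."""

    def __init__(self, dim: int, capacity: int = 1000):
        self.dim = dim
        self.capacity = capacity
        self.embeddings = []  # unit-normalized compressed representations
        self.payloads = []    # associated actions / task metadata

    def add(self, embedding: np.ndarray, payload) -> None:
        if len(self.embeddings) >= self.capacity:
            # Drop the oldest experience when the buffer is full.
            self.embeddings.pop(0)
            self.payloads.pop(0)
        self.embeddings.append(embedding / np.linalg.norm(embedding))
        self.payloads.append(payload)

    def retrieve(self, query: np.ndarray, k: int = 3):
        # Cosine similarity reduces to a dot product on unit vectors.
        q = query / np.linalg.norm(query)
        sims = np.array([e @ q for e in self.embeddings])
        top = np.argsort(-sims)[:k]
        return [self.payloads[i] for i in top]
```

Replaying retrieved experiences alongside new task data is one standard way to mitigate catastrophic forgetting; the paper's contrastive loss and compression scheme are not modeled here.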

SPARK: Jailbreaking T2V Models by Synergistically Prompting Auditory and Recontextualized Knowledge

This paper introduces SPARK, a jailbreak framework that exploits cross-modal associations in text-to-video models by combining neutral scene anchors, latent auditory triggers, and stylistic modulators to generate semantically unsafe videos that bypass safety guardrails while maintaining a benign appearance.

Zonghao Ying, Moyang Chen, Nizhang Li, Zhiqiang Wang, Wenxin Zhang, Quanchen Zou, Zonglei Jing, Aishan Liu, Xianglong Liu · 2026-03-09 · cs

FunnyNodules: A Customizable Medical Dataset Tailored for Evaluating Explainable AI

The paper introduces FunnyNodules, a fully parameterized synthetic dataset of lung nodule-like shapes with controllable visual attributes and known decision rules, designed to systematically evaluate and benchmark explainable AI models by verifying whether they learn correct attribute-target relations and align their attention with relevant diagnostic features.

Luisa Gallée, Yiheng Xiong, Meinrad Beer, Michael Götz · 2026-03-09 · cs
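The core idea above, a dataset where every visual attribute is an explicit parameter and the label follows a rule known by construction, can be sketched as follows (attribute names, thresholds, and the rule are illustrative assumptions, not the dataset's actual specification):

```python
import random

def generate_nodule(seed=None):
    """Illustrative sketch of a fully parameterized synthetic sample:
    each visual attribute is a controllable parameter, and the target
    label follows a known ground-truth decision rule."""
    rng = random.Random(seed)
    attrs = {
        "spiculation": rng.uniform(0, 1),  # spikiness of the boundary
        "lobulation": rng.uniform(0, 1),   # degree of lobular bulges
        "roundness": rng.uniform(0, 1),    # circularity of the shape
    }
    # Known decision rule: "malignant" iff spiculation is high and
    # roundness is low. Because the rule is known by construction,
    # an XAI method's attributions can be checked against it directly.
    label = int(attrs["spiculation"] > 0.6 and attrs["roundness"] < 0.5)
    return attrs, label
```

With such a generator, one can verify whether a model's explanation assigns importance to `spiculation` and `roundness` (the attributes that actually determine the label) rather than to the irrelevant `lobulation`.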

EchoVLA: Synergistic Declarative Memory for VLA-Driven Mobile Manipulation

EchoVLA is a memory-enhanced Vision-Language-Action model for mobile manipulation that synergizes scene and episodic declarative memories to improve navigation and task performance, validated by the new MoMani benchmark and demonstrating significant gains over existing baselines in both simulation and real-world settings.

Min Lin, Xiwen Liang, Bingqian Lin, Liu Jingzhi, Zijian Jiao, Kehan Li, Yu Sun, Weijia Liufu, Yuhan Ma, Yuecheng Liu, Shen Zhao, Yuzheng Zhuang, Xiaodan Liang · 2026-03-09 · cs

SyncMV4D: Synchronized Multi-view Joint Diffusion of Appearance and Motion for Hand-Object Interaction Synthesis

SyncMV4D is a framework that addresses the limitations of single-view and data-hungry 3D methods for hand-object interaction synthesis. It introduces a Multi-view Joint Diffusion model and a Diffusion Points Aligner that, through a closed-loop coupling of visual appearance and dynamic geometry, jointly generate synchronized, realistic multi-view hand-object interaction videos and globally aligned 4D metric motions.

Lingwei Dang, Zonghan Li, Juntong Li, Hongwen Zhang, Liang An, Yebin Liu, Qingyao Wu · 2026-03-09 · cs