AutoViVQA: A Large-Scale Automatically Constructed Dataset for Vietnamese Visual Question Answering

This paper introduces AutoViVQA, a large-scale automatically constructed dataset for Vietnamese Visual Question Answering, and evaluates transformer-based multimodal models alongside various automatic metrics to assess their performance and alignment with human judgment in the Vietnamese context.

Nguyen Anh Tuong, Phan Ba Duc, Nguyen Trung Quoc, Tran Dac Thinh, Dang Duy Lan, Nguyen Quoc Thinh, Tung Le · 2026-03-11 · cs.AI

ESAinsTOD: A Unified End-to-End Schema-Aware Instruction-Tuning Framework for Task-Oriented Dialog Modeling

The paper proposes ESAinsTOD, a unified end-to-end schema-aware instruction-tuning framework that leverages full-parameter LLM fine-tuning with instruction and schema alignment mechanisms to achieve superior performance, generalization in low-resource settings, and robustness against noise across diverse task-oriented dialog benchmarks.

Dechuan Teng, Chunlin Lu, Libo Qin, Wanxiang Che · 2026-03-11 · cs.AI

ActiveUltraFeedback: Efficient Preference Data Generation using Active Learning

The paper introduces ActiveUltraFeedback, an efficient active learning pipeline that leverages uncertainty estimates and novel selection strategies like Double Reverse Thompson Sampling to generate high-quality preference data, enabling Large Language Models to achieve superior alignment performance with as little as one-sixth of the annotated data required by static baselines.

Davit Melikidze, Marian Schneider, Jessica Lam, Martin Wertich, Ido Hakimi, Barna Pásztor, Andreas Krause · 2026-03-11 · cs.AI

OOD-MMSafe: Advancing MLLM Safety from Harmful Intent to Hidden Consequences

The paper introduces OOD-MMSafe, a benchmark revealing significant causal blindness in current Multimodal Large Language Models regarding hidden consequences, and proposes the Consequence-Aware Safety Policy Optimization (CASPO) framework to effectively mitigate these risks by shifting safety alignment from intent detection to consequence projection.

Ming Wen, Kun Yang, Jingyu Zhang, Yuxuan Liu, Shiwen Cui, Shouling Ji, Xingjun Ma · 2026-03-11 · cs.AI

MUGEN: Evaluating and Improving Multi-audio Understanding of Large Audio-Language Models

This paper introduces MUGEN, a comprehensive benchmark revealing that Large Audio-Language Models struggle with multi-audio understanding as the number of input audio clips grows, and demonstrates that combining training-free strategies such as Audio-Permutational Self-Consistency with Chain-of-Thought prompting can significantly improve performance.

Chih-Kai Yang, Yun-Shao Tsai, Yu-Kai Guo, Ping-Le Tsai, Yen-Ting Piao, Hung-Wei Chen, Ting-Lin Hsiao, Yun-Man Hsu, Ke-Han Lu, Hung-yi Lee · 2026-03-11 · cs.AI

EXPLORE-Bench: Egocentric Scene Prediction with Long-Horizon Reasoning

This paper introduces EXPLORE-Bench, a benchmark derived from real first-person videos to evaluate the ability of multimodal large language models to perform long-horizon egocentric scene prediction, revealing significant performance gaps compared to humans and demonstrating that stepwise reasoning offers partial improvements at a computational cost.

Chengjun Yu, Xuhan Zhu, Chaoqun Du, Pengfei Yu, Wei Zhai, Yang Cao, Zheng-Jun Zha · 2026-03-11 · cs.AI

Ego: Embedding-Guided Personalization of Vision-Language Models

The paper proposes "Ego," an efficient personalization method for vision-language models that extracts visual tokens representing target concepts via internal attention mechanisms to serve as memory, enabling strong performance across single-concept, multi-concept, and video personalization tasks without requiring additional training stages or external modules.

Soroush Seifi, Simon Gardier, Vaggelis Dorovatas, Daniel Olmeda Reino, Rahaf Aljundi · 2026-03-11 · cs.AI

World2Mind: Cognition Toolkit for Allocentric Spatial Reasoning in Foundation Models

World2Mind is a training-free toolkit that enhances foundation models' allocentric spatial reasoning by constructing structured cognitive maps and an Allocentric-Spatial Tree, enabling significant performance gains and even allowing text-only models to achieve complex 3D spatial reasoning comparable to advanced multimodal systems.

Shouwei Ruan, Bin Wang, Zhenyu Wu, Qihui Zhu, Yuxiang Zhang, Hang Su, Yubin Wang · 2026-03-11 · cs.AI

First Estimation of Model Parameters for Neutrino-Induced Nucleon Knockout Using Simulation-Based Inference

This paper demonstrates that simulation-based inference (SBI) is a viable and potentially superior alternative to traditional empirical tuning for determining neutrino interaction model parameters, as it successfully reproduces and slightly improves upon the MicroBooNE collaboration's tuned GENIE configuration while also approximating the NuWro simulation.

Karla Tame-Narvaez, Steven Gardiner, Aleksandra Ciprijanovic, Giuseppe Cerati · 2026-03-11 · hep-ph

MA-EgoQA: Question Answering over Egocentric Videos from Multiple Embodied Agents

This paper introduces MA-EgoQA, a novel benchmark and dataset featuring 1,700 questions across five categories designed to evaluate the ability of AI models to understand and reason over multiple long-horizon egocentric videos from embodied agents, alongside a proposed baseline model named EgoMAS that highlights current limitations in system-level multi-agent understanding.

Kangsan Kim, Yanlai Yang, Suji Kim, Woongyeong Yeo, Youngwan Lee, Mengye Ren, Sung Ju Hwang · 2026-03-11 · cs.AI