Dynamic Multimodal Expression Generation for LLM-Driven Pedagogical Agents: From User Experience Perspective

This paper proposes a large language model-driven method for generating dynamic, semantically aligned speech and gestures for pedagogical agents in virtual reality, demonstrating through user experience experiments that such multimodal expressions significantly enhance learning effectiveness, engagement, and social presence while reducing fatigue and boredom.

Ninghao Wan, Jiarun Song, Fuzheng Yang · Wed, 11 Ma · cs

MORE-R1: Guiding LVLM for Multimodal Object-Entity Relation Extraction via Stepwise Reasoning with Reinforcement Learning

The paper introduces MORE-R1, a novel Large Vision-Language Model that leverages a two-stage training process combining Supervised Fine-Tuning on automatically constructed stepwise reasoning data and Reinforcement Learning with Group Relative Policy Optimization to achieve state-of-the-art performance in Multimodal Object-Entity Relation Extraction.

Xiang Yuan, Xu Chu, Xinrong Chen, Haochen Li, Zonghong Dai, Hongcheng Fan, Xiaoyue Yuan, Weiping Li, Tong Mo · Wed, 11 Ma · cs
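
The summary above names Group Relative Policy Optimization (GRPO). As a generic illustration of that training signal, and not MORE-R1's actual implementation (whose details are not given here), the sketch below shows how GRPO-style group-relative advantages are typically computed from per-sample rewards over a group of responses to the same prompt:

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within a group of responses sampled for the same
    prompt: advantage = (reward - group mean) / group std.
    This is the core idea behind GRPO; the surrounding policy-gradient loss
    and any reward design are assumptions, not details from the paper."""
    rewards = np.asarray(rewards, dtype=np.float64)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Example: four candidate reasoning chains scored by a hypothetical
# relation-extraction reward (1.0 for a correct triple, partial credit otherwise).
print(group_relative_advantages([1.0, 0.5, 0.0, 1.0]))
```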

MEGC2026: Micro-Expression Grand Challenge on Visual Question Answering

The MEGC 2026 challenge introduces two new tasks, Micro-Expression Video Question Answering (ME-VQA) and Micro-Expression Long-Video Question Answering (ME-LVQA), to advance the analysis of facial micro-expressions by leveraging the multimodal reasoning capabilities of large vision-language models on both short and long-duration video sequences.

Xinqi Fan, Jingting Li, John See, Moi Hoon Yap, Su-Jing Wang, Adrian K. Davison · Wed, 11 Ma · cs

VoxEmo: Benchmarking Speech Emotion Recognition with Speech LLMs

The paper introduces VoxEmo, a comprehensive benchmark and toolkit for evaluating Speech Large Language Models on speech emotion recognition across 35 corpora and 15 languages, featuring a distribution-aware soft-label protocol that reveals how these models uniquely align with human subjective emotion distributions despite trailing supervised baselines in hard-label accuracy.

Hezhao Zhang, Huang-Cheng Chou, Shrikanth Narayanan, Thomas Hain · Wed, 11 Ma · cs.AI
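
The "distribution-aware soft-label protocol" mentioned above can be illustrated generically: instead of scoring a single hard emotion label, the model's predicted emotion distribution is compared against the distribution of human annotator votes. The sketch below uses Jensen-Shannon divergence as the comparison measure; the specific measure and emotion set VoxEmo uses are assumptions here, not details from the paper.

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

EMOTIONS = ["angry", "happy", "neutral", "sad"]  # assumed label set

def soft_label_score(human_votes, model_probs):
    """Compare a model's emotion distribution to the distribution of human
    annotator votes for one utterance.  Returns the Jensen-Shannon
    divergence (0 = identical distributions); lower is better."""
    human = np.asarray(human_votes, dtype=float)
    human = human / human.sum()                 # vote counts -> distribution
    model = np.asarray(model_probs, dtype=float)
    model = model / model.sum()
    return jensenshannon(human, model) ** 2     # squared JS distance = divergence

# Example: 5 annotators split between "happy" and "neutral".
print(soft_label_score(human_votes=[0, 3, 2, 0],
                       model_probs=[0.05, 0.55, 0.35, 0.05]))
```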

Soundscapes in Spectrograms: Pioneering Multilabel Classification for South Asian Sounds

This paper proposes a novel spectrogram-based Convolutional Neural Network (CNN) approach for multilabel environmental sound classification that significantly outperforms traditional MFCC-based methods on the South Asian SAS-KIIT and UrbanSound8K datasets, offering a more robust solution for complex, overlapping acoustic environments.

Sudip Chakrabarty, Pappu Bishwas, Rajdeep Chatterjee, Tathagata Bandyopadhyay, Digonto Biswas, Bibek Howlader · Tue, 10 Ma · cs
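
A minimal sketch of the general approach described above: a log-mel spectrogram fed to a small CNN with a sigmoid multilabel head trained with binary cross-entropy, so several overlapping sound classes can be active at once. Layer sizes and the number of classes are assumptions, not details from the paper.

```python
import torch
import torch.nn as nn

class SpectrogramCNN(nn.Module):
    """Small CNN over log-mel spectrograms with a multilabel output head."""
    def __init__(self, n_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.AdaptiveAvgPool2d(1),
        )
        self.classifier = nn.Linear(32, n_classes)

    def forward(self, x):                     # x: (batch, 1, mels, frames)
        h = self.features(x).flatten(1)
        return self.classifier(h)             # raw logits, one per sound class

model = SpectrogramCNN(n_classes=10)
spec = torch.randn(4, 1, 64, 128)             # stand-in for log-mel spectrograms
labels = torch.randint(0, 2, (4, 10)).float() # multiple classes may co-occur
loss = nn.BCEWithLogitsLoss()(model(spec), labels)
loss.backward()
```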

Controllable Complex Human Motion Video Generation via Text-to-Skeleton Cascades

This paper proposes a two-stage cascaded framework that generates controllable complex human motion videos by first using an autoregressive model to synthesize 2D skeleton sequences from text descriptions and then employing a pose-conditioned diffusion model with adaptive layer fusion to render high-fidelity videos, supported by a new synthetic dataset designed to overcome limitations in existing benchmarks.

Ashkan Taghipour, Morteza Ghahremani, Zinuo Li, Hamid Laga, Farid Boussaid, Mohammed Bennamoun · Tue, 10 Ma · cs
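
One way to make the first stage concrete: an autoregressive model over 2D skeletons needs continuous joint coordinates turned into discrete tokens it can predict one by one. The sketch below shows a simple uniform-quantization tokenizer; the paper's actual tokenization, joint count, and vocabulary are not stated in this summary and are assumptions.

```python
import numpy as np

N_BINS = 256  # assumed quantization resolution per coordinate

def skeleton_to_tokens(poses):
    """Quantize a (frames, joints, 2) array of normalized 2D joint
    coordinates in [0, 1] into a flat sequence of integer tokens."""
    poses = np.clip(np.asarray(poses, dtype=float), 0.0, 1.0)
    tokens = np.round(poses * (N_BINS - 1)).astype(int)
    return tokens.reshape(-1)                  # (frames * joints * 2,)

def tokens_to_skeleton(tokens, n_joints):
    """Invert the quantization back to coordinates for the rendering stage."""
    coords = np.asarray(tokens, dtype=float) / (N_BINS - 1)
    return coords.reshape(-1, n_joints, 2)

seq = skeleton_to_tokens(np.random.rand(16, 17, 2))   # 16 frames, 17 joints
recovered = tokens_to_skeleton(seq, n_joints=17)
```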

CONSTANT: Towards High-Quality One-Shot Handwriting Generation with Patch Contrastive Enhancement and Style-Aware Quantization

The paper introduces CONSTANT, a novel one-shot handwriting generation framework that leverages Style-Aware Quantization and a latent patch-based contrastive objective within a diffusion model to overcome existing limitations in capturing diverse writer styles and generating high-quality, realistic handwritten images across multiple languages.

Anh-Duy Le, Van-Linh Pham, Thanh-Nam Vo, Xuan Toan Mai, Tuan-Anh Tran · Tue, 10 Ma · cs
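
The "latent patch-based contrastive objective" named above can be illustrated with a standard InfoNCE loss over patch embeddings, pulling patches that share a writer's style together and pushing other patches apart. The pairing strategy, temperature, and embedding source here are assumptions, not CONSTANT's exact objective.

```python
import torch
import torch.nn.functional as F

def patch_info_nce(anchor, positive, temperature=0.1):
    """InfoNCE over latent patches: each anchor patch embedding should be
    closest to its positive (a patch from the same writer/style); all other
    positives in the batch act as negatives.
    anchor, positive: (num_patches, dim) tensors."""
    a = F.normalize(anchor, dim=-1)
    p = F.normalize(positive, dim=-1)
    logits = a @ p.t() / temperature           # (num_patches, num_patches)
    targets = torch.arange(a.size(0))          # i-th anchor matches i-th positive
    return F.cross_entropy(logits, targets)

loss = patch_info_nce(torch.randn(32, 128), torch.randn(32, 128))
```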

Multi-Domain Audio Question Answering Benchmark Toward Acoustic Content Reasoning

This paper introduces Task 5 of the DCASE 2025 Challenge, a multi-domain Audio Question Answering benchmark designed to evaluate and advance the acoustic reasoning capabilities of audio-language models across diverse scenarios including bioacoustics, temporal soundscapes, and complex real-world clips.

Chao-Han Huck Yang, Sreyan Ghosh, Qing Wang, Jaeyeon Kim, Hengyi Hong, Sonal Kumar, Guirui Zhong, Zhifeng Kong, S Sakshi, Vaibhavi Lokegaonkar, Oriol Nieto, Ramani Duraiswami, Dinesh Manocha, Gunhee Kim, Jun Du, Rafael Valle, Bryan Catanzaro · Tue, 10 Ma · cs.CL

TimeSpot: Benchmarking Geo-Temporal Understanding in Vision-Language Models in Real-World Settings

This paper introduces TimeSpot, a comprehensive benchmark comprising 1,455 real-world images from 80 countries designed to evaluate geo-temporal reasoning in vision-language models, revealing that current models remain limited in predicting location, time, and environmental context from visual evidence alone.

Azmine Toushik Wasi, Shahriyar Zaman Ridoy, Koushik Ahamed Tonmoy, Kinga Tshering, S. M. Muhtasimul Hasan, Wahid Faisal, Tasnim Mohiuddin, Md Rizwan Parvez · Tue, 10 Ma · cs.CL
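
For the location-prediction part of such a benchmark, geo-localization work typically reports the great-circle (haversine) distance between predicted and ground-truth coordinates; whether TimeSpot uses exactly this metric is not stated in the summary, so the sketch below is only a generic illustration.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points,
    a common error metric for image geo-localization."""
    r = 6371.0  # mean Earth radius in km
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * r * math.asin(math.sqrt(a))

# Example: predicted Dhaka vs. ground-truth Kolkata, roughly 250 km apart.
print(haversine_km(23.81, 90.41, 22.57, 88.36))
```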

Emotion Collider: Dual Hyperbolic Mirror Manifolds for Sentiment Recovery via Anti Emotion Reflection

The paper introduces Emotion Collider (EC-Net), a hyperbolic hypergraph framework that leverages Poincaré-ball embeddings, bidirectional message passing, and contrastive learning to achieve robust and noise-resilient multimodal sentiment analysis by preserving high-order semantic relations and enhancing class separation.

Rong Fu, Ziming Wang, Shuo Yin, Haiyun Wei, Kun Liu, Xianda Li, Zeli Su, Simon Fong · Tue, 10 Ma · cs.LG
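
The Poincaré-ball embeddings mentioned above have a closed-form geodesic distance, which is what lets hyperbolic models separate hierarchical or high-order structure. A minimal sketch of that distance, independent of EC-Net's architecture:

```python
import numpy as np

def poincare_distance(u, v, eps=1e-9):
    """Geodesic distance in the Poincare-ball model of hyperbolic space:
    d(u, v) = arcosh(1 + 2*||u - v||^2 / ((1 - ||u||^2) * (1 - ||v||^2))).
    Points near the boundary (norm close to 1) end up far apart, giving
    hyperbolic embeddings their capacity for tree-like structure."""
    u, v = np.asarray(u, dtype=float), np.asarray(v, dtype=float)
    sq = np.sum((u - v) ** 2)
    denom = (1.0 - np.sum(u ** 2)) * (1.0 - np.sum(v ** 2))
    return np.arccosh(1.0 + 2.0 * sq / (denom + eps))

print(poincare_distance([0.1, 0.0], [0.0, 0.9]))
```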