Dynamic Multimodal Expression Generation for LLM-Driven Pedagogical Agents: From User Experience Perspective

This paper proposes a large language model-driven method for generating dynamic, semantically aligned speech and gestures for pedagogical agents in virtual reality, demonstrating through user experience experiments that such multimodal expressions significantly enhance learning effectiveness, engagement, and social presence while reducing fatigue and boredom.

Ninghao Wan, Jiarun Song, Fuzheng Yang · Wed, 11 Ma · cs

MORE-R1: Guiding LVLM for Multimodal Object-Entity Relation Extraction via Stepwise Reasoning with Reinforcement Learning

The paper introduces MORE-R1, a novel Large Vision-Language Model that leverages a two-stage training process combining Supervised Fine-Tuning on automatically constructed stepwise reasoning data and Reinforcement Learning with Group Relative Policy Optimization to achieve state-of-the-art performance in Multimodal Object-Entity Relation Extraction.

Xiang Yuan, Xu Chu, Xinrong Chen, Haochen Li, Zonghong Dai, Hongcheng Fan, Xiaoyue Yuan, Weiping Li, Tong Mo · Wed, 11 Ma · cs
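
The summary above names Group Relative Policy Optimization (GRPO). As a generic illustration of that training signal, and not MORE-R1's actual implementation (whose details are not given here), the sketch below shows how GRPO-style group-relative advantages are typically computed from per-sample rewards over a group of responses to the same prompt:

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within a group of responses sampled for the same
    prompt: advantage = (reward - group mean) / group std.
    This is the core idea behind GRPO; the surrounding policy-gradient loss
    and any reward design are assumptions, not details from the paper."""
    rewards = np.asarray(rewards, dtype=np.float64)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Example: four candidate reasoning chains scored by a hypothetical
# relation-extraction reward (1.0 for a correct triple, partial credit otherwise).
print(group_relative_advantages([1.0, 0.5, 0.0, 1.0]))
```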

MEGC2026: Micro-Expression Grand Challenge on Visual Question Answering

The MEGC 2026 challenge introduces two new tasks, Micro-Expression Video Question Answering (ME-VQA) and Micro-Expression Long-Video Question Answering (ME-LVQA), to advance the analysis of facial micro-expressions by leveraging the multimodal reasoning capabilities of large vision-language models on both short and long-duration video sequences.

Xinqi Fan, Jingting Li, John See, Moi Hoon Yap, Su-Jing Wang, Adrian K. Davison · Wed, 11 Ma · cs

VoxEmo: Benchmarking Speech Emotion Recognition with Speech LLMs

The paper introduces VoxEmo, a comprehensive benchmark and toolkit for evaluating Speech Large Language Models on speech emotion recognition across 35 corpora and 15 languages, featuring a distribution-aware soft-label protocol that reveals how these models uniquely align with human subjective emotion distributions despite trailing supervised baselines in hard-label accuracy.

Hezhao Zhang, Huang-Cheng Chou, Shrikanth Narayanan, Thomas Hain · Wed, 11 Ma · cs.AI
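
The "distribution-aware soft-label protocol" mentioned above can be illustrated generically: instead of scoring a single hard emotion label, the model's predicted emotion distribution is compared against the distribution of human annotator votes. The sketch below uses Jensen-Shannon divergence as the comparison measure; the specific measure and emotion set VoxEmo uses are assumptions here, not details from the paper.

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

EMOTIONS = ["angry", "happy", "neutral", "sad"]  # assumed label set

def soft_label_score(human_votes, model_probs):
    """Compare a model's emotion distribution to the distribution of human
    annotator votes for one utterance.  Returns the Jensen-Shannon
    divergence (0 = identical distributions); lower is better."""
    human = np.asarray(human_votes, dtype=float)
    human = human / human.sum()                 # vote counts -> distribution
    model = np.asarray(model_probs, dtype=float)
    model = model / model.sum()
    return jensenshannon(human, model) ** 2     # squared JS distance = divergence

# Example: 5 annotators split between "happy" and "neutral".
print(soft_label_score(human_votes=[0, 3, 2, 0],
                       model_probs=[0.05, 0.55, 0.35, 0.05]))
```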

Soundscapes in Spectrograms: Pioneering Multilabel Classification for South Asian Sounds

This paper proposes a novel spectrogram-based Convolutional Neural Network (CNN) approach for multilabel environmental sound classification that significantly outperforms traditional MFCC-based methods on the South Asian SAS-KIIT and UrbanSound8K datasets, offering a more robust solution for complex, overlapping acoustic environments.

Sudip Chakrabarty, Pappu Bishwas, Rajdeep Chatterjee, Tathagata Bandyopadhyay, Digonto Biswas, Bibek Howlader · Tue, 10 Ma · cs
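
A minimal sketch of the general approach described above: a log-mel spectrogram fed to a small CNN with a sigmoid multilabel head trained with binary cross-entropy, so several overlapping sound classes can be active at once. Layer sizes and the number of classes are assumptions, not details from the paper.

```python
import torch
import torch.nn as nn

class SpectrogramCNN(nn.Module):
    """Small CNN over log-mel spectrograms with a multilabel output head."""
    def __init__(self, n_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.AdaptiveAvgPool2d(1),
        )
        self.classifier = nn.Linear(32, n_classes)

    def forward(self, x):                     # x: (batch, 1, mels, frames)
        h = self.features(x).flatten(1)
        return self.classifier(h)             # raw logits, one per sound class

model = SpectrogramCNN(n_classes=10)
spec = torch.randn(4, 1, 64, 128)             # stand-in for log-mel spectrograms
labels = torch.randint(0, 2, (4, 10)).float() # multiple classes may co-occur
loss = nn.BCEWithLogitsLoss()(model(spec), labels)
loss.backward()
```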

Controllable Complex Human Motion Video Generation via Text-to-Skeleton Cascades

This paper proposes a two-stage cascaded framework that generates controllable complex human motion videos by first using an autoregressive model to synthesize 2D skeleton sequences from text descriptions and then employing a pose-conditioned diffusion model with adaptive layer fusion to render high-fidelity videos, supported by a new synthetic dataset designed to overcome limitations in existing benchmarks.

Ashkan Taghipour, Morteza Ghahremani, Zinuo Li, Hamid Laga, Farid Boussaid, Mohammed Bennamoun · Tue, 10 Ma · cs
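
One way to make the first stage concrete: an autoregressive model over 2D skeletons needs continuous joint coordinates turned into discrete tokens it can predict one by one. The sketch below shows a simple uniform-quantization tokenizer; the paper's actual tokenization, joint count, and vocabulary are not stated in this summary and are assumptions.

```python
import numpy as np

N_BINS = 256  # assumed quantization resolution per coordinate

def skeleton_to_tokens(poses):
    """Quantize a (frames, joints, 2) array of normalized 2D joint
    coordinates in [0, 1] into a flat sequence of integer tokens."""
    poses = np.clip(np.asarray(poses, dtype=float), 0.0, 1.0)
    tokens = np.round(poses * (N_BINS - 1)).astype(int)
    return tokens.reshape(-1)                  # (frames * joints * 2,)

def tokens_to_skeleton(tokens, n_joints):
    """Invert the quantization back to coordinates for the rendering stage."""
    coords = np.asarray(tokens, dtype=float) / (N_BINS - 1)
    return coords.reshape(-1, n_joints, 2)

seq = skeleton_to_tokens(np.random.rand(16, 17, 2))   # 16 frames, 17 joints
recovered = tokens_to_skeleton(seq, n_joints=17)
```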

CONSTANT: Towards High-Quality One-Shot Handwriting Generation with Patch Contrastive Enhancement and Style-Aware Quantization

The paper introduces CONSTANT, a novel one-shot handwriting generation framework that leverages Style-Aware Quantization and a latent patch-based contrastive objective within a diffusion model to overcome existing limitations in capturing diverse writer styles and generating high-quality, realistic handwritten images across multiple languages.

Anh-Duy Le, Van-Linh Pham, Thanh-Nam Vo, Xuan Toan Mai, Tuan-Anh Tran · Tue, 10 Ma · cs
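
The "latent patch-based contrastive objective" named above can be illustrated with a standard InfoNCE loss over patch embeddings, pulling patches that share a writer's style together and pushing other patches apart. The pairing strategy, temperature, and embedding source here are assumptions, not CONSTANT's exact objective.

```python
import torch
import torch.nn.functional as F

def patch_info_nce(anchor, positive, temperature=0.1):
    """InfoNCE over latent patches: each anchor patch embedding should be
    closest to its positive (a patch from the same writer/style); all other
    positives in the batch act as negatives.
    anchor, positive: (num_patches, dim) tensors."""
    a = F.normalize(anchor, dim=-1)
    p = F.normalize(positive, dim=-1)
    logits = a @ p.t() / temperature           # (num_patches, num_patches)
    targets = torch.arange(a.size(0))          # i-th anchor matches i-th positive
    return F.cross_entropy(logits, targets)

loss = patch_info_nce(torch.randn(32, 128), torch.randn(32, 128))
```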

Multi-Domain Audio Question Answering Benchmark Toward Acoustic Content Reasoning

This paper introduces Task 5 of the DCASE 2025 Challenge, a multi-domain Audio Question Answering benchmark designed to evaluate and advance the acoustic reasoning capabilities of audio-language models across diverse scenarios including bioacoustics, temporal soundscapes, and complex real-world clips.

Chao-Han Huck Yang, Sreyan Ghosh, Qing Wang, Jaeyeon Kim, Hengyi Hong, Sonal Kumar, Guirui Zhong, Zhifeng Kong, S Sakshi, Vaibhavi Lokegaonkar, Oriol Nieto, Ramani Duraiswami, Dinesh Manocha, Gunhee Kim, Jun Du, Rafael Valle, Bryan Catanzaro · Tue, 10 Ma · cs.CL

TimeSpot: Benchmarking Geo-Temporal Understanding in Vision-Language Models in Real-World Settings

This paper introduces TimeSpot, a comprehensive benchmark comprising 1,455 real-world images from 80 countries designed to evaluate geo-temporal reasoning in vision-language models, revealing that current models remain limited in predicting location, time, and environmental context from visual evidence alone.

Azmine Toushik Wasi, Shahriyar Zaman Ridoy, Koushik Ahamed Tonmoy, Kinga Tshering, S. M. Muhtasimul Hasan, Wahid Faisal, Tasnim Mohiuddin, Md Rizwan Parvez · Tue, 10 Ma · cs.CL
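
For the location-prediction part of such a benchmark, geo-localization work typically reports the great-circle (haversine) distance between predicted and ground-truth coordinates; whether TimeSpot uses exactly this metric is not stated in the summary, so the sketch below is only a generic illustration.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points,
    a common error metric for image geo-localization."""
    r = 6371.0  # mean Earth radius in km
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * r * math.asin(math.sqrt(a))

# Example: predicted Dhaka vs. ground-truth Kolkata, roughly 250 km apart.
print(haversine_km(23.81, 90.41, 22.57, 88.36))
```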

Emotion Collider: Dual Hyperbolic Mirror Manifolds for Sentiment Recovery via Anti Emotion Reflection

The paper introduces Emotion Collider (EC-Net), a hyperbolic hypergraph framework that leverages Poincaré-ball embeddings, bidirectional message passing, and contrastive learning to achieve robust and noise-resilient multimodal sentiment analysis by preserving high-order semantic relations and enhancing class separation.

Rong Fu, Ziming Wang, Shuo Yin, Haiyun Wei, Kun Liu, Xianda Li, Zeli Su, Simon Fong · Tue, 10 Ma · cs.LG
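
The Poincaré-ball embeddings mentioned above have a closed-form geodesic distance, which is what lets hyperbolic models separate hierarchical or high-order structure. A minimal sketch of that distance, independent of EC-Net's architecture:

```python
import numpy as np

def poincare_distance(u, v, eps=1e-9):
    """Geodesic distance in the Poincare-ball model of hyperbolic space:
    d(u, v) = arcosh(1 + 2*||u - v||^2 / ((1 - ||u||^2) * (1 - ||v||^2))).
    Points near the boundary (norm close to 1) end up far apart, giving
    hyperbolic embeddings their capacity for tree-like structure."""
    u, v = np.asarray(u, dtype=float), np.asarray(v, dtype=float)
    sq = np.sum((u - v) ** 2)
    denom = (1.0 - np.sum(u ** 2)) * (1.0 - np.sum(v ** 2))
    return np.arccosh(1.0 + 2.0 * sq / (denom + eps))

print(poincare_distance([0.1, 0.0], [0.0, 0.9]))
```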