LARA-Gen: Enabling Continuous Emotion Control for Music Generation Models via Latent Affective Representation Alignment

LARA-Gen introduces a framework for continuous, fine-grained emotion control in music generation by aligning latent affective representations with an external emotion predictor and utilizing a valence-arousal control module, thereby overcoming the limitations of text-based prompting and significantly improving both emotional adherence and music quality.

Jiahao Mei, Xuenan Xu, Zeyu Xie, Zihao Zheng, Ye Tao, Yue Ding, Mengyue Wu · Wed, 11 Ma · cs

EmoSURA: Towards Accurate Evaluation of Detailed and Long-Context Emotional Speech Captions

This paper introduces EmoSURA, a novel evaluation framework that improves the assessment of long-form emotional speech captions by decomposing them into atomic perceptual units for audio-grounded verification, addressing the limitations of traditional metrics and LLM judges while providing the standardized SURABench resource.

Xin Jing, Andreas Triantafyllopoulos, Jiadong Wang, Shahin Amiriparian, Jun Luo, Björn Schuller · Wed, 11 Ma · cs

Rethinking Discrete Speech Representation Tokens for Accent Generation

This paper presents the first systematic investigation into how accent information is encoded in Discrete Speech Representation Tokens (DSRTs), introducing a unified evaluation framework that reveals layer selection is the most critical factor for retaining accents, while ASR supervision significantly diminishes them and naive codebook reduction fails to disentangle accent from phonetic and speaker information.

Jinzuomu Zhong, Yi Wang, Korin Richmond, Peter Bell · Wed, 11 Ma · eess

Fast-Converging Distributed Signal Estimation in Topology-Unconstrained Wireless Acoustic Sensor Networks

This paper proposes TI-DANSE+, an improved distributed signal estimation algorithm for topology-unconstrained wireless acoustic sensor networks that accelerates convergence by utilizing partial in-network sums and a tree-pruning strategy, while maintaining robustness to link failures and reducing communication bandwidth compared to existing methods.

Paul Didier, Toon van Waterschoot, Simon Doclo, Jörg Bitzer, Marc Moonen · Wed, 11 Ma · eess

How Contrastive Decoding Enhances Large Audio Language Models?

This paper systematically evaluates four Contrastive Decoding strategies across diverse Large Audio Language Models, identifying Audio-Aware and Audio Contrastive Decoding as most effective while introducing a Transition Matrix framework to demonstrate that these methods successfully rectify specific error patterns like false audio absence claims but fail to correct flawed reasoning or confident misassertions.
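The general contrastive-decoding idea evaluated here can be sketched as follows: score each candidate token by how much the audio-conditioned pass prefers it over an audio-ablated pass, which penalizes text-only priors. This is a minimal NumPy sketch of the generic technique, not the paper's implementation; the function name and the penalty weight `alpha` are illustrative.

```python
import numpy as np

def contrastive_decode_step(logits_with_audio, logits_without_audio, alpha=1.0):
    """Pick the token maximizing log p_cond - alpha * log p_uncond,
    i.e. the token the audio evidence boosts most over a text-only prior."""
    log_p_cond = logits_with_audio - np.log(np.sum(np.exp(logits_with_audio)))
    log_p_uncond = logits_without_audio - np.log(np.sum(np.exp(logits_without_audio)))
    scores = log_p_cond - alpha * log_p_uncond
    return int(np.argmax(scores))

# Plain greedy decoding on the audio-conditioned logits picks token 0
# (a strong text-only prior); contrastive scoring picks token 2, the
# token actually boosted by the audio.
with_audio = np.array([2.3, 0.5, 2.2])
without_audio = np.array([2.3, 0.5, 0.1])
assert int(np.argmax(with_audio)) == 0
assert contrastive_decode_step(with_audio, without_audio) == 2
```

The same mechanism explains the paper's failure modes: contrastive scoring can suppress a token the model favors for text-only reasons, but it cannot repair a conditional distribution that is confidently wrong in both passes.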

Tzu-Quan Lin, Wei-Ping Huang, Yi-Cheng Lin, Hung-yi Lee · Wed, 11 Ma · cs.CL

Trade-offs Between Capacity and Robustness in Neural Audio Codecs for Adversarially Robust Speech Recognition

This paper demonstrates that neural audio codecs achieve optimal adversarial robustness in speech recognition at intermediate residual vector quantization depths, which effectively balance the suppression of adversarial perturbations with the preservation of speech content, outperforming traditional compression defenses.
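The capacity knob in question is the number of residual vector quantization stages: each stage quantizes the residual left by the previous stages, so deeper stacks reconstruct the input more faithfully (preserving speech content, but also re-admitting adversarial perturbations). This is a toy NumPy sketch of generic RVQ, not the codecs studied in the paper; codebook sizes and dimensions are illustrative, and a zero codeword is included so that error is non-increasing in depth.

```python
import numpy as np

def rvq_encode(x, codebooks):
    """Residual vector quantization: stage k quantizes the residual left
    by stages 1..k-1; the reconstruction is the sum of chosen codewords."""
    recon = np.zeros_like(x)
    for cb in codebooks:
        residual = x - recon
        idx = np.argmin(np.linalg.norm(cb - residual, axis=1))
        recon = recon + cb[idx]
    return recon

rng = np.random.default_rng(0)
# Each codebook keeps a zero codeword so a stage can pass through unchanged.
codebooks = [np.vstack([np.zeros((1, 4)), rng.normal(size=(16, 4))])
             for _ in range(8)]
x = rng.normal(size=4)
err_shallow = np.linalg.norm(x - rvq_encode(x, codebooks[:2]))
err_deep = np.linalg.norm(x - rvq_encode(x, codebooks))
assert err_deep <= err_shallow  # deeper RVQ reconstructs more faithfully
```

The paper's trade-off lives on this axis: too few stages discard content along with perturbations, too many stages faithfully reconstruct both, and an intermediate depth balances the two.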

Jordan Prescott, Thanathai Lertpetchpun, Shrikanth Narayanan · Wed, 11 Ma · eess

Universal Speech Content Factorization

The paper proposes Universal Speech Content Factorization (USCF), a simple and invertible linear method that extracts low-rank, speaker-independent speech representations to enable competitive zero-shot voice conversion and efficient training of timbre-prompted text-to-speech models using minimal target speaker data.
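The paper's exact factorization is not reproduced here, but the general shape of a low-rank, linearly invertible split of frame-level features can be sketched with a truncated SVD: a low-rank component plays the role of shared content, and the residual carries what the projection discards (e.g. speaker cues). Rank, shapes, and the content/residual interpretation below are illustrative assumptions.

```python
import numpy as np

def lowrank_factorize(features, rank):
    """Linearly split features into a low-rank component and a residual;
    the split is exactly invertible since the two parts sum to the input."""
    u, s, vt = np.linalg.svd(features, full_matrices=False)
    content = (u[:, :rank] * s[:rank]) @ vt[:rank]  # low-rank component
    residual = features - content                   # everything discarded
    return content, residual

rng = np.random.default_rng(1)
feats = rng.normal(size=(50, 8))            # 50 frames, 8-dim features
content, residual = lowrank_factorize(feats, rank=3)
assert np.linalg.matrix_rank(content) == 3  # low-rank by construction
assert np.allclose(content + residual, feats)  # invertible decomposition
```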

Henry Li Xinyuan, Zexin Cai, Lin Zhang, Leibny Paola García-Perera, Berrak Sisman, Sanjeev Khudanpur, Nicholas Andrews, Matthew Wiesner · Wed, 11 Ma · eess

VSSFlow: Unifying Video-conditioned Sound and Speech Generation via Joint Learning

VSSFlow introduces a unified flow-matching framework that seamlessly integrates Video-to-Sound and Visual Text-to-Speech generation through a disentangled condition aggregation mechanism, demonstrating that joint learning can surpass specialized state-of-the-art baselines without performance degradation.
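The flow-matching backbone VSSFlow builds on has a simple training recipe in its common linear-interpolant form: sample a time t, interpolate between noise and data, and regress the model toward the constant velocity pointing from noise to data. This is a minimal sketch of that generic objective's targets (not VSSFlow's architecture or conditioning):

```python
import numpy as np

def flow_matching_target(x0, x1, t):
    """Linear-interpolant flow matching: the training input is
    x_t = (1-t)*x0 + t*x1 and the regression target is the
    constant velocity x1 - x0."""
    x_t = (1.0 - t) * x0 + t * x1
    v_target = x1 - x0
    return x_t, v_target

noise = np.array([0.0, 1.0])   # x0 ~ prior sample
data = np.array([2.0, -1.0])   # x1 = training example
x_t, v = flow_matching_target(noise, data, t=0.5)
assert np.allclose(x_t, [1.0, 0.0])
assert np.allclose(v, [2.0, -2.0])
```

Unifying Video-to-Sound and Visual TTS then amounts to sharing this one objective while the condition aggregation mechanism routes the task-specific inputs.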

Xin Cheng, Yuyue Wang, Xihua Wang, Yihan Wu, Kaisi Guan, Yijing Chen, Peng Zhang, Xiaojiang Liu, Meng Cao, Ruihua Song · Wed, 11 Ma · cs.AI

MUGEN: Evaluating and Improving Multi-audio Understanding of Large Audio-Language Models

This paper introduces MUGEN, a comprehensive benchmark revealing that Large Audio-Language Models increasingly struggle with multi-audio understanding as the number of audio inputs grows, and demonstrates that combining training-free strategies like Audio-Permutational Self-Consistency with Chain-of-Thought can significantly improve performance.
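A permutation-based self-consistency strategy of the kind named here can be sketched generically: query the model once per ordering of the audio inputs and majority-vote over the answers, washing out position bias. The toy "model" below is a stand-in invented for illustration, not MUGEN's procedure or any evaluated model.

```python
from collections import Counter
from itertools import permutations

def permutation_self_consistency(audios, answer_fn):
    """Ask the model about every ordering of the audio inputs and
    return the majority answer, reducing sensitivity to input order."""
    votes = Counter(answer_fn(list(p)) for p in permutations(audios))
    return votes.most_common(1)[0][0]

# Toy position-biased "model": answers correctly ("dog") unless the
# dog clip is placed last, in which case it parrots the first clip.
def toy_model(order):
    return "dog" if order[-1] != "dog_bark" else order[0]

# 4 of the 6 orderings yield "dog", so the vote recovers the right answer.
assert permutation_self_consistency(["dog_bark", "rain", "car"], toy_model) == "dog"
```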

Chih-Kai Yang, Yun-Shao Tsai, Yu-Kai Guo, Ping-Le Tsai, Yen-Ting Piao, Hung-Wei Chen, Ting-Lin Hsiao, Yun-Man Hsu, Ke-Han Lu, Hung-yi Lee · Wed, 11 Ma · cs.AI