LARA-Gen: Enabling Continuous Emotion Control for Music Generation Models via Latent Affective Representation Alignment

LARA-Gen introduces a framework for continuous, fine-grained emotion control in music generation by aligning latent affective representations with an external emotion predictor and utilizing a valence-arousal control module, thereby overcoming the limitations of text-based prompting and significantly improving both emotional adherence and music quality.

Jiahao Mei, Xuenan Xu, Zeyu Xie, Zihao Zheng, Ye Tao, Yue Ding, Mengyue Wu · Wed, 11 Ma · cs

EmoSURA: Towards Accurate Evaluation of Detailed and Long-Context Emotional Speech Captions

This paper introduces EmoSURA, a novel evaluation framework that improves the assessment of long-form emotional speech captions by decomposing them into atomic perceptual units for audio-grounded verification, addressing the limitations of traditional metrics and LLM judges while providing the standardized SURABench resource.

Xin Jing, Andreas Triantafyllopoulos, Jiadong Wang, Shahin Amiriparian, Jun Luo, Björn Schuller · Wed, 11 Ma · cs

Rethinking Discrete Speech Representation Tokens for Accent Generation

This paper presents the first systematic investigation into how accent information is encoded in Discrete Speech Representation Tokens (DSRTs), introducing a unified evaluation framework that reveals layer selection is the most critical factor for retaining accents, while ASR supervision significantly diminishes them and naive codebook reduction fails to disentangle accent from phonetic and speaker information.

Jinzuomu Zhong, Yi Wang, Korin Richmond, Peter Bell · Wed, 11 Ma · eess

Fast-Converging Distributed Signal Estimation in Topology-Unconstrained Wireless Acoustic Sensor Networks

This paper proposes TI-DANSE+, an improved distributed signal estimation algorithm for topology-unconstrained wireless acoustic sensor networks that accelerates convergence by utilizing partial in-network sums and a tree-pruning strategy, while maintaining robustness to link failures and reducing communication bandwidth compared to existing methods.

Paul Didier, Toon van Waterschoot, Simon Doclo, Jörg Bitzer, Marc Moonen · Wed, 11 Ma · eess

How Contrastive Decoding Enhances Large Audio Language Models?

This paper systematically evaluates four Contrastive Decoding strategies across diverse Large Audio Language Models, identifying Audio-Aware and Audio Contrastive Decoding as most effective while introducing a Transition Matrix framework to demonstrate that these methods successfully rectify specific error patterns like false audio absence claims but fail to correct flawed reasoning or confident misassertions.
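The general contrastive-decoding idea evaluated here can be sketched as follows: score each candidate token by how much the audio-conditioned pass prefers it over an audio-ablated pass, which penalizes text-only priors. This is a minimal NumPy sketch of the generic technique, not the paper's implementation; the function name and the penalty weight `alpha` are illustrative.

```python
import numpy as np

def contrastive_decode_step(logits_with_audio, logits_without_audio, alpha=1.0):
    """Pick the token maximizing log p_cond - alpha * log p_uncond,
    i.e. the token the audio evidence boosts most over a text-only prior."""
    log_p_cond = logits_with_audio - np.log(np.sum(np.exp(logits_with_audio)))
    log_p_uncond = logits_without_audio - np.log(np.sum(np.exp(logits_without_audio)))
    scores = log_p_cond - alpha * log_p_uncond
    return int(np.argmax(scores))

# Plain greedy decoding on the audio-conditioned logits picks token 0
# (a strong text-only prior); contrastive scoring picks token 2, the
# token actually boosted by the audio.
with_audio = np.array([2.3, 0.5, 2.2])
without_audio = np.array([2.3, 0.5, 0.1])
assert int(np.argmax(with_audio)) == 0
assert contrastive_decode_step(with_audio, without_audio) == 2
```

The same mechanism explains the paper's failure modes: contrastive scoring can suppress a token the model favors for text-only reasons, but it cannot repair a conditional distribution that is confidently wrong in both passes.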

Tzu-Quan Lin, Wei-Ping Huang, Yi-Cheng Lin, Hung-yi Lee · Wed, 11 Ma · cs.CL

Trade-offs Between Capacity and Robustness in Neural Audio Codecs for Adversarially Robust Speech Recognition

This paper demonstrates that neural audio codecs achieve optimal adversarial robustness in speech recognition at intermediate residual vector quantization depths, which effectively balance the suppression of adversarial perturbations with the preservation of speech content, outperforming traditional compression defenses.
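The capacity knob in question is the number of residual vector quantization stages: each stage quantizes the residual left by the previous stages, so deeper stacks reconstruct the input more faithfully (preserving speech content, but also re-admitting adversarial perturbations). This is a toy NumPy sketch of generic RVQ, not the codecs studied in the paper; codebook sizes and dimensions are illustrative, and a zero codeword is included so that error is non-increasing in depth.

```python
import numpy as np

def rvq_encode(x, codebooks):
    """Residual vector quantization: stage k quantizes the residual left
    by stages 1..k-1; the reconstruction is the sum of chosen codewords."""
    recon = np.zeros_like(x)
    for cb in codebooks:
        residual = x - recon
        idx = np.argmin(np.linalg.norm(cb - residual, axis=1))
        recon = recon + cb[idx]
    return recon

rng = np.random.default_rng(0)
# Each codebook keeps a zero codeword so a stage can pass through unchanged.
codebooks = [np.vstack([np.zeros((1, 4)), rng.normal(size=(16, 4))])
             for _ in range(8)]
x = rng.normal(size=4)
err_shallow = np.linalg.norm(x - rvq_encode(x, codebooks[:2]))
err_deep = np.linalg.norm(x - rvq_encode(x, codebooks))
assert err_deep <= err_shallow  # deeper RVQ reconstructs more faithfully
```

The paper's trade-off lives on this axis: too few stages discard content along with perturbations, too many stages faithfully reconstruct both, and an intermediate depth balances the two.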

Jordan Prescott, Thanathai Lertpetchpun, Shrikanth Narayanan · Wed, 11 Ma · eess

Universal Speech Content Factorization

The paper proposes Universal Speech Content Factorization (USCF), a simple and invertible linear method that extracts low-rank, speaker-independent speech representations to enable competitive zero-shot voice conversion and efficient training of timbre-prompted text-to-speech models using minimal target speaker data.
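The paper's exact factorization is not reproduced here, but the general shape of a low-rank, linearly invertible split of frame-level features can be sketched with a truncated SVD: a low-rank component plays the role of shared content, and the residual carries what the projection discards (e.g. speaker cues). Rank, shapes, and the content/residual interpretation below are illustrative assumptions.

```python
import numpy as np

def lowrank_factorize(features, rank):
    """Linearly split features into a low-rank component and a residual;
    the split is exactly invertible since the two parts sum to the input."""
    u, s, vt = np.linalg.svd(features, full_matrices=False)
    content = (u[:, :rank] * s[:rank]) @ vt[:rank]  # low-rank component
    residual = features - content                   # everything discarded
    return content, residual

rng = np.random.default_rng(1)
feats = rng.normal(size=(50, 8))            # 50 frames, 8-dim features
content, residual = lowrank_factorize(feats, rank=3)
assert np.linalg.matrix_rank(content) == 3  # low-rank by construction
assert np.allclose(content + residual, feats)  # invertible decomposition
```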

Henry Li Xinyuan, Zexin Cai, Lin Zhang, Leibny Paola García-Perera, Berrak Sisman, Sanjeev Khudanpur, Nicholas Andrews, Matthew Wiesner · Wed, 11 Ma · eess

VSSFlow: Unifying Video-conditioned Sound and Speech Generation via Joint Learning

VSSFlow introduces a unified flow-matching framework that seamlessly integrates Video-to-Sound and Visual Text-to-Speech generation through a disentangled condition aggregation mechanism, demonstrating that joint learning can surpass specialized state-of-the-art baselines without performance degradation.
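The flow-matching backbone VSSFlow builds on has a simple training recipe in its common linear-interpolant form: sample a time t, interpolate between noise and data, and regress the model toward the constant velocity pointing from noise to data. This is a minimal sketch of that generic objective's targets (not VSSFlow's architecture or conditioning):

```python
import numpy as np

def flow_matching_target(x0, x1, t):
    """Linear-interpolant flow matching: the training input is
    x_t = (1-t)*x0 + t*x1 and the regression target is the
    constant velocity x1 - x0."""
    x_t = (1.0 - t) * x0 + t * x1
    v_target = x1 - x0
    return x_t, v_target

noise = np.array([0.0, 1.0])   # x0 ~ prior sample
data = np.array([2.0, -1.0])   # x1 = training example
x_t, v = flow_matching_target(noise, data, t=0.5)
assert np.allclose(x_t, [1.0, 0.0])
assert np.allclose(v, [2.0, -2.0])
```

Unifying Video-to-Sound and Visual TTS then amounts to sharing this one objective while the condition aggregation mechanism routes the task-specific inputs.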

Xin Cheng, Yuyue Wang, Xihua Wang, Yihan Wu, Kaisi Guan, Yijing Chen, Peng Zhang, Xiaojiang Liu, Meng Cao, Ruihua Song · Wed, 11 Ma · cs.AI

MUGEN: Evaluating and Improving Multi-audio Understanding of Large Audio-Language Models

This paper introduces MUGEN, a comprehensive benchmark revealing that Large Audio-Language Models increasingly struggle with multi-audio understanding as the number of audio inputs grows, and demonstrates that combining training-free strategies like Audio-Permutational Self-Consistency with Chain-of-Thought can significantly improve performance.
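A permutation-based self-consistency strategy of the kind named here can be sketched generically: query the model once per ordering of the audio inputs and majority-vote over the answers, washing out position bias. The toy "model" below is a stand-in invented for illustration, not MUGEN's procedure or any evaluated model.

```python
from collections import Counter
from itertools import permutations

def permutation_self_consistency(audios, answer_fn):
    """Ask the model about every ordering of the audio inputs and
    return the majority answer, reducing sensitivity to input order."""
    votes = Counter(answer_fn(list(p)) for p in permutations(audios))
    return votes.most_common(1)[0][0]

# Toy position-biased "model": answers correctly ("dog") unless the
# dog clip is placed last, in which case it parrots the first clip.
def toy_model(order):
    return "dog" if order[-1] != "dog_bark" else order[0]

# 4 of the 6 orderings yield "dog", so the vote recovers the right answer.
assert permutation_self_consistency(["dog_bark", "rain", "car"], toy_model) == "dog"
```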

Chih-Kai Yang, Yun-Shao Tsai, Yu-Kai Guo, Ping-Le Tsai, Yen-Ting Piao, Hung-Wei Chen, Ting-Lin Hsiao, Yun-Man Hsu, Ke-Han Lu, Hung-yi Lee · Wed, 11 Ma · cs.AI