Rethinking Discrete Speech Representation Tokens for Accent Generation

This paper presents the first systematic investigation into how accent information is encoded in Discrete Speech Representation Tokens (DSRTs), introducing a unified evaluation framework that reveals layer selection is the most critical factor for retaining accents, while ASR supervision significantly diminishes them and naive codebook reduction fails to disentangle accent from phonetic and speaker information.

Jinzuomu Zhong, Yi Wang, Korin Richmond, Peter Bell · Wed, 11 Ma · eess
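Layer-wise probing of this kind is typically done by training a small classifier on each encoder layer's embeddings and comparing accuracies. Below is a minimal, hypothetical sketch (not the paper's method) using a nearest-class-centroid probe on synthetic embeddings, where `layer_a` stands in for an accent-rich layer and `layer_b` for one where accent has been washed out:

```python
import numpy as np

rng = np.random.default_rng(0)

def layer_probe_accuracy(hidden_states, labels):
    """Nearest-class-centroid probe: a crude proxy for how linearly
    separable accent classes are in one layer's embedding space."""
    classes = np.unique(labels)
    centroids = np.stack(
        [hidden_states[labels == c].mean(axis=0) for c in classes]
    )
    # distance from every embedding to every class centroid
    dists = np.linalg.norm(
        hidden_states[:, None, :] - centroids[None, :, :], axis=-1
    )
    return float((classes[dists.argmin(axis=1)] == labels).mean())

# toy data: two "accents", 50 utterance embeddings each, 8 dims
labels = np.repeat([0, 1], 50)
layer_a = rng.normal(0.0, 0.3, (100, 8))
layer_a[labels == 1, 0] += 3.0          # accent clearly encoded in layer A
layer_b = rng.normal(0.0, 0.3, (100, 8))  # accent absent in layer B

acc_a = layer_probe_accuracy(layer_a, labels)
acc_b = layer_probe_accuracy(layer_b, labels)
```

A probe accuracy gap of this sort (high on `layer_a`, near chance on `layer_b`) is what "layer selection matters for retaining accents" would look like in practice.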

Evaluating pretrained speech embedding systems for dysarthria detection across heterogeneous datasets

This paper comprehensively evaluates 17 pretrained speech embedding systems across six heterogeneous datasets for dysarthria detection, revealing significant variability in within-dataset performance and limited cross-dataset generalization, which raises critical questions about the clinical validity of models trained and tested on the same data.

Lovisa Wihlborg, Jemima Goodall, David Wheatley, Jacob J. Webber, Johnny Tam, Christine Weaver, Suvankar Pal, Siddharthan Chandran, Sohan Seth, Oliver Watts, Cassia Valentini-Botinhao · Wed, 11 Ma · eess

Benchmarking Humans and Machines on Complex Multilingual Speech Understanding Tasks

This paper introduces a systematic paradigm for benchmarking humans and machines on multilingual speech understanding tasks, revealing that while speech-based large language models match or exceed human performance in clean, single-speaker conditions, humans significantly outperform them in selectively attending to target speakers within complex, mixed-channel acoustic scenes, particularly in non-native languages.

Sai Samrat Kankanala, Ram Chandra, Sriram Ganapathy · Wed, 11 Ma · eess

Fast-Converging Distributed Signal Estimation in Topology-Unconstrained Wireless Acoustic Sensor Networks

This paper proposes TI-DANSE+, an improved distributed signal estimation algorithm for topology-unconstrained wireless acoustic sensor networks that accelerates convergence by utilizing partial in-network sums and a tree-pruning strategy, while maintaining robustness to link failures and reducing communication bandwidth compared to existing methods.

Paul Didier, Toon van Waterschoot, Simon Doclo, Jörg Bitzer, Marc Moonen · Wed, 11 Ma · eess

Can You Hear, Localize, and Segment Continually? An Exemplar-Free Continual Learning Benchmark for Audio-Visual Segmentation

This paper introduces the first exemplar-free continual learning benchmark for Audio-Visual Segmentation (AVS) and proposes the ATLAS baseline, which utilizes audio-guided pre-fusion conditioning and Low-Rank Anchoring to effectively mitigate catastrophic forgetting in dynamic, evolving environments.

Siddeshwar Raghavan, Gautham Vinod, Bruce Coburn, Fengqing Zhu · Wed, 11 Ma · eess

How Does Contrastive Decoding Enhance Large Audio Language Models?

This paper systematically evaluates four Contrastive Decoding strategies across diverse Large Audio Language Models, identifying Audio-Aware and Audio Contrastive Decoding as most effective while introducing a Transition Matrix framework to demonstrate that these methods successfully rectify specific error patterns like false audio absence claims but fail to correct flawed reasoning or confident misassertions.

Tzu-Quan Lin, Wei-Ping Huang, Yi-Cheng Lin, Hung-yi Lee · Wed, 11 Ma · cs.CL
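The core idea behind audio-aware contrastive decoding is to contrast the model's next-token distribution with audio against the distribution without it, amplifying evidence the audio actually supports. A generic sketch (the weighting scheme and toy logits are illustrative assumptions, not the paper's exact formulation):

```python
import numpy as np

def contrastive_decode(logits_with_audio, logits_without_audio, alpha=0.5):
    """Amplify what the audio-conditioned distribution prefers over the
    audio-free (hallucination-prone) one before picking the next token."""
    return (1 + alpha) * logits_with_audio - alpha * logits_without_audio

# toy next-token logits over two candidates:
# token 0 is supported by the audio, token 1 is a text-prior bias
with_audio = np.array([2.0, 1.0])
without_audio = np.array([0.5, 1.5])

adjusted = contrastive_decode(with_audio, without_audio)
next_token = int(adjusted.argmax())
```

Here the contrast pushes the decision toward the audio-supported token; the paper's Transition Matrix analysis asks, error type by error type, when this kind of rectification actually happens.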

Distributed Multichannel Wiener Filtering for Wireless Acoustic Sensor Networks

This paper proposes the distributed multichannel Wiener filter (dMWF), a non-iterative algorithm for wireless acoustic sensor networks that achieves optimal, centralized-level speech estimation performance with reduced communication bandwidth, even when nodes observe different sets of sources, thereby outperforming existing iterative solutions like DANSE.

Paul Didier, Toon van Waterschoot, Simon Doclo, Jörg Bitzer, Pourya Behmandpoor, Henri Gode, Marc Moonen · Wed, 11 Ma · eess
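For reference, the centralized multichannel Wiener filter that dMWF targets in a distributed, non-iterative way can be written as W = Ryy^{-1} (Ryy - Rnn) e1, estimating the speech component at a reference microphone from noisy and noise-only covariances. A minimal numpy sketch under an illustrative rank-1 speech model (the covariances here are made up, not the paper's setup):

```python
import numpy as np

def multichannel_wiener_filter(Ryy, Rnn):
    """Centralized MWF: W = Ryy^{-1} (Ryy - Rnn) e1, i.e. the MMSE
    estimator of the speech component in reference channel 0."""
    e1 = np.zeros(Ryy.shape[0])
    e1[0] = 1.0
    return np.linalg.solve(Ryy, (Ryy - Rnn) @ e1)

# illustrative rank-1 speech covariance (single source, 2 mics)
a = np.array([1.0, 0.9])          # assumed acoustic transfer vector
Rss = np.outer(a, a)

# same speech, two noise levels: Ryy = Rss + Rnn
W_low = multichannel_wiener_filter(Rss + 0.01 * np.eye(2), 0.01 * np.eye(2))
W_high = multichannel_wiener_filter(Rss + 1.00 * np.eye(2), 1.00 * np.eye(2))
```

As expected of a Wiener solution, the filter shrinks as the noise floor rises (`W_high` has smaller norm than `W_low`); dMWF's contribution is reaching this centralized solution while each node transmits only fused, low-dimensional signals.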

A Semi-spontaneous Dutch Speech Dataset for Speech Enhancement and Speech Recognition

This paper introduces DRES, a 1.5-hour semi-spontaneous Dutch speech dataset recorded in noisy public indoor environments, and evaluates its utility by demonstrating that while several state-of-the-art ASR models achieve competitive performance, modern single-channel speech enhancement algorithms fail to improve recognition accuracy in these realistic conditions.

Dimme de Groot, Yuanyuan Zhang, Jorge Martinez, Odette Scharenborg · Wed, 11 Ma · eess