Rethinking Discrete Speech Representation Tokens for Accent Generation

This paper presents the first systematic investigation into how accent information is encoded in Discrete Speech Representation Tokens (DSRTs), introducing a unified evaluation framework that reveals layer selection is the most critical factor for retaining accents, while ASR supervision significantly diminishes them and naive codebook reduction fails to disentangle accent from phonetic and speaker information.

Jinzuomu Zhong, Yi Wang, Korin Richmond, Peter Bell · Wed, 11 Ma · eess
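Layer-wise probing of this kind is typically done by training a small classifier on each encoder layer's embeddings and comparing accuracies. Below is a minimal, hypothetical sketch (not the paper's method) using a nearest-class-centroid probe on synthetic embeddings, where `layer_a` stands in for an accent-rich layer and `layer_b` for one where accent has been washed out:

```python
import numpy as np

rng = np.random.default_rng(0)

def layer_probe_accuracy(hidden_states, labels):
    """Nearest-class-centroid probe: a crude proxy for how linearly
    separable accent classes are in one layer's embedding space."""
    classes = np.unique(labels)
    centroids = np.stack(
        [hidden_states[labels == c].mean(axis=0) for c in classes]
    )
    # distance from every embedding to every class centroid
    dists = np.linalg.norm(
        hidden_states[:, None, :] - centroids[None, :, :], axis=-1
    )
    return float((classes[dists.argmin(axis=1)] == labels).mean())

# toy data: two "accents", 50 utterance embeddings each, 8 dims
labels = np.repeat([0, 1], 50)
layer_a = rng.normal(0.0, 0.3, (100, 8))
layer_a[labels == 1, 0] += 3.0          # accent clearly encoded in layer A
layer_b = rng.normal(0.0, 0.3, (100, 8))  # accent absent in layer B

acc_a = layer_probe_accuracy(layer_a, labels)
acc_b = layer_probe_accuracy(layer_b, labels)
```

A probe accuracy gap of this sort (high on `layer_a`, near chance on `layer_b`) is what "layer selection matters for retaining accents" would look like in practice.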

Evaluating pretrained speech embedding systems for dysarthria detection across heterogeneous datasets

This paper comprehensively evaluates 17 pretrained speech embedding systems across six heterogeneous datasets for dysarthria detection, revealing significant variability in within-dataset performance and limited cross-dataset generalization, which raises critical questions about the clinical validity of models trained and tested on the same data.

Lovisa Wihlborg, Jemima Goodall, David Wheatley, Jacob J. Webber, Johnny Tam, Christine Weaver, Suvankar Pal, Siddharthan Chandran, Sohan Seth, Oliver Watts, Cassia Valentini-Botinhao · Wed, 11 Ma · eess

Benchmarking Humans and Machines on Complex Multilingual Speech Understanding Tasks

This paper introduces a systematic paradigm for benchmarking humans and machines on multilingual speech understanding tasks, revealing that while speech-based large language models match or exceed human performance in clean, single-speaker conditions, humans significantly outperform them in selectively attending to target speakers within complex, mixed-channel acoustic scenes, particularly in non-native languages.

Sai Samrat Kankanala, Ram Chandra, Sriram Ganapathy · Wed, 11 Ma · eess

Fast-Converging Distributed Signal Estimation in Topology-Unconstrained Wireless Acoustic Sensor Networks

This paper proposes TI-DANSE+, an improved distributed signal estimation algorithm for topology-unconstrained wireless acoustic sensor networks that accelerates convergence by utilizing partial in-network sums and a tree-pruning strategy, while maintaining robustness to link failures and reducing communication bandwidth compared to existing methods.

Paul Didier, Toon van Waterschoot, Simon Doclo, Jörg Bitzer, Marc Moonen · Wed, 11 Ma · eess

Can You Hear, Localize, and Segment Continually? An Exemplar-Free Continual Learning Benchmark for Audio-Visual Segmentation

This paper introduces the first exemplar-free continual learning benchmark for Audio-Visual Segmentation (AVS) and proposes the ATLAS baseline, which utilizes audio-guided pre-fusion conditioning and Low-Rank Anchoring to effectively mitigate catastrophic forgetting in dynamic, evolving environments.

Siddeshwar Raghavan, Gautham Vinod, Bruce Coburn, Fengqing Zhu · Wed, 11 Ma · eess

How Does Contrastive Decoding Enhance Large Audio Language Models?

This paper systematically evaluates four Contrastive Decoding strategies across diverse Large Audio Language Models, identifying Audio-Aware and Audio Contrastive Decoding as most effective while introducing a Transition Matrix framework to demonstrate that these methods successfully rectify specific error patterns like false audio absence claims but fail to correct flawed reasoning or confident misassertions.

Tzu-Quan Lin, Wei-Ping Huang, Yi-Cheng Lin, Hung-yi Lee · Wed, 11 Ma · cs.CL
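The core idea behind audio-aware contrastive decoding is to contrast the model's next-token distribution with audio against the distribution without it, amplifying evidence the audio actually supports. A generic sketch (the weighting scheme and toy logits are illustrative assumptions, not the paper's exact formulation):

```python
import numpy as np

def contrastive_decode(logits_with_audio, logits_without_audio, alpha=0.5):
    """Amplify what the audio-conditioned distribution prefers over the
    audio-free (hallucination-prone) one before picking the next token."""
    return (1 + alpha) * logits_with_audio - alpha * logits_without_audio

# toy next-token logits over two candidates:
# token 0 is supported by the audio, token 1 is a text-prior bias
with_audio = np.array([2.0, 1.0])
without_audio = np.array([0.5, 1.5])

adjusted = contrastive_decode(with_audio, without_audio)
next_token = int(adjusted.argmax())
```

Here the contrast pushes the decision toward the audio-supported token; the paper's Transition Matrix analysis asks, error type by error type, when this kind of rectification actually happens.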

Distributed Multichannel Wiener Filtering for Wireless Acoustic Sensor Networks

This paper proposes the distributed multichannel Wiener filter (dMWF), a non-iterative algorithm for wireless acoustic sensor networks that achieves optimal, centralized-level speech estimation performance with reduced communication bandwidth, even when nodes observe different sets of sources, thereby outperforming existing iterative solutions like DANSE.

Paul Didier, Toon van Waterschoot, Simon Doclo, Jörg Bitzer, Pourya Behmandpoor, Henri Gode, Marc Moonen · Wed, 11 Ma · eess
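For reference, the centralized multichannel Wiener filter that dMWF targets in a distributed, non-iterative way can be written as W = Ryy^{-1} (Ryy - Rnn) e1, estimating the speech component at a reference microphone from noisy and noise-only covariances. A minimal numpy sketch under an illustrative rank-1 speech model (the covariances here are made up, not the paper's setup):

```python
import numpy as np

def multichannel_wiener_filter(Ryy, Rnn):
    """Centralized MWF: W = Ryy^{-1} (Ryy - Rnn) e1, i.e. the MMSE
    estimator of the speech component in reference channel 0."""
    e1 = np.zeros(Ryy.shape[0])
    e1[0] = 1.0
    return np.linalg.solve(Ryy, (Ryy - Rnn) @ e1)

# illustrative rank-1 speech covariance (single source, 2 mics)
a = np.array([1.0, 0.9])          # assumed acoustic transfer vector
Rss = np.outer(a, a)

# same speech, two noise levels: Ryy = Rss + Rnn
W_low = multichannel_wiener_filter(Rss + 0.01 * np.eye(2), 0.01 * np.eye(2))
W_high = multichannel_wiener_filter(Rss + 1.00 * np.eye(2), 1.00 * np.eye(2))
```

As expected of a Wiener solution, the filter shrinks as the noise floor rises (`W_high` has smaller norm than `W_low`); dMWF's contribution is reaching this centralized solution while each node transmits only fused, low-dimensional signals.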

A Semi-spontaneous Dutch Speech Dataset for Speech Enhancement and Speech Recognition

This paper introduces DRES, a 1.5-hour semi-spontaneous Dutch speech dataset recorded in noisy public indoor environments, and evaluates its utility by demonstrating that while several state-of-the-art ASR models achieve competitive performance, modern single-channel speech enhancement algorithms fail to improve recognition accuracy in these realistic conditions.

Dimme de Groot, Yuanyuan Zhang, Jorge Martinez, Odette Scharenborg · Wed, 11 Ma · eess