Targeted Speaker Poisoning Framework in Zero-Shot Text-to-Speech

This paper introduces a Speech Generation Speaker Poisoning (SGSP) framework that addresses privacy risks in zero-shot text-to-speech by modifying trained models to prevent generation of specific speaker identities while preserving utility for all others; it demonstrates effective protection for up to 15 speakers but reveals scalability challenges with larger speaker sets due to identity overlap.

Thanapat Trachu, Thanathai Lertpetchpun, Sai Praneeth Karimireddy, Shrikanth Narayanan · Tue, 10 Ma · cs

Evaluating Parkinson's Disease Detection in Anonymized Speech: A Performance and Acoustic Analysis

This paper evaluates the trade-off between privacy and Parkinson's disease detection in anonymized speech, showing that STT-TTS anonymization severely degrades diagnostic performance by erasing prosodic cues, whereas kNN-VC preserves macro-prosodic features and maintains high detection accuracy with only a minor performance drop.

Carlos Franzreb, Francisco Teixeira, Ben Luks, Sebastian Möller, Alberto Abad · Tue, 10 Ma · cs

Seeing the Context: Rich Visual Context-Aware Speech Recognition via Multimodal Reasoning

This paper introduces VASR, a multimodal reasoning framework for Context-Aware Visual Speech Recognition (CAVSR) that leverages an Audio-Visual Chain-of-Thought (AV-CoT) to explicitly ground acoustic signals with rich visual context like scenes and on-screen text, thereby overcoming single-modality dominance and achieving state-of-the-art performance.

Wenjie Tian, Mingchen Shao, Bingshen Mu, Xuelong Geng, Chengyou Wang, Yujie Liao, Zhixian Zhao, Ziyu Zhang, Jingbin Hu, Mengqi Wei, Lei Xie · Tue, 10 Ma · cs

Toward Multimodal Industrial Fault Analysis: A Single-Speed Chain Conveyor Dataset with Audio and Vibration Signals

This paper introduces a comprehensive multimodal dataset comprising audio and vibration signals from a single-speed chain conveyor system, designed to benchmark robust industrial fault detection and classification under diverse operating conditions and noise levels through standardized evaluation protocols and baseline models.

Zhang Chen, Yucong Zhang, Xiaoxiao Miao, Ming Li · Tue, 10 Ma · cs

Multi-Domain Audio Question Answering Benchmark Toward Acoustic Content Reasoning

This paper introduces Task 5 of the DCASE 2025 Challenge, a multi-domain Audio Question Answering benchmark designed to evaluate and advance the acoustic reasoning capabilities of audio-language models across diverse scenarios including bioacoustics, temporal soundscapes, and complex real-world clips.

Chao-Han Huck Yang, Sreyan Ghosh, Qing Wang, Jaeyeon Kim, Hengyi Hong, Sonal Kumar, Guirui Zhong, Zhifeng Kong, S Sakshi, Vaibhavi Lokegaonkar, Oriol Nieto, Ramani Duraiswami, Dinesh Manocha, Gunhee Kim, Jun Du, Rafael Valle, Bryan Catanzaro · Tue, 10 Ma · cs.CL

Nwāchā Munā: A Devanagari Speech Corpus and Proximal Transfer Benchmark for Nepal Bhasha ASR

This paper introduces Nwāchā Munā, the first manually transcribed Devanagari speech corpus for the endangered Nepal Bhasha, and demonstrates that proximal cross-lingual transfer from Nepali achieves automatic speech recognition performance comparable to large multilingual models while being significantly more computationally efficient.

Rishikesh Kumar Sharma, Safal Narshing Shrestha, Jenny Poudel, Rupak Tiwari, Arju Shrestha, Rupak Raj Ghimire, Bal Krishna Bal · Tue, 10 Ma · cs.CL

BemaGANv2: Discriminator Combination Strategies for GAN-based Vocoders in Long-Term Audio Generation

BemaGANv2 is an advanced GAN-based vocoder that enhances long-term audio generation for Text-to-Music and Text-to-Audio applications by integrating Anti-aliased Multi-Periodicity composition modules in the generator and systematically evaluating novel discriminator combination strategies, including the Multi-Envelope Discriminator, to achieve high-fidelity and temporally coherent results.

Taesoo Park, Mungwi Jeong, Mingyu Park, Narae Kim, Junyoung Kim, Mujung Kim, Jisang Yoo, Hoyun Lee, Sanghoon Kim, Soonchul Kwon · Tue, 10 Ma · cs.LG

Analysis-Driven Procedural Generation of an Engine Sound Dataset with Embedded Control Annotations

This paper introduces an analysis-driven framework that generates a publicly available, 19-hour procedural engine sound dataset with sample-accurate RPM and torque annotations by extracting harmonic structures from real recordings to drive a parametric synthesizer, thereby addressing the scarcity of clean, standardized audio data for automotive sound design and machine learning applications.

Robin Doerfler, Lonce Wyse · Tue, 10 Ma · cs.LG

Towards Objective Gastrointestinal Auscultation: Automated Segmentation and Annotation of Bowel Sound Patterns

This study presents an automated pipeline using a wearable SonicGuard sensor and a pretrained Audio Spectrogram Transformer to accurately segment and classify bowel sounds, significantly reducing manual labeling time while providing clinicians with an objective, quantitative tool for assessing gastrointestinal function.

Zahra Mansour, Verena Uslar, Dirk Weyhe, Danilo Hollosi, Nils Strodthoff · Tue, 10 Ma · cs.LG