Targeted Speaker Poisoning Framework in Zero-Shot Text-to-Speech

This paper introduces a Speech Generation Speaker Poisoning (SGSP) framework that addresses privacy risks in zero-shot text-to-speech by modifying trained models to prevent generation of specific speaker identities while preserving utility for all others; it demonstrates effective protection for up to 15 speakers but reveals scalability challenges with larger speaker sets due to identity overlap.

Thanapat Trachu, Thanathai Lertpetchpun, Sai Praneeth Karimireddy, Shrikanth Narayanan · Tue, 10 Ma · cs

Evaluating Parkinson's Disease Detection in Anonymized Speech: A Performance and Acoustic Analysis

This paper evaluates the trade-off between privacy and Parkinson's disease detection in anonymized speech, showing that STT-TTS anonymization severely degrades diagnostic performance by erasing prosodic cues, whereas kNN-VC preserves macro-prosodic features and maintains high detection accuracy with only a minor performance drop.

Carlos Franzreb, Francisco Teixeira, Ben Luks, Sebastian Möller, Alberto Abad · Tue, 10 Ma · cs

Seeing the Context: Rich Visual Context-Aware Speech Recognition via Multimodal Reasoning

This paper introduces VASR, a multimodal reasoning framework for Context-Aware Visual Speech Recognition (CAVSR) that leverages an Audio-Visual Chain-of-Thought (AV-CoT) to explicitly ground acoustic signals with rich visual context like scenes and on-screen text, thereby overcoming single-modality dominance and achieving state-of-the-art performance.

Wenjie Tian, Mingchen Shao, Bingshen Mu, Xuelong Geng, Chengyou Wang, Yujie Liao, Zhixian Zhao, Ziyu Zhang, Jingbin Hu, Mengqi Wei, Lei Xie · Tue, 10 Ma · cs

Toward Multimodal Industrial Fault Analysis: A Single-Speed Chain Conveyor Dataset with Audio and Vibration Signals

This paper introduces a comprehensive multimodal dataset comprising audio and vibration signals from a single-speed chain conveyor system, designed to benchmark robust industrial fault detection and classification under diverse operating conditions and noise levels through standardized evaluation protocols and baseline models.

Zhang Chen, Yucong Zhang, Xiaoxiao Miao, Ming Li · Tue, 10 Ma · cs

Multi-Domain Audio Question Answering Benchmark Toward Acoustic Content Reasoning

This paper introduces Task 5 of the DCASE 2025 Challenge, a multi-domain Audio Question Answering benchmark designed to evaluate and advance the acoustic reasoning capabilities of audio-language models across diverse scenarios including bioacoustics, temporal soundscapes, and complex real-world clips.

Chao-Han Huck Yang, Sreyan Ghosh, Qing Wang, Jaeyeon Kim, Hengyi Hong, Sonal Kumar, Guirui Zhong, Zhifeng Kong, S Sakshi, Vaibhavi Lokegaonkar, Oriol Nieto, Ramani Duraiswami, Dinesh Manocha, Gunhee Kim, Jun Du, Rafael Valle, Bryan Catanzaro · Tue, 10 Ma · cs.CL

Nwāchā Munā: A Devanagari Speech Corpus and Proximal Transfer Benchmark for Nepal Bhasha ASR

This paper introduces Nwāchā Munā, the first manually transcribed Devanagari speech corpus for the endangered Nepal Bhasha, and demonstrates that proximal cross-lingual transfer from Nepali achieves automatic speech recognition performance comparable to large multilingual models while being significantly more computationally efficient.

Rishikesh Kumar Sharma, Safal Narshing Shrestha, Jenny Poudel, Rupak Tiwari, Arju Shrestha, Rupak Raj Ghimire, Bal Krishna Bal · Tue, 10 Ma · cs.CL

BemaGANv2: Discriminator Combination Strategies for GAN-based Vocoders in Long-Term Audio Generation

BemaGANv2 is an advanced GAN-based vocoder that enhances long-term audio generation for Text-to-Music and Text-to-Audio applications by integrating Anti-aliased Multi-Periodicity composition modules in the generator and systematically evaluating novel discriminator combination strategies, including the Multi-Envelope Discriminator, to achieve high-fidelity and temporally coherent results.

Taesoo Park, Mungwi Jeong, Mingyu Park, Narae Kim, Junyoung Kim, Mujung Kim, Jisang Yoo, Hoyun Lee, Sanghoon Kim, Soonchul Kwon · Tue, 10 Ma · cs.LG

Analysis-Driven Procedural Generation of an Engine Sound Dataset with Embedded Control Annotations

This paper introduces an analysis-driven framework that generates a publicly available, 19-hour procedural engine sound dataset with sample-accurate RPM and torque annotations by extracting harmonic structures from real recordings to drive a parametric synthesizer, thereby addressing the scarcity of clean, standardized audio data for automotive sound design and machine learning applications.

Robin Doerfler, Lonce Wyse · Tue, 10 Ma · cs.LG

Towards Objective Gastrointestinal Auscultation: Automated Segmentation and Annotation of Bowel Sound Patterns

This study presents an automated pipeline using a wearable SonicGuard sensor and a pretrained Audio Spectrogram Transformer to accurately segment and classify bowel sounds, significantly reducing manual labeling time while providing clinicians with an objective, quantitative tool for assessing gastrointestinal function.

Zahra Mansour, Verena Uslar, Dirk Weyhe, Danilo Hollosi, Nils Strodthoff · Tue, 10 Ma · cs.LG