Geo-ATBench: A Benchmark for Geospatial Audio Tagging with Geospatial Semantic Context

This paper introduces Geo-ATBench, a new benchmark and the Geo-AT task that leverage geospatial semantic context to resolve acoustic ambiguities in multi-label audio tagging, demonstrating through the GeoFusion-AT framework that incorporating location-based priors significantly improves recognition performance and aligns with human judgment.

Yuanbo Hou, Yanru Wu, Qiaoqiao Ren, Shengchen Li, Stephen Roberts, Dick Botteldooren · Thu, 12 Ma · eess
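
As a concrete illustration of the fusion idea above, here is a minimal sketch of location-conditioned multi-label tagging in the spirit of GeoFusion-AT. The concatenation-based late fusion, the module names, and all dimensions are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class GeoFusionTagger(nn.Module):
    """Toy late-fusion tagger: audio features plus a projected location prior."""
    def __init__(self, audio_dim=768, geo_dim=128, n_tags=527):
        super().__init__()
        # Project a geospatial context vector (e.g., an embedding of the
        # recording site's semantic description) into the audio feature space.
        self.geo_proj = nn.Linear(geo_dim, audio_dim)
        self.classifier = nn.Linear(audio_dim * 2, n_tags)

    def forward(self, audio_emb, geo_emb):
        # Concatenate audio features with the location prior, then emit
        # independent per-tag logits (train with BCEWithLogitsLoss).
        fused = torch.cat([audio_emb, self.geo_proj(geo_emb)], dim=-1)
        return self.classifier(fused)

logits = GeoFusionTagger()(torch.randn(4, 768), torch.randn(4, 128))  # (4, 527)
```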

FireRedASR2S: A State-of-the-Art Industrial-Grade All-in-One Automatic Speech Recognition System

The paper introduces FireRedASR2S, a state-of-the-art industrial-grade all-in-one automatic speech recognition system that unifies high-performance modules for speech transcription, voice activity detection, language identification, and punctuation prediction, achieving superior results across Mandarin, Chinese dialects, and English benchmarks compared to existing solutions.

Kaituo Xu, Yan Jia, Kai Huang, Junjie Chen, Wenpeng Li, Kun Liu, Feng-Long Xie, Xu Tang, Yao Hu · Thu, 12 Ma · eess

Efficient Emotion and Speaker Adaptation in LLM-Based TTS via Characteristic-Specific Partial Fine-Tuning

The paper proposes CSP-FT, a characteristic-specific partial fine-tuning strategy that selectively updates only the most and least relevant layers of LLM-based TTS models to achieve superior emotion and speaker adaptation with significantly faster training and reduced catastrophic forgetting compared to full fine-tuning.

Tianrui Wang, Meng Ge, Cheng Gong, Chunyu Qiang, Haoyu Wang, Zikang Huang, Yu Jiang, Ye Ni, Yuheng Lu, Xiaobao Wang, Engsiong Chng, Xie Chen, Longbiao Wang, Jianwu Dang · Mon, 09 Ma · cs
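
The layer-selection idea lends itself to a short sketch: freeze the whole backbone, then re-enable gradients only for the layers ranked most and least relevant to the target characteristic. The relevance scores and the top/bottom-k criterion below are placeholders; the paper's selection rule may differ.

```python
import torch.nn as nn

def partial_finetune(layers: nn.ModuleList, relevance: list[float], k: int = 2):
    """Freeze everything, then unfreeze the k most and k least relevant layers."""
    ranked = sorted(range(len(layers)), key=lambda i: relevance[i])
    selected = set(ranked[:k] + ranked[-k:])   # least- and most-relevant layers
    for i, layer in enumerate(layers):
        for p in layer.parameters():
            p.requires_grad = i in selected
    return selected

# Toy usage: six layers, unfreeze the single most and least relevant one.
layers = nn.ModuleList(nn.Linear(8, 8) for _ in range(6))
print(partial_finetune(layers, relevance=[0.9, 0.1, 0.5, 0.8, 0.2, 0.7], k=1))
```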

Which Data Matter? Embedding-Based Data Selection for Speech Recognition

This paper proposes an embedding-based data selection strategy that leverages speaker, phonetic, and semantic features to identify a small, high-quality subset of in-the-wild speech data, demonstrating that training on just 5% of the original 100k-hour dataset can significantly outperform full-dataset training for domain-specific speech recognition tasks.

Zakaria Aldeneh, Skyler Seto, Maureen de Seyssel, Jie Chi, Zijin Gu, Takuya Higuchi, Jee-weon Jung, Shinji Watanabe, David Grangier, Barry-John Theobald, Tatiana Likhomanenko · Mon, 09 Ma · cs
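
A hedged sketch of the selection step: rank utterances by cosine similarity between their (pre-extracted) embeddings and a target-domain centroid, and keep the top 5%. The similarity-to-centroid criterion is an assumption for illustration; the paper combines speaker, phonetic, and semantic features.

```python
import numpy as np

def select_subset(embeddings: np.ndarray, target_embeddings: np.ndarray,
                  fraction: float = 0.05) -> np.ndarray:
    """Return indices of the clips most similar to the target domain."""
    centroid = target_embeddings.mean(axis=0)
    centroid /= np.linalg.norm(centroid)
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    scores = unit @ centroid                    # cosine similarity per clip
    k = max(1, int(len(scores) * fraction))
    return np.argsort(scores)[-k:]              # top-5% subset
```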

Continual Adaptation for Pacific Indigenous Speech Recognition

This paper presents an empirical study on adapting speech foundation models to low-resource Pacific Indigenous languages, revealing that while strategies like Low-Rank Adaptation offer initial success, they ultimately struggle with catastrophic forgetting and internal representational drift during sequential learning, highlighting the urgent need for robust adaptation frameworks that balance plasticity and stability.

Yang Xiao, Aso Mahmudi, Nick Thieberger, Eliathamby Ambikairajah, Eun-Jung Holden, Ting Dang · Mon, 09 Ma · cs.CL
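
For readers unfamiliar with the adaptation strategy under study, a minimal LoRA layer looks like the sketch below: a frozen base weight receives a trainable low-rank update BA. The rank and scaling values are generic defaults, not the paper's configuration.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen nn.Linear plus a trainable low-rank residual (LoRA)."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False             # foundation-model weight stays fixed
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))  # starts at zero
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.t() @ self.B.t())
```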

LMU-Based Sequential Learning and Posterior Ensemble Fusion for Cross-Domain Infant Cry Classification

This paper proposes a compact acoustic framework that combines multi-branch CNN feature extraction with an efficient Legendre Memory Unit (LMU) for temporal modeling and a calibrated posterior ensemble fusion strategy to achieve robust, real-time cross-domain infant cry classification despite limited annotations and strong domain shifts.

Niloofar Jazaeri, Hilmi R. Dajani, Marco Janeczek, Martin Bouchard · Mon, 09 Ma · cs.LG
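
The fusion stage can be sketched as temperature-calibrated posterior averaging across branches; the temperatures and branch weights below are placeholders one would fit on held-out data, not the paper's calibration procedure.

```python
import numpy as np

def softmax(z, t=1.0):
    z = np.asarray(z) / t                       # temperature scaling
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def fuse_posteriors(branch_logits, temperatures, weights):
    """Weighted average of temperature-scaled branch posteriors."""
    posts = [w * softmax(z, t)
             for z, t, w in zip(branch_logits, temperatures, weights)]
    fused = sum(posts) / sum(weights)
    return fused.argmax(axis=-1)                # final cry-class decision
```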

Koopman Regularized Deep Speech Disentanglement for Speaker Verification

This paper introduces the Deep Koopman Speech Disentanglement Autoencoder (DKSD-AE), a scalable and efficient architecture that leverages Koopman operators and instance normalization to effectively disentangle speaker identity from linguistic content for robust speaker verification without relying on textual supervision or large pretrained models.

Nikos Chazaridis, Mohammad Belal, Rafael Mestre, Timothy J. Norman, Christine Evers · Mon, 09 Ma · cs.LG
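
A rough sketch of the Koopman ingredient: encode a feature sequence, learn a single linear operator K that advances latents one step, and penalize its one-step prediction error alongside reconstruction. Dimensions and the equal loss weighting are assumptions; the full DKSD-AE (instance normalization, speaker/content split) is omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class KoopmanAE(nn.Module):
    def __init__(self, feat_dim=80, latent_dim=64):
        super().__init__()
        self.enc = nn.Linear(feat_dim, latent_dim)
        self.dec = nn.Linear(latent_dim, feat_dim)
        self.K = nn.Linear(latent_dim, latent_dim, bias=False)  # Koopman operator

    def forward(self, x):                       # x: (batch, time, feat)
        z = self.enc(x)
        recon_loss = F.mse_loss(self.dec(z), x)
        dyn_loss = F.mse_loss(self.K(z[:, :-1]), z[:, 1:])  # linear latent dynamics
        return recon_loss + dyn_loss

loss = KoopmanAE()(torch.randn(2, 50, 80))
```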

Purification Before Fusion: Toward Mask-Free Speech Enhancement for Robust Audio-Visual Speech Recognition

This paper proposes a novel end-to-end audio-visual speech recognition framework that integrates speech enhancement via a Conformer-based bottleneck fusion module to implicitly refine noisy audio features without explicit mask generation, thereby preserving semantic integrity and outperforming existing mask-based methods on the LRS3 benchmark under noisy conditions.

Linzhi Wu, Xingyu Zhang, Hao Yuan, Yakun Zhang, Changyan Zheng, Liang Xie, Tiejun Liu, Erwei Yin · Mon, 09 Ma · cs.AI
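
The mask-free refinement can be pictured as follows: concatenated audio-visual features pass through a narrow bottleneck that implicitly denoises the audio stream, with a plain TransformerEncoderLayer standing in for the paper's Conformer block. All sizes are illustrative.

```python
import torch
import torch.nn as nn

class BottleneckFusion(nn.Module):
    def __init__(self, dim=256, bottleneck=64):
        super().__init__()
        self.mix = nn.TransformerEncoderLayer(d_model=2 * dim, nhead=4,
                                              batch_first=True)
        self.squeeze = nn.Linear(2 * dim, bottleneck)  # implicit enhancement,
        self.expand = nn.Linear(bottleneck, dim)       # no explicit mask

    def forward(self, audio, visual):           # both (batch, time, dim)
        fused = self.mix(torch.cat([audio, visual], dim=-1))
        return self.expand(self.squeeze(fused))  # refined audio-like features
```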

RAMoEA-QA: Hierarchical Specialization for Robust Respiratory Audio Question Answering

RAMoEA-QA is a hierarchically routed generative model that employs a two-stage conditional specialization mechanism—combining an Audio Mixture-of-Experts for acoustic encoding and a Language Mixture-of-Adapters for query intent—to achieve state-of-the-art robustness and accuracy in respiratory audio question answering across diverse devices, environments, and task shifts.

Gaia A. Bertolino, Yuwei Zhang, Tong Xia, Domenico Talia, Cecilia Mascolo · Mon, 09 Ma · cs.AI
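
A toy version of the two-stage routing: one gate mixes audio experts from pooled acoustic features, a second mixes language adapters from the query embedding. The soft routing, expert count, and dimensions are assumptions; the paper's conditional specialization mechanism is richer.

```python
import torch
import torch.nn as nn

class Router(nn.Module):
    """Softly mix n expert sub-networks based on the input itself."""
    def __init__(self, dim=512, n_experts=4):
        super().__init__()
        self.gate = nn.Linear(dim, n_experts)
        self.experts = nn.ModuleList(nn.Linear(dim, dim) for _ in range(n_experts))

    def forward(self, x):                        # x: (batch, dim), pooled features
        w = self.gate(x).softmax(-1)             # (batch, n_experts)
        outs = torch.stack([e(x) for e in self.experts], dim=1)
        return (w.unsqueeze(-1) * outs).sum(dim=1)

audio_moe = Router()   # stage 1: acoustic encoding experts
lang_moa = Router()    # stage 2: query-intent adapters
```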

Prosodic Boundary-Aware Streaming Generation for LLM-Based TTS with Streaming Text Input

This paper proposes a prosodic-boundary-aware post-training strategy for LLM-based TTS that enables natural streaming generation with incremental text input by learning early stopping at content boundaries and utilizing a sliding-window prompt to prevent long-form collapse, significantly outperforming existing baselines in both short and long-form scenarios.

Changsong Liu, Tianrui Wang, Ye Ni, Yizhou Peng, Eng Siong Chng · Mon, 09 Ma · cs.AI
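
The sliding-window prompt reduces to a very small idea: condition each streaming step only on the most recent text and generated audio tokens so that the history never grows unboundedly. The window lengths below are invented; boundary-aware early stopping (the paper's other component) decides when the prompt is re-issued.

```python
def sliding_window_prompt(text_tokens: list, audio_tokens: list,
                          text_win: int = 64, audio_win: int = 256) -> list:
    """Keep only the tail of each stream as conditioning context."""
    return text_tokens[-text_win:] + audio_tokens[-audio_win:]
```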

Whisper-CD: Accurate Long-Form Speech Recognition using Multi-Negative Contrastive Decoding

Whisper-CD is a training-free, inference-time contrastive decoding framework that mitigates hallucinations and repetition in long-form speech recognition by contrasting clean audio logits against a unified objective derived from multiple acoustically motivated negative perturbations, thereby significantly reducing word error rates and improving generation throughput without requiring model retraining.

Hoseong Ahn, Jeongyun Chae, Yoonji Park, Kyuhong Shim · Mon, 09 Ma · cs.AI
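
The decoding rule lends itself to a two-line sketch: at each step, down-weight tokens that are also likely under acoustically perturbed "negative" inputs. Averaging the negatives and the penalty weight lambda are assumptions; Whisper-CD's unified objective may aggregate differently.

```python
import torch

def contrastive_step(clean_logits: torch.Tensor,
                     negative_logits: list[torch.Tensor],
                     lam: float = 0.5) -> torch.Tensor:
    """Return contrast-adjusted scores; decode greedily with .argmax(-1)."""
    neg = torch.stack(negative_logits).log_softmax(-1).mean(0)  # unified negative
    return clean_logits.log_softmax(-1) - lam * neg

vocab = 512  # toy vocabulary size for illustration
next_token = contrastive_step(torch.randn(vocab),
                              [torch.randn(vocab) for _ in range(3)]).argmax(-1)
```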

Omni-C: Compressing Heterogeneous Modalities into a Single Dense Encoder

The paper introduces Omni-C, a single dense Transformer encoder that compresses heterogeneous modalities (text, audio, and image) into shared representations via unimodal contrastive pretraining, thereby eliminating the parameter overhead and routing complexity of Mixture-of-Experts architectures while achieving comparable performance with significantly reduced memory usage.

Kin Wai Lau, Yasar Abbas Ur Rehman, Lai-Man Po, Pedro Porto Buarque de Gusmão · Mon, 09 Ma · cs.AI
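
For context, the standard symmetric InfoNCE objective used in this kind of contrastive pretraining looks like the sketch below; the temperature and the batch-level pairing of modalities are generic choices, not Omni-C specifics.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(emb_a: torch.Tensor, emb_b: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE over a batch of matched (modality A, modality B) pairs."""
    a = F.normalize(emb_a, dim=-1)
    b = F.normalize(emb_b, dim=-1)
    logits = a @ b.t() / temperature                     # pairwise similarities
    targets = torch.arange(a.size(0), device=a.device)   # matches on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```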

Cough activity detection for automatic tuberculosis screening

This paper demonstrates that a lightweight configuration of the pre-trained XLS-R model, utilizing only its first three layers, achieves state-of-the-art cough activity detection for automatic tuberculosis screening, significantly outperforming existing baselines while offering the computational efficiency required for smartphone-based deployment.

Joshua Jansen van Vüren, Devendra Singh Parihar, Daphne Naidoo, Kimsey Zajac, Willy Ssengooba, Grant Theron, Thomas Niesler · Fri, 13 Ma · eess
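
The lightweight configuration is easy to reproduce in outline: keep only the first three transformer layers of a pre-trained XLS-R encoder and attach a small detection head. The 300M checkpoint and the mean-pooled linear head are assumptions; the paper's exact head may differ.

```python
import torch.nn as nn
from transformers import Wav2Vec2Model

model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-xls-r-300m")
model.encoder.layers = model.encoder.layers[:3]   # keep 3 of the 24 layers
head = nn.Linear(model.config.hidden_size, 1)     # cough / no-cough logit

def detect(waveform):                             # (batch, samples) float tensor
    hidden = model(waveform).last_hidden_state    # (batch, frames, hidden)
    return head(hidden.mean(dim=1))               # mean-pool over time
```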

Can LLMs Help Localize Fake Words in Partially Fake Speech?

This paper investigates the use of a text-trained large language model adapted for speech to localize fake words in partially edited audio, revealing that while the model effectively identifies edits by leveraging specific training patterns like word-level polarity substitutions, it struggles to generalize to unseen editing styles.

Lin Zhang, Thomas Thebaud, Zexin Cai, Sanjeev Khudanpur, Daniel Povey, Leibny Paola García-Perera, Matthew Wiesner, Nicholas Andrews · Fri, 13 Ma · eess