eess.AS papers | Gist.Science

Continual Adaptation for Pacific Indigenous Speech Recognition

This paper presents an empirical study on adapting speech foundation models to low-resource Pacific Indigenous languages, revealing that while strategies like Low-Rank Adaptation offer initial success, they ultimately struggle with catastrophic forgetting and internal representational drift during sequential learning, highlighting the urgent need for robust adaptation frameworks that balance plasticity and stability.

Yang Xiao, Aso Mahmudi, Nick Thieberger, Eliathamby Ambikairajah, Eun-Jung Holden, Ting DangMon, 09 Ma💬 cs.CL

LMU-Based Sequential Learning and Posterior Ensemble Fusion for Cross-Domain Infant Cry Classification

This paper proposes a compact acoustic framework that combines multi-branch CNN feature extraction with an efficient Legendre Memory Unit (LMU) for temporal modeling and a calibrated posterior ensemble fusion strategy to achieve robust, real-time cross-domain infant cry classification despite limited annotations and strong domain shifts.

Niloofar Jazaeri, Hilmi R. Dajani, Marco Janeczek, Martin BouchardMon, 09 Ma🤖 cs.LG

Whisper-RIR-Mega: A Paired Clean-Reverberant Speech Benchmark for ASR Robustness to Room Acoustics

This paper introduces Whisper-RIR-Mega, a benchmark dataset pairing clean LibriSpeech utterances with real room impulse responses to evaluate and demonstrate the performance degradation of various Whisper models under reverberant conditions, while providing open-source resources for reproducible research on robust ASR.

Mandip GoswamiMon, 09 Ma🤖 cs.AI

The Cascade Equivalence Hypothesis: When Do Speech LLMs Behave Like ASR $\rightarrow$ LLM Pipelines?

This paper challenges the assumption that Speech LLMs inherently outperform ASR $\rightarrow$ LLM pipelines by demonstrating through matched-backbone testing and mechanistic analysis that current Speech LLMs often function as expensive cascades relying on text representations, which can even underperform traditional pipelines under noisy conditions.

Jayadev BillaMon, 09 Ma🤖 cs.AI

Purification Before Fusion: Toward Mask-Free Speech Enhancement for Robust Audio-Visual Speech Recognition

This paper proposes a novel end-to-end audio-visual speech recognition framework that integrates speech enhancement via a Conformer-based bottleneck fusion module to implicitly refine noisy audio features without explicit mask generation, thereby preserving semantic integrity and outperforming existing mask-based methods on the LRS3 benchmark under noisy conditions.

Linzhi Wu, Xingyu Zhang, Hao Yuan, Yakun Zhang, Changyan Zheng, Liang Xie, Tiejun Liu, Erwei YinMon, 09 Ma🤖 cs.AI

Whisper-CD: Accurate Long-Form Speech Recognition using Multi-Negative Contrastive Decoding

Whisper-CD is a training-free, inference-time contrastive decoding framework that mitigates hallucinations and repetition in long-form speech recognition by contrasting clean audio logits against a unified objective derived from multiple acoustically motivated negative perturbations, thereby significantly reducing word error rates and improving generation throughput without requiring model retraining.

Hoseong Ahn, Jeongyun Chae, Yoonji Park, Kyuhong ShimMon, 09 Ma🤖 cs.AI

StreamVoiceAnon+: Emotion-Preserving Streaming Speaker Anonymization via Frame-Level Acoustic Distillation

StreamVoiceAnon+ is a streaming speaker anonymization system that preserves emotional content by combining supervised finetuning with neutral-emotion pairs and frame-level acoustic distillation, achieving significant improvements in emotion preservation (49.2% UAR) and intelligibility (5.77% WER) while maintaining strong privacy and zero inference latency overhead.

Nikita Kuzmin, Kong Aik Lee, Eng Siong ChngMon, 09 Ma🤖 cs.AI

Reconstruct! Don't Encode: Self-Supervised Representation Reconstruction Loss for High-Intelligibility and Low-Latency Streaming Neural Audio Codec

This paper introduces JHCodec, a low-latency streaming neural audio codec that utilizes a self-supervised representation reconstruction (SSRR) loss to achieve state-of-the-art intelligibility and convergence speed without requiring additional lookahead or semantic encoder distillation.

Junhyeok Lee, Xiluo He, Jihwan Lee, Helin Wang, Shrikanth Narayanan, Thomas Thebaud, Laureano Moro-Velazquez, Jesús Villalba, Najim DehakMon, 09 Ma🤖 cs.AI

Omni-C: Compressing Heterogeneous Modalities into a Single Dense Encoder

The paper introduces Omni-C, a single dense Transformer encoder that compresses heterogeneous modalities (text, audio, and image) into shared representations via unimodal contrastive pretraining, thereby eliminating the parameter overhead and routing complexity of Mixture-of-Expert architectures while achieving comparable performance with significantly reduced memory usage.

Kin Wai Lau, Yasar Abbas Ur Rehman, Lai-Man Po, Pedro Porto Buarque de GusmãoMon, 09 Ma🤖 cs.AI

Self-Speculative Decoding for LLM-based ASR with CTC Encoder Drafts

This paper proposes a self-speculative decoding framework that leverages a CTC encoder as a draft model to simultaneously accelerate auto-regressive inference and improve word error rates in speech recognition, achieving a 4.4x speedup with minimal accuracy loss on the HuggingFace Open ASR benchmark.

George Saon, Samuel Thomas, Takashi Fukuda, Tohru Nagano, Avihu Dekel, Luis LastrasFri, 13 Ma⚡ eess

Cough activity detection for automatic tuberculosis screening

This paper demonstrates that a lightweight configuration of the pre-trained XLS-R model, utilizing only its first three layers, achieves state-of-the-art cough activity detection for automatic tuberculosis screening, significantly outperforming existing baselines while offering the computational efficiency required for smartphone-based deployment.

Joshua Jansen van Vüren, Devendra Singh Parihar, Daphne Naidoo, Kimsey Zajac, Willy Ssengooba, Grant Theron, Thomas NieslerFri, 13 Ma⚡ eess

Can LLMs Help Localize Fake Words in Partially Fake Speech?

This paper investigates the use of a text-trained large language model adapted for speech to localize fake words in partially edited audio, revealing that while the model effectively identifies edits by leveraging specific training patterns like word-level polarity substitutions, it struggles to generalize to unseen editing styles.

Lin Zhang, Thomas Thebaud, Zexin Cai, Sanjeev Khudanpur, Daniel Povey, Leibny Paola García-Perera, Matthew Wiesner, Nicholas AndrewsFri, 13 Ma⚡ eess

Controllable Dance Generation with Style-Guided Motion Diffusion

This paper proposes Style-Guided Motion Diffusion (SGMD), a novel framework that integrates a Transformer-based architecture with a Style Modulation module and spatial-temporal masking to generate realistic, music-aligned dance sequences that are both stylistically consistent and flexibly controllable for tasks like trajectory generation, in-betweening, and inpainting.

Hongsong Wang, Ying Zhu, Xin Geng + 1 more2026-03-11⚡ eess

Relationship between objective and subjective perceptual measures of speech in individuals with head and neck cancer

This study demonstrates that strong correlations exist between subjective perceptual ratings and objective acoustic measures in head and neck cancer patients, suggesting that a single intelligibility measure may be sufficient for clinical monitoring of speech following chemoradiation treatment.

Bence Mark Halpern, Thomas Tienkamp, Teja Rebernik + 4 more2026-03-10⚡ eess

ExSampling: a system for the real-time ensemble performance of field-recorded environmental sounds

The paper proposes ExSampling, an integrated system combining a recording application and a Deep Learning environment to enable the real-time ensemble performance of field-recorded environmental sounds through automated sound mapping to Ableton Live tracks.

Atsuya Kobayashi, Reo Anzai, Nao Tokui2026-03-10⚡ eess

BabAR: from phoneme recognition to developmental measures of young children's speech production

The paper introduces BabAR, a cross-linguistic automatic phoneme recognition system trained on the newly curated TinyVox corpus of over half a million child vocalizations, which effectively supports large-scale developmental speech analysis by demonstrating that multilingual pretraining and contextual fine-tuning yield accurate measures of speech maturity.

Marvin Lavechin, Elika Bergelson, Roger Levy2026-03-06⚡ eess

Voice Timbre Attribute Detection with Compact and Interpretable Training-Free Acoustic Parameters

This paper introduces a compact, training-free, and interpretable set of acoustic parameters for voice timbre attribute detection that achieves competitive performance against deep learning models while offering explicit physical insights into timbre perception.

Aemon Yat Fei Chiu, Yujia Xiao, Qiuqiang Kong + 1 more2026-03-06⚡ eess

The PARLO Dementia Corpus: A German Multi-Center Resource for Alzheimer's Disease

This paper introduces the PARLO Dementia Corpus (PDC), the first publicly available, multi-center German dataset featuring speech recordings, transcriptions, and clinical metadata from Alzheimer's patients and healthy controls, which enables scalable, non-invasive speech-based detection and assessment of cognitive impairment.

Franziska Braun, Christopher Witzl, Florian Hönig + 3 more2026-03-06⚡ eess

Benchmarking Speech Systems for Frontline Health Conversations: The DISPLACE-M Challenge

This paper presents the DISPLACE-M challenge, a benchmark for conversational AI in noisy, multi-speaker medical settings, by releasing a dataset of frontline health worker interactions and evaluating baseline systems across speaker diarization, speech recognition, topic identification, and dialogue summarization tasks.

Dhanya E, Ankita Meena, Manas Nanivadekar + 11 more2026-03-06⚡ eess

Hierarchical Decoding for Discrete Speech Synthesis with Multi-Resolution Spoof Detection

The paper proposes MSpoof-TTS, a training-free inference framework that enhances zero-shot discrete speech synthesis by integrating multi-resolution token-based spoof detection into a hierarchical decoding strategy to prune artifacts and improve perceptual realism without retraining.

Junchuan Zhao, Minh Duc Vu, Ye Wang2026-03-06💻 cs

← Previous Next →

eess.AS