Continual Adaptation for Pacific Indigenous Speech Recognition

This paper presents an empirical study on adapting speech foundation models to low-resource Pacific Indigenous languages. It finds that while strategies such as Low-Rank Adaptation succeed initially, they struggle with catastrophic forgetting and internal representational drift during sequential learning, underscoring the need for robust adaptation frameworks that balance plasticity and stability.

Yang Xiao, Aso Mahmudi, Nick Thieberger, Eliathamby Ambikairajah, Eun-Jung Holden, Ting Dang · Mon, 09 Ma · cs.CL
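Low-Rank Adaptation, one of the strategies studied above, freezes the pretrained weight matrix and trains only a small low-rank update. A minimal NumPy sketch of the idea (hypothetical shapes and scaling, not the authors' implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

d, k, r = 8, 8, 2       # hypothetical layer dims and LoRA rank
alpha = 4.0             # LoRA scaling hyperparameter

W = rng.standard_normal((d, k))         # frozen pretrained weight
A = rng.standard_normal((r, k)) * 0.01  # trainable down-projection
B = np.zeros((d, r))                    # trainable up-projection, zero-initialized

def lora_forward(x):
    # Effective weight is W + (alpha / r) * B @ A; W itself is never updated.
    return x @ (W + (alpha / r) * B @ A).T

x = rng.standard_normal((1, k))
# With B initialized to zero, the adapted layer matches the frozen layer exactly,
# so training starts from the pretrained model's behavior.
assert np.allclose(lora_forward(x), x @ W.T)
```

Only `A` and `B` receive gradients during adaptation, which is why LoRA is cheap enough for sequential, per-language updates — and also why the paper's observed representational drift is notable despite the frozen backbone.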

LMU-Based Sequential Learning and Posterior Ensemble Fusion for Cross-Domain Infant Cry Classification

This paper proposes a compact acoustic framework that combines multi-branch CNN feature extraction with an efficient Legendre Memory Unit (LMU) for temporal modeling and a calibrated posterior ensemble fusion strategy to achieve robust, real-time cross-domain infant cry classification despite limited annotations and strong domain shifts.

Niloofar Jazaeri, Hilmi R. Dajani, Marco Janeczek, Martin Bouchard · Mon, 09 Ma · cs.LG
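The Legendre Memory Unit at the core of this framework maintains a compressed sliding-window memory through a fixed linear state-space system. A rough NumPy sketch of the memory update, using the standard Legendre delay matrices and a simple Euler discretization (order, window length, and step size here are illustrative):

```python
import numpy as np

def ldn_matrices(order):
    # Fixed (non-learned) state-space matrices of the Legendre delay network.
    i = np.arange(order)[:, None]
    j = np.arange(order)[None, :]
    A = np.where(i < j, -1.0, (-1.0) ** (i - j + 1)) * (2 * i + 1)
    B = ((2 * np.arange(order) + 1) * (-1.0) ** np.arange(order)).reshape(order, 1)
    return A, B

def lmu_memory(u, order=4, theta=100.0, dt=1.0):
    # Euler-discretized update: m_t = m_{t-1} + (dt/theta) * (A m_{t-1} + B u_t).
    # The state m compresses the last ~theta timesteps of u onto Legendre polynomials.
    A, B = ldn_matrices(order)
    m = np.zeros((order, 1))
    states = []
    for u_t in u:
        m = m + (dt / theta) * (A @ m + B * u_t)
        states.append(m.ravel().copy())
    return np.array(states)   # shape: (timesteps, order)

signal = np.sin(np.linspace(0.0, 6.28, 200))  # toy input sequence
mem = lmu_memory(signal)
```

Because `A` and `B` are fixed by the mathematics of delayed reconstruction rather than learned, the temporal memory adds very few trainable parameters — the property that makes the LMU attractive for a compact, real-time classifier.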

Purification Before Fusion: Toward Mask-Free Speech Enhancement for Robust Audio-Visual Speech Recognition

This paper proposes a novel end-to-end audio-visual speech recognition framework that integrates speech enhancement via a Conformer-based bottleneck fusion module to implicitly refine noisy audio features without explicit mask generation, thereby preserving semantic integrity and outperforming existing mask-based methods on the LRS3 benchmark under noisy conditions.

Linzhi Wu, Xingyu Zhang, Hao Yuan, Yakun Zhang, Changyan Zheng, Liang Xie, Tiejun Liu, Erwei Yin · Mon, 09 Ma · cs.AI

Whisper-CD: Accurate Long-Form Speech Recognition using Multi-Negative Contrastive Decoding

Whisper-CD is a training-free, inference-time contrastive decoding framework that mitigates hallucination and repetition in long-form speech recognition. It contrasts clean-audio logits against a unified objective derived from multiple acoustically motivated negative perturbations, significantly reducing word error rates and improving generation throughput without retraining the model.

Hoseong Ahn, Jeongyun Chae, Yoonji Park, Kyuhong Shim · Mon, 09 Ma · cs.AI
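The contrastive decoding idea can be sketched as follows: combine the logits obtained from several perturbed (negative) versions of the audio into one negative distribution, then penalize tokens that the negatives also favor. A minimal NumPy illustration — averaging the negatives is just one simple choice, not necessarily Whisper-CD's exact unified objective:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def contrastive_logits(clean_logits, negative_logits_list, lam=0.5):
    # Unify multiple negatives by averaging their logits, then subtract
    # the unified negative from the clean logits, scaled by lambda.
    neg = np.mean(negative_logits_list, axis=0)
    return clean_logits - lam * neg

clean = np.array([2.0, 1.0, 0.5])            # logits under clean audio
negs = [np.array([2.0, 0.0, 0.0]),           # e.g. logits under noise perturbation
        np.array([1.5, 0.2, 0.1])]           # e.g. logits under silence perturbation
adjusted = contrastive_logits(clean, negs)
probs = softmax(adjusted)
# Token 0, which the negatives also favor (a hallucination-like pattern),
# is down-weighted relative to plain decoding of the clean logits.
```

Tokens that remain probable even when the acoustics are destroyed are exactly the ones the language-model prior hallucinates, which is why subtracting the negative logits suppresses them without any retraining.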

StreamVoiceAnon+: Emotion-Preserving Streaming Speaker Anonymization via Frame-Level Acoustic Distillation

StreamVoiceAnon+ is a streaming speaker anonymization system that preserves emotional content by combining supervised finetuning with neutral-emotion pairs and frame-level acoustic distillation, achieving significant improvements in emotion preservation (49.2% UAR) and intelligibility (5.77% WER) while maintaining strong privacy and zero inference latency overhead.

Nikita Kuzmin, Kong Aik Lee, Eng Siong Chng · Mon, 09 Ma · cs.AI
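Frame-level acoustic distillation, as named above, amounts to matching the student's per-frame acoustic features against a teacher's, frame by frame. A minimal sketch of one common form of such a loss (mean squared error; the paper's exact objective and feature choice are not specified here):

```python
import numpy as np

def frame_level_distillation_loss(student_feats, teacher_feats):
    # MSE computed per frame, then averaged over frames: every student frame
    # is pushed toward the teacher's acoustic representation at that frame,
    # which is what lets a streaming model track emotion cues frame by frame.
    assert student_feats.shape == teacher_feats.shape  # (frames, feat_dim)
    per_frame = ((student_feats - teacher_feats) ** 2).mean(axis=1)
    return per_frame.mean()

rng = np.random.default_rng(0)
teacher = rng.standard_normal((100, 32))                 # hypothetical (frames, dims)
student = teacher + 0.1 * rng.standard_normal((100, 32)) # imperfect student
loss = frame_level_distillation_loss(student, teacher)
```

Operating at the frame level, rather than on utterance-level embeddings, is what makes the supervision compatible with streaming inference: each incoming frame carries its own target.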

Reconstruct! Don't Encode: Self-Supervised Representation Reconstruction Loss for High-Intelligibility and Low-Latency Streaming Neural Audio Codec

This paper introduces JHCodec, a low-latency streaming neural audio codec that utilizes a self-supervised representation reconstruction (SSRR) loss to achieve state-of-the-art intelligibility and convergence speed without requiring additional lookahead or semantic encoder distillation.

Junhyeok Lee, Xiluo He, Jihwan Lee, Helin Wang, Shrikanth Narayanan, Thomas Thebaud, Laureano Moro-Velazquez, Jesús Villalba, Najim Dehak · Mon, 09 Ma · cs.AI

Omni-C: Compressing Heterogeneous Modalities into a Single Dense Encoder

The paper introduces Omni-C, a single dense Transformer encoder that compresses heterogeneous modalities (text, audio, and image) into shared representations via unimodal contrastive pretraining, thereby eliminating the parameter overhead and routing complexity of Mixture-of-Expert architectures while achieving comparable performance with significantly reduced memory usage.

Kin Wai Lau, Yasar Abbas Ur Rehman, Lai-Man Po, Pedro Porto Buarque de Gusmão · Mon, 09 Ma · cs.AI
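Contrastive pretraining over paired modalities is typically formulated as a symmetric InfoNCE loss: matching pairs across modalities are positives, everything else in the batch is a negative. A generic NumPy sketch of that loss (illustrative only — Omni-C's actual objective and batching are not reproduced here):

```python
import numpy as np

def info_nce(z_a, z_b, temperature=0.07):
    # Symmetric contrastive loss over a batch of paired embeddings:
    # the diagonal of the similarity matrix holds the positives.
    z_a = z_a / np.linalg.norm(z_a, axis=1, keepdims=True)
    z_b = z_b / np.linalg.norm(z_b, axis=1, keepdims=True)
    logits = z_a @ z_b.T / temperature      # (batch, batch) cosine similarities
    labels = np.arange(len(z_a))

    def ce(lg):
        lg = lg - lg.max(axis=1, keepdims=True)
        logp = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()  # cross-entropy toward the diagonal

    return 0.5 * (ce(logits) + ce(logits.T))

rng = np.random.default_rng(0)
text_emb = rng.standard_normal((4, 16))                      # hypothetical text batch
audio_emb = text_emb + 0.01 * rng.standard_normal((4, 16))   # near-aligned audio batch
loss = info_nce(text_emb, audio_emb)
```

Because the loss only constrains embeddings in a shared space, a single dense encoder can serve all modalities — the property Omni-C exploits to avoid per-modality experts and routing.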

Cough activity detection for automatic tuberculosis screening

This paper demonstrates that a lightweight configuration of the pre-trained XLS-R model, utilizing only its first three layers, achieves state-of-the-art cough activity detection for automatic tuberculosis screening, significantly outperforming existing baselines while offering the computational efficiency required for smartphone-based deployment.

Joshua Jansen van Vüren, Devendra Singh Parihar, Daphne Naidoo, Kimsey Zajac, Willy Ssengooba, Grant Theron, Thomas Niesler · Fri, 13 Ma · eess
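Using only the first few layers of a large pretrained encoder is a simple but effective compression strategy: lower layers tend to capture the low-level acoustic cues a task like cough detection needs. A schematic NumPy sketch of the idea (toy stand-in layers, not XLS-R's actual architecture or API):

```python
import numpy as np

class TinyEncoderLayer:
    # Toy stand-in for one transformer block of a speech encoder.
    def __init__(self, dim, rng):
        self.W = rng.standard_normal((dim, dim)) / np.sqrt(dim)

    def __call__(self, x):
        return np.maximum(x @ self.W, 0.0)

rng = np.random.default_rng(0)
full_stack = [TinyEncoderLayer(16, rng) for _ in range(24)]  # deep pretrained stack
truncated = full_stack[:3]   # keep only the first three layers, as in the paper

x = rng.standard_normal((10, 16))   # (frames, feature_dim) toy input
h = x
for layer in truncated:
    h = layer(h)
# h now feeds a lightweight classification head instead of 21 more layers,
# cutting compute roughly in proportion to the layers dropped.
```

The saving is what makes on-device, smartphone-based screening plausible: the truncated encoder does a small fraction of the full model's work per frame.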

Can LLMs Help Localize Fake Words in Partially Fake Speech?

This paper investigates the use of a text-trained large language model adapted for speech to localize fake words in partially edited audio, revealing that while the model effectively identifies edits by leveraging specific training patterns like word-level polarity substitutions, it struggles to generalize to unseen editing styles.

Lin Zhang, Thomas Thebaud, Zexin Cai, Sanjeev Khudanpur, Daniel Povey, Leibny Paola García-Perera, Matthew Wiesner, Nicholas Andrews · Fri, 13 Ma · eess

BabAR: from phoneme recognition to developmental measures of young children's speech production

The paper introduces BabAR, a cross-linguistic automatic phoneme recognition system trained on the newly curated TinyVox corpus of over half a million child vocalizations, which effectively supports large-scale developmental speech analysis by demonstrating that multilingual pretraining and contextual fine-tuning yield accurate measures of speech maturity.

Marvin Lavechin, Elika Bergelson, Roger Levy · 2026-03-06 · eess