HyWA: Hypernetwork Weight Adapting Personalized Voice Activity Detection

The paper proposes HyWA, a novel Personalized Voice Activity Detection (PVAD) approach that utilizes a hypernetwork to generate personalized weights for selected layers of a standard VAD model, demonstrating consistent performance improvements and enhanced deployment flexibility compared to existing speaker-conditioning methods.

Mahsa Ghazvini Nejad, Hamed Jafarzadeh Asl, Amin Edraki, Mohammadreza Sadeghi, Masoud Asgharian, Yuanhao Yu, Vahid Partovi Nia · Thu, 12 Ma · eess
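The core idea above, a hypernetwork that maps a speaker embedding to the weights of a selected layer of an otherwise standard VAD model, can be sketched as follows. This is a minimal toy illustration, not the paper's architecture: the dimensions, the single linear hypernetwork, and the `tanh` layer are all assumptions for the sake of a self-contained example.

```python
import math
import random

random.seed(0)

# Hypothetical sizes: speaker-embedding dim and the adapted VAD layer's shape.
EMB_DIM, IN_DIM, OUT_DIM = 4, 3, 2

# Toy hypernetwork: one linear map from the speaker embedding to the
# flattened weight matrix of a single selected VAD layer.
H = [[random.gauss(0, 0.1) for _ in range(EMB_DIM)]
     for _ in range(IN_DIM * OUT_DIM)]

def personalized_weights(spk_emb):
    """Generate a per-speaker weight matrix for the adapted layer."""
    flat = [sum(h * e for h, e in zip(row, spk_emb)) for row in H]
    return [flat[i * IN_DIM:(i + 1) * IN_DIM] for i in range(OUT_DIM)]

def adapted_layer(frame, spk_emb):
    """Run the selected VAD layer with speaker-conditioned weights."""
    W = personalized_weights(spk_emb)
    return [math.tanh(sum(w * v for w, v in zip(row, frame))) for row in W]

spk_a = [1.0, 0.0, -1.0, 0.5]   # enrollment embedding, speaker A
spk_b = [-0.5, 1.0, 0.0, 1.0]   # enrollment embedding, speaker B
frame = [0.2, -0.1, 0.4]        # one frame of acoustic features
out_a = adapted_layer(frame, spk_a)
out_b = adapted_layer(frame, spk_b)
```

The same frame yields different activations per speaker because the layer's weights, not just its input, are conditioned on the enrollment embedding; the rest of the VAD model can stay frozen and shared.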

Robust Audio-Visual Target Speaker Extraction with Emotion-Aware Multiple Enrollment Fusion

This paper proposes a robust Audio-Visual Target Speaker Extraction framework that leverages emotion-aware multiple enrollment fusion, demonstrating that training with high modality missing rates significantly enhances performance stability against real-world signal loss while achieving optimal results by fusing single-frame facial images with frame-level lip features.

Zhan Jin, Bang Zeng, Peijun Yang, Jiarong Du, Wei Ju, Yao Tian, Juan Liu, Ming Li · Thu, 12 Ma · eess

Trade-offs between structural richness and communication efficiency in music network representations

This study demonstrates that the choice of musical feature encoding fundamentally reshapes network topology and uncertainty distributions, revealing a critical trade-off where compressed single-feature representations offer high descriptive accuracy with lower model error, while richer multi-feature encodings preserve finer distinctions at the cost of increased state space complexity and higher model error.

Lluc Bono Rosselló, Robert Jankowski, Hugues Bersini, Marián Boguñá, M. Ángeles Serrano · Thu, 12 Ma · q-bio

MOS-Bias: From Hidden Gender Bias to Gender-Aware Speech Quality Assessment

This paper reveals a systematic gender bias in speech quality assessment where male listeners consistently rate audio higher than female listeners, particularly for low-quality speech, and proposes a gender-aware model that learns distinct scoring patterns to improve prediction accuracy and equity.

Wenze Ren, Yi-Cheng Lin, Wen-Chin Huang, Erica Cooper, Ryandhimas E. Zezario, Hsin-Min Wang, Hung-yi Lee, Yu Tsao · Thu, 12 Ma · eess

Geo-ATBench: A Benchmark for Geospatial Audio Tagging with Geospatial Semantic Context

This paper introduces Geo-ATBench, a new benchmark and the Geo-AT task that leverage geospatial semantic context to resolve acoustic ambiguities in multi-label audio tagging, demonstrating through the GeoFusion-AT framework that incorporating location-based priors significantly improves recognition performance and aligns with human judgment.

Yuanbo Hou, Yanru Wu, Qiaoqiao Ren, Shengchen Li, Stephen Roberts, Dick Botteldooren · Thu, 12 Ma · eess

FireRedASR2S: A State-of-the-Art Industrial-Grade All-in-One Automatic Speech Recognition System

The paper introduces FireRedASR2S, a state-of-the-art industrial-grade all-in-one automatic speech recognition system that unifies high-performance modules for speech transcription, voice activity detection, language identification, and punctuation prediction, achieving superior results across Mandarin, Chinese dialects, and English benchmarks compared to existing solutions.

Kaituo Xu, Yan Jia, Kai Huang, Junjie Chen, Wenpeng Li, Kun Liu, Feng-Long Xie, Xu Tang, Yao Hu · Thu, 12 Ma · eess

ParaS2S: Benchmarking and Aligning Spoken Language Models for Paralinguistic-aware Speech-to-Speech Interaction

This paper introduces ParaS2S, a reinforcement learning framework and corresponding benchmark (ParaS2SBench) that utilizes a novel PolyTone-trained automatic judge to effectively align speech-to-speech models with paralinguistic cues, achieving superior performance in response content and speaking style compared to supervised fine-tuning while requiring fewer paired demonstrations.

Shu-wen Yang, Ming Tu, Andy T. Liu, Xinghua Qu, Hung-yi Lee, Lu Lu, Yuxuan Wang, Yonghui Wu · Mon, 09 Ma · eess

The trajectoRIR Database: Room Acoustic Recordings Along a Trajectory of Moving Microphones

This paper introduces the trajectoRIR database, a comprehensive collection of 8,648 stationary room impulse responses and dynamic audio recordings captured by various microphone arrays moving along a controlled L-shaped trajectory, designed to support diverse acoustic signal processing tasks such as source localization, spatial reconstruction, and system identification.

Stefano Damiano, Kathleen MacWilliam, Valerio Lorenzoni, Thomas Dietzen, Toon van Waterschoot · Mon, 09 Ma · eess

Doctor or Patient? Synergizing Diarization and ASR for Code-Switched Hinglish Medical Conditions Extraction

This paper presents a competitive open-source cascaded system that combines EEND-VC speaker diarization and fine-tuned Qwen3 ASR to achieve first place in the DISPLACE-M challenge by effectively extracting medical conditions from overlapping, code-switched Hinglish clinical dialogues.

Séverin Baroudi, Yanis Labrak, Shashi Kumar, Joonas Kalda, Sergio Burdisso, Pawel Cyrta, Juan Ignacio Alvarez-Trejos, Petr Motlicek, Hervé Bredin, Ricard Marxer · Mon, 09 Ma · eess

Cross-linguistic Prosodic Analysis of Autistic and Non-autistic Child Speech in Finnish, French and Slovak

This study analyzes a multilingual corpus of Finnish, French, and Slovak child speech to demonstrate that autistic speakers exhibit a distinct, cross-linguistic prosodic profile characterized by increased intensity variability, clearer voice quality, and reduced temporal dynamics, thereby challenging deficiency-based models in favor of a complex, language-independent acoustic signature.

Ida-Lotta Myllylä, Sofoklis Kakouros · Mon, 09 Ma · eess

Classification of Autistic and Non-Autistic Children's Speech: A Cross-Linguistic Study in Finnish, French, and Slovak

This cross-linguistic study demonstrates that while certain acoustic-prosodic markers of autism in children's speech generalize across Finnish, French, and Slovak, robust classification performance requires language-specific modeling due to significant variations in feature importance and transferability across typologically distinct languages.

Sofoklis Kakouros, Ida-Lotta Myllylä · Mon, 09 Ma · eess

Activation Steering for Accent Adaptation in Speech Foundation Models

This paper proposes a parameter-free activation steering method that identifies accent information within a specific band of middle encoder layers in speech foundation models and corrects accent-induced representation shifts during inference, thereby significantly reducing word error rates across diverse accents without requiring model fine-tuning.

Jinuo Sun, Yang Xiao, Sung Kyun Chung, Qiuchi Hu, Gongping Huang, Eun-Jung Holden, Ting Dang · Mon, 09 Ma · eess
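The inference-time correction described above follows the standard activation-steering recipe: estimate a direction in a middle layer's activation space from paired native/accented examples, then add it to the hidden states at inference with no weight updates. The sketch below is a toy version of that general recipe, not the paper's method; the layer size, the synthetic "accent shift", and the scaling factor `alpha` are all assumptions.

```python
import random

random.seed(1)
D = 8  # hidden size of one middle encoder layer (hypothetical)

# Toy data: accented activations are native activations plus a constant
# representation shift, standing in for an accent-induced offset.
native = [[random.gauss(0, 1) for _ in range(D)] for _ in range(50)]
shift = 0.5
accented = [[v + shift for v in h] for h in native]

def mean(rows):
    """Per-dimension mean over a list of activation vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]

# Steering vector: mean native minus mean accented activation.
steer = [n - a for n, a in zip(mean(native), mean(accented))]

def steered(hidden, alpha=1.0):
    """Add the steering vector to the layer's activations at inference."""
    return [v + alpha * s for v, s in zip(hidden, steer)]

corrected = steered(accented[0])
```

Because the steering vector is estimated once and simply added during the forward pass, the approach is parameter-free in the sense the summary uses: the foundation model's weights are never fine-tuned.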

Efficient Emotion and Speaker Adaptation in LLM-Based TTS via Characteristic-Specific Partial Fine-Tuning

The paper proposes CSP-FT, a characteristic-specific partial fine-tuning strategy that selectively updates only the most and least relevant layers of LLM-based TTS models to achieve superior emotion and speaker adaptation with significantly faster training and reduced catastrophic forgetting compared to full fine-tuning.

Tianrui Wang, Meng Ge, Cheng Gong, Chunyu Qiang, Haoyu Wang, Zikang Huang, Yu Jiang, Ye Ni, Yuheng Lu, Xiaobao Wang, Engsiong Chng, Xie Chen, Longbiao Wang, Jianwu Dang · Mon, 09 Ma · cs
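The layer-selection idea in the summary above, updating only the most and least characteristic-relevant layers while freezing the rest, can be illustrated with a small helper. How CSP-FT actually scores relevance is not stated here; the scores and the symmetric top-k/bottom-k rule below are assumptions for illustration only.

```python
def select_layers(relevance, k=2):
    """Pick the k most and k least relevant layers to fine-tune.

    `relevance` is one hypothetical score per transformer layer; all
    layers outside the returned set would stay frozen.
    """
    order = sorted(range(len(relevance)), key=relevance.__getitem__)
    return sorted(set(order[:k] + order[-k:]))

# Hypothetical per-layer relevance scores for a 6-layer model.
scores = [0.9, 0.1, 0.5, 0.8, 0.05, 0.6]
trainable = select_layers(scores, k=2)
print(trainable)  # layers with the 2 highest and 2 lowest scores
```

Restricting updates to a small, characteristic-specific subset of layers is what yields the faster training and reduced catastrophic forgetting the summary reports relative to full fine-tuning.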