eess.AS papers | Gist.Science

Multi-View Based Audio Visual Target Speaker Extraction

This paper proposes Multi-View Tensor Fusion (MVTF), a novel framework that leverages synchronized multi-perspective lip videos during training to learn cross-view correlations, thereby significantly enhancing target speaker extraction performance and robustness for both single-view and multi-view inference scenarios.

Peijun Yang, Zhan Jin, Juan Liu, Ming Li | Thu, 12 Ma | ⚡ eess

HyWA: Hypernetwork Weight Adapting Personalized Voice Activity Detection

The paper proposes HyWA, a novel Personalized Voice Activity Detection (PVAD) approach that utilizes a hypernetwork to generate personalized weights for selected layers of a standard VAD model, demonstrating consistent performance improvements and enhanced deployment flexibility compared to existing speaker-conditioning methods.

Mahsa Ghazvini Nejad, Hamed Jafarzadeh Asl, Amin Edraki, Mohammadreza Sadeghi, Masoud Asgharian, Yuanhao Yu, Vahid Partovi Nia | Thu, 12 Ma | ⚡ eess

Robust Audio-Visual Target Speaker Extraction with Emotion-Aware Multiple Enrollment Fusion

This paper proposes a robust Audio-Visual Target Speaker Extraction framework that leverages emotion-aware multiple enrollment fusion, demonstrating that training with high modality missing rates significantly enhances performance stability against real-world signal loss while achieving optimal results by fusing single-frame facial images with frame-level lip features.

Zhan Jin, Bang Zeng, Peijun Yang, Jiarong Du, Wei Ju, Yao Tian, Juan Liu, Ming Li | Thu, 12 Ma | ⚡ eess

nlm: Real-Time Non-linear Modal Synthesis in Max

This paper introduces nlm, an open-source set of C++ Max externals that enables efficient, real-time non-linear modal synthesis for strings, membranes, and plates, thereby making advanced physical modeling accessible to composers and sound designers through interactive parameter control and multichannel output.

Rodrigo Diaz, Rodrigo Constanzo, Mark Sandler | Thu, 12 Ma | ⚡ eess

Trade-offs between structural richness and communication efficiency in music network representations

This study demonstrates that the choice of musical feature encoding fundamentally reshapes network topology and uncertainty distributions, revealing a critical trade-off where compressed single-feature representations offer high descriptive accuracy with lower model error, while richer multi-feature encodings preserve finer distinctions at the cost of increased state space complexity and higher model error.

Lluc Bono Rosselló, Robert Jankowski, Hugues Bersini, Marián Boguñá, M. Ángeles Serrano | Thu, 12 Ma | 🧬 q-bio

MOS-Bias: From Hidden Gender Bias to Gender-Aware Speech Quality Assessment

This paper reveals a systematic gender bias in speech quality assessment where male listeners consistently rate audio higher than female listeners, particularly for low-quality speech, and proposes a gender-aware model that learns distinct scoring patterns to improve prediction accuracy and equity.

Wenze Ren, Yi-Cheng Lin, Wen-Chin Huang, Erica Cooper, Ryandhimas E. Zezario, Hsin-Min Wang, Hung-yi Lee, Yu Tsao | Thu, 12 Ma | ⚡ eess

Geo-ATBench: A Benchmark for Geospatial Audio Tagging with Geospatial Semantic Context

This paper introduces Geo-ATBench, a new benchmark and the Geo-AT task that leverage geospatial semantic context to resolve acoustic ambiguities in multi-label audio tagging, demonstrating through the GeoFusion-AT framework that incorporating location-based priors significantly improves recognition performance and aligns with human judgment.

Yuanbo Hou, Yanru Wu, Qiaoqiao Ren, Shengchen Li, Stephen Roberts, Dick Botteldooren | Thu, 12 Ma | ⚡ eess

G-STAR: End-to-End Global Speaker-Tracking Attributed Recognition

The paper proposes G-STAR, an end-to-end system that integrates a time-aware speaker-tracking module with a Speech-LLM backbone to achieve robust, timestamped speaker-attributed recognition for long-form, overlapping multi-party speech while maintaining global identity consistency.

Jing Peng, Ziyi Chen, Haoyu Li, Yucheng Wang, Duo Ma, Mengtian Li, Yunfan Du, Dezhu Xu, Kai Yu, Shuai Wang | Thu, 12 Ma | ⚡ eess

FireRedASR2S: A State-of-the-Art Industrial-Grade All-in-One Automatic Speech Recognition System

The paper introduces FireRedASR2S, a state-of-the-art industrial-grade all-in-one automatic speech recognition system that unifies high-performance modules for speech transcription, voice activity detection, language identification, and punctuation prediction, achieving superior results across Mandarin, Chinese dialects, and English benchmarks compared to existing solutions.

Kaituo Xu, Yan Jia, Kai Huang, Junjie Chen, Wenpeng Li, Kun Liu, Feng-Long Xie, Xu Tang, Yao Hu | Thu, 12 Ma | ⚡ eess

Speech Codec Probing from Semantic and Phonetic Perspectives

This paper systematically analyzes widely used speech tokenizers and reveals that they primarily encode phonetic rather than lexical-semantic information, highlighting a critical mismatch with text-derived semantics that necessitates new design approaches for effective multimodal LLM integration.

Xuan Shi, Chang Zeng, Tiantian Feng, Shih-Heng Wang, Jianbo Ma, Shrikanth Narayanan | Thu, 12 Ma | ⚡ eess

Calibration-Reasoning Framework for Descriptive Speech Quality Assessment

This paper introduces a calibration-reasoning framework that fine-tunes foundational Audio Large Language Models through a calibration stage and Group Relative Policy Optimization-based reinforcement learning to achieve state-of-the-art performance in multidimensional speech quality assessment, artifact localization, and MOS prediction.

Elizaveta Kostenok, Mathieu Salzmann, Milos Cernak | Thu, 12 Ma | ⚡ eess

ParaS2S: Benchmarking and Aligning Spoken Language Models for Paralinguistic-aware Speech-to-Speech Interaction

This paper introduces ParaS2S, a reinforcement learning framework and corresponding benchmark (ParaS2SBench) that utilizes a novel PolyTone-trained automatic judge to effectively align speech-to-speech models with paralinguistic cues, achieving superior performance in response content and speaking style compared to supervised fine-tuning while requiring fewer paired demonstrations.

Shu-wen Yang, Ming Tu, Andy T. Liu, Xinghua Qu, Hung-yi Lee, Lu Lu, Yuxuan Wang, Yonghui Wu | Mon, 09 Ma | ⚡ eess

The trajectoRIR Database: Room Acoustic Recordings Along a Trajectory of Moving Microphones

This paper introduces the trajectoRIR database, a comprehensive collection of 8,648 stationary room impulse responses and dynamic audio recordings captured by various microphone arrays moving along a controlled L-shaped trajectory, designed to support diverse acoustic signal processing tasks such as source localization, spatial reconstruction, and system identification.

Stefano Damiano, Kathleen MacWilliam, Valerio Lorenzoni, Thomas Dietzen, Toon van Waterschoot | Mon, 09 Ma | ⚡ eess

Doctor or Patient? Synergizing Diarization and ASR for Code-Switched Hinglish Medical Conditions Extraction

This paper presents a competitive open-source cascaded system that combines EEND-VC speaker diarization and fine-tuned Qwen3 ASR to achieve first place in the DISPLACE-M challenge by effectively extracting medical conditions from overlapping, code-switched Hinglish clinical dialogues.

Séverin Baroudi, Yanis Labrak, Shashi Kumar, Joonas Kalda, Sergio Burdisso, Pawel Cyrta, Juan Ignacio Alvarez-Trejos, Petr Motlicek, Hervé Bredin, Ricard Marxer | Mon, 09 Ma | ⚡ eess

Cross-linguistic Prosodic Analysis of Autistic and Non-autistic Child Speech in Finnish, French and Slovak

This study analyzes a multilingual corpus of Finnish, French, and Slovak child speech to demonstrate that autistic speakers exhibit a distinct, cross-linguistic prosodic profile characterized by increased intensity variability, clearer voice quality, and reduced temporal dynamics, thereby challenging deficiency-based models in favor of a complex, language-independent acoustic signature.

Ida-Lotta Myllylä, Sofoklis Kakouros | Mon, 09 Ma | ⚡ eess

Classification of Autistic and Non-Autistic Children's Speech: A Cross-Linguistic Study in Finnish, French, and Slovak

This cross-linguistic study demonstrates that while certain acoustic-prosodic markers of autism in children's speech generalize across Finnish, French, and Slovak, robust classification performance requires language-specific modeling due to significant variations in feature importance and transferability across typologically distinct languages.

Sofoklis Kakouros, Ida-Lotta Myllylä | Mon, 09 Ma | ⚡ eess

Activation Steering for Accent-Neutralized Zero-Shot Text-To-Speech

This paper introduces a training-free, post-hoc method called activation steering that neutralizes accents in zero-shot Text-to-Speech while preserving speaker timbre by applying offline-extracted steering vectors during inference.

Mu Yang, John H. L. Hansen | Mon, 09 Ma | ⚡ eess

ImKWS: Test-Time Adaptation for Keyword Spotting with Class Imbalance

ImKWS is a novel test-time adaptation method for keyword spotting that addresses severe class imbalance between rare keywords and background noise by employing a dual-branch entropy minimization strategy with separate update strengths and multi-transformation consistency, thereby preventing model overconfidence and bias without requiring labeled data.

Hanyu Ding, Yang Xiao, Jiaheng Dong, Ting Dang | Mon, 09 Ma | ⚡ eess

Activation Steering for Accent Adaptation in Speech Foundation Models

This paper proposes a parameter-free activation steering method that identifies accent information within a specific band of middle encoder layers in speech foundation models and corrects accent-induced representation shifts during inference, thereby significantly reducing word error rates across diverse accents without requiring model fine-tuning.

Jinuo Sun, Yang Xiao, Sung Kyun Chung, Qiuchi Hu, Gongping Huang, Eun-Jung Holden, Ting Dang | Mon, 09 Ma | ⚡ eess

Efficient Emotion and Speaker Adaptation in LLM-Based TTS via Characteristic-Specific Partial Fine-Tuning

The paper proposes CSP-FT, a characteristic-specific partial fine-tuning strategy that selectively updates only the most and least relevant layers of LLM-based TTS models to achieve superior emotion and speaker adaptation with significantly faster training and reduced catastrophic forgetting compared to full fine-tuning.

Tianrui Wang, Meng Ge, Cheng Gong, Chunyu Qiang, Haoyu Wang, Zikang Huang, Yu Jiang, Ye Ni, Yuheng Lu, Xiaobao Wang, Engsiong Chng, Xie Chen, Longbiao Wang, Jianwu Dang | Mon, 09 Ma | 💻 cs
