VoxCare: Studying Natural Communication Behaviors of Hospital Caregivers through Wearable Sensing of Egocentric Audio

VoxCare is a scalable, privacy-preserving wearable system that uses on-device audio processing and speech foundation models to continuously analyze hospital caregivers' natural communication patterns, revealing how these behaviors reflect workload and stress, with the goal of improving healthcare delivery.

Tiantian Feng, Kleanthis Avramidis, Anfeng Xu, Deqi Wang, Brandon M Booth, Shrikanth Narayanan [cs]

Distilling LLM Semantic Priors into Encoder-Only Multi-Talker ASR with Talker-Count Routing

This paper proposes an efficient encoder-only multi-talker ASR framework that distills semantic priors from large language models into the encoder via a talker-aware teacher signal and utilizes a talker-count routing mechanism to achieve competitive performance with significantly lower inference latency compared to autoregressive LLM-based systems.

Hao Shi, Yusuke Fujita, Roman Koshkin, Mengjie Zhao, Yuan Gao, Lianbo Liu, Yui Sudo [cs]
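A minimal sketch of the talker-count routing idea, assuming a pooled count classifier that dispatches encoder features to one of several count-specific output heads; all module names, shapes, and the per-count head design are illustrative assumptions, not taken from the paper.

import torch
import torch.nn as nn

class TalkerCountRouter(nn.Module):
    def __init__(self, dim=256, max_talkers=3, vocab=5000):
        super().__init__()
        self.counter = nn.Linear(dim, max_talkers)    # talker-count logits
        # one output head per possible talker count (k stacked label streams)
        self.heads = nn.ModuleList(
            nn.Linear(dim, vocab * k) for k in range(1, max_talkers + 1)
        )

    def forward(self, enc):                           # enc: (B, T, dim)
        count_logits = self.counter(enc.mean(dim=1))  # pool over time, then count
        k = int(count_logits.argmax(dim=-1)[0]) + 1   # predicted number of talkers
        return self.heads[k - 1](enc), count_logits   # route to the k-talker head

enc = torch.randn(1, 100, 256)                        # dummy encoder output
logits, counts = TalkerCountRouter()(enc)

Routing by predicted talker count keeps the whole pass encoder-only and non-autoregressive, which is where the claimed latency advantage over LLM-decoder systems would come from.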

PRoADS: Provably Secure and Robust Audio Diffusion Steganography with Latent Optimization and Backward Euler Inversion

The paper introduces PRoADS, a provably secure and robust audio steganography framework that embeds secret messages into diffusion model noise via orthogonal projection and employs Latent Optimization with Backward Euler Inversion to minimize reconstruction errors, achieving a remarkably low bit error rate of 0.15% under 64 kbps MP3 compression.

YongPeng Yan, Yanan Li, Qiyang Xiao, Yanzhen Ren [cs]
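A minimal sketch of message embedding in diffusion noise via orthogonal projection, the mechanism the summary names; the random carrier basis, the scaling, and the sign-based extraction rule are illustrative assumptions, not the paper's exact construction.

import numpy as np

rng = np.random.default_rng(0)
n, k = 1024, 64                                   # latent size, message bits
Q, _ = np.linalg.qr(rng.standard_normal((n, k)))  # orthonormal carrier basis

def embed(z, bits, alpha=1.0):
    # remove z's component in span(Q), then write the message bits there
    z_perp = z - Q @ (Q.T @ z)
    return z_perp + Q @ (alpha * bits)

def extract(z_stego):
    return np.sign(Q.T @ z_stego)                 # project back onto the carrier

z = rng.standard_normal(n)
bits = rng.choice([-1.0, 1.0], size=k)
z_stego = embed(z, bits)
assert (extract(z_stego) == bits).all()

Because the message lives in a fixed subspace and the orthogonal complement of the noise is untouched, extraction survives perturbations that mostly leave that subspace intact, which is the intuition behind robustness to lossy compression.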

HyWA: Hypernetwork Weight Adapting Personalized Voice Activity Detection

The paper proposes HyWA, a novel Personalized Voice Activity Detection (PVAD) approach that utilizes a hypernetwork to generate personalized weights for selected layers of a standard VAD model, demonstrating consistent performance improvements and enhanced deployment flexibility compared to existing speaker-conditioning methods.

Mahsa Ghazvini Nejad, Hamed Jafarzadeh Asl, Amin Edraki, Mohammadreza Sadeghi, Masoud Asgharian, Yuanhao Yu, Vahid Partovi Nia [eess]
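A minimal sketch of the hypernetwork idea as described: a small network maps a speaker embedding to the weights of one layer of an otherwise standard VAD model. The GRU front end, dimensions, and choice of adapted layer are hypothetical.

import torch
import torch.nn as nn

class HyperAdaptedVAD(nn.Module):
    def __init__(self, feat=40, hid=64, spk=128):
        super().__init__()
        self.frontend = nn.GRU(feat, hid, batch_first=True)  # generic VAD body
        self.hyper = nn.Linear(spk, hid + 1)   # emits weight vector plus bias

    def forward(self, x, spk_emb):             # x: (B, T, feat), spk_emb: (B, spk)
        h, _ = self.frontend(x)                # (B, T, hid)
        params = self.hyper(spk_emb)           # per-speaker layer parameters
        w, b = params[:, :-1], params[:, -1:]
        logits = torch.einsum("bth,bh->bt", h, w) + b  # personalized output layer
        return torch.sigmoid(logits)           # frame-level target-speaker activity

vad = HyperAdaptedVAD()
probs = vad(torch.randn(2, 50, 40), torch.randn(2, 128))

Generating weights instead of concatenating a speaker embedding to the input leaves the base VAD architecture unchanged, which is presumably what gives the deployment flexibility the summary mentions.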

Robust Audio-Visual Target Speaker Extraction with Emotion-Aware Multiple Enrollment Fusion

This paper proposes a robust Audio-Visual Target Speaker Extraction framework that leverages emotion-aware multiple enrollment fusion, demonstrating that training with high modality missing rates substantially stabilizes performance under real-world signal loss and that fusing single-frame facial images with frame-level lip features yields the best results.

Zhan Jin, Bang Zeng, Peijun Yang, Jiarong Du, Wei Ju, Yao Tian, Juan Liu, Ming Li [eess]
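A minimal sketch of the fusion-with-missing-modalities idea: a static face embedding and frame-level lip features are projected to a shared space and each cue is randomly masked during training. The dimensions, projection layers, additive fusion, and missing rate are assumptions for illustration only.

import torch
import torch.nn as nn

class EnrollmentFusion(nn.Module):
    def __init__(self, face_dim=512, lip_dim=256, out_dim=256, p_missing=0.6):
        super().__init__()
        self.p_missing = p_missing
        self.face_proj = nn.Linear(face_dim, out_dim)
        self.lip_proj = nn.Linear(lip_dim, out_dim)

    def forward(self, face, lips):             # face: (B, face_dim), lips: (B, T, lip_dim)
        f = self.face_proj(face).unsqueeze(1)  # (B, 1, out_dim), broadcast over time
        l = self.lip_proj(lips)                # (B, T, out_dim)
        if self.training:                      # high missing rate at train time
            if torch.rand(()) < self.p_missing:
                f = torch.zeros_like(f)
            if torch.rand(()) < self.p_missing:
                l = torch.zeros_like(l)
        return f + l                           # fused speaker cue for extraction

fusion = EnrollmentFusion().eval()
cue = fusion(torch.randn(2, 512), torch.randn(2, 100, 256))

Zeroing out whole modalities during training forces the extractor to work from whichever cue survives, mirroring cameras dropping frames or faces leaving view at inference time.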

AMB-DSGDN: Adaptive Modality-Balanced Dynamic Semantic Graph Differential Network for Multimodal Emotion Recognition

The paper proposes AMB-DSGDN, a novel network for multimodal emotion recognition that utilizes modality-specific semantic graphs with a differential attention mechanism to filter noise and an adaptive balancing strategy to prevent dominant modalities from suppressing complementary cues, thereby enhancing the accuracy of dynamic emotional state modeling.

Yunsheng Wang, Yuntao Shou, Yilong Tan, Wei Ai, Tao Meng, Keqin Li [cs.AI]
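A minimal sketch of a differential attention step combined with an adaptive per-modality gate, the two mechanisms the summary names; the subtraction-of-two-attention-maps form and the sigmoid gating are assumptions about the general technique, not the paper's exact design.

import torch
import torch.nn as nn
import torch.nn.functional as F

class DiffAttentionGate(nn.Module):
    def __init__(self, dim=128, n_modalities=3):
        super().__init__()
        self.q1, self.k1 = nn.Linear(dim, dim), nn.Linear(dim, dim)
        self.q2, self.k2 = nn.Linear(dim, dim), nn.Linear(dim, dim)
        self.lam = nn.Parameter(torch.tensor(0.5))            # noise-map weight
        self.gates = nn.Parameter(torch.zeros(n_modalities))  # modality balance

    def forward(self, x, modality):            # x: (B, N, dim) node features
        scale = x.shape[-1] ** 0.5
        a1 = F.softmax(self.q1(x) @ self.k1(x).transpose(1, 2) / scale, dim=-1)
        a2 = F.softmax(self.q2(x) @ self.k2(x).transpose(1, 2) / scale, dim=-1)
        attn = a1 - self.lam * a2              # subtract the common-mode "noise" map
        g = torch.sigmoid(self.gates[modality])  # keep one modality from dominating
        return g * (attn @ x)

layer = DiffAttentionGate()
out = layer(torch.randn(2, 10, 128), modality=0)

The subtraction cancels attention mass that both maps assign indiscriminately, while the learned gate rescales each modality's contribution so a strong modality cannot drown out complementary cues.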

Trade-offs between structural richness and communication efficiency in music network representations

This study demonstrates that the choice of musical feature encoding fundamentally reshapes network topology and uncertainty distributions, revealing a critical trade-off: compressed single-feature representations offer high descriptive accuracy with lower model error, while richer multi-feature encodings preserve finer distinctions at the cost of a larger state space and higher model error.

Lluc Bono Rosselló, Robert Jankowski, Hugues Bersini, Marián Boguñá, M. Ángeles Serrano [q-bio]
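A toy illustration of the encoding trade-off described above: the same note stream yields a compact transition network under a pitch-only encoding and a much larger state space when pitch and duration are encoded jointly. The data and both encodings are synthetic stand-ins, not the study's corpus or features.

from collections import Counter
import random

random.seed(0)
notes = [(random.choice("CDEFGAB"), random.choice([0.25, 0.5, 1.0]))
         for _ in range(500)]                  # toy (pitch, duration) stream

def transition_network(states):
    edges = Counter(zip(states, states[1:]))   # directed, weighted transitions
    return set(states), edges

pitch_nodes, pitch_edges = transition_network([p for p, _ in notes])
full_nodes, full_edges = transition_network(notes)

print(len(pitch_nodes), len(pitch_edges))      # compressed encoding: ~7 nodes
print(len(full_nodes), len(full_edges))        # joint encoding: up to 21 nodes

With fewer states each transition is observed more often, so the compressed model's estimates are tighter; the joint encoding distinguishes more musical events but spreads the same data over many more parameters.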