VoxCare: Studying Natural Communication Behaviors of Hospital Caregivers through Wearable Sensing of Egocentric Audio

VoxCare is a scalable, privacy-preserving wearable system that uses on-device audio processing and speech foundation models to continuously analyze hospital caregivers' natural communication patterns, revealing how these behaviors reflect workload and stress, with the goal of improving healthcare delivery.

Tiantian Feng, Kleanthis Avramidis, Anfeng Xu, Deqi Wang, Brandon M Booth, Shrikanth Narayanan [cs]

Distilling LLM Semantic Priors into Encoder-Only Multi-Talker ASR with Talker-Count Routing

This paper proposes an efficient encoder-only multi-talker ASR framework that distills semantic priors from large language models into the encoder via a talker-aware teacher signal and utilizes a talker-count routing mechanism to achieve competitive performance with significantly lower inference latency compared to autoregressive LLM-based systems.

Hao Shi, Yusuke Fujita, Roman Koshkin, Mengjie Zhao, Yuan Gao, Lianbo Liu, Yui Sudo [cs]
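A minimal sketch of the talker-count routing idea, assuming a pooled count classifier that dispatches encoder features to one of several count-specific output heads; all module names, shapes, and the per-count head design are illustrative assumptions, not taken from the paper.

import torch
import torch.nn as nn

class TalkerCountRouter(nn.Module):
    def __init__(self, dim=256, max_talkers=3, vocab=5000):
        super().__init__()
        self.counter = nn.Linear(dim, max_talkers)    # talker-count logits
        # one output head per possible talker count (k stacked label streams)
        self.heads = nn.ModuleList(
            nn.Linear(dim, vocab * k) for k in range(1, max_talkers + 1)
        )

    def forward(self, enc):                           # enc: (B, T, dim)
        count_logits = self.counter(enc.mean(dim=1))  # pool over time, then count
        k = int(count_logits.argmax(dim=-1)[0]) + 1   # predicted number of talkers
        return self.heads[k - 1](enc), count_logits   # route to the k-talker head

enc = torch.randn(1, 100, 256)                        # dummy encoder output
logits, counts = TalkerCountRouter()(enc)

Routing by predicted talker count keeps the whole pass encoder-only and non-autoregressive, which is where the claimed latency advantage over LLM-decoder systems would come from.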

PRoADS: Provably Secure and Robust Audio Diffusion Steganography with Latent Optimization and Backward Euler Inversion

The paper introduces PRoADS, a provably secure and robust audio steganography framework that embeds secret messages into diffusion model noise via orthogonal projection and employs Latent Optimization with Backward Euler Inversion to minimize reconstruction errors, achieving a remarkably low bit error rate of 0.15% under 64 kbps MP3 compression.

YongPeng Yan, Yanan Li, Qiyang Xiao, Yanzhen Ren [cs]
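A minimal sketch of message embedding in diffusion noise via orthogonal projection, the mechanism the summary names; the random carrier basis, the scaling, and the sign-based extraction rule are illustrative assumptions, not the paper's exact construction.

import numpy as np

rng = np.random.default_rng(0)
n, k = 1024, 64                                   # latent size, message bits
Q, _ = np.linalg.qr(rng.standard_normal((n, k)))  # orthonormal carrier basis

def embed(z, bits, alpha=1.0):
    # remove z's component in span(Q), then write the message bits there
    z_perp = z - Q @ (Q.T @ z)
    return z_perp + Q @ (alpha * bits)

def extract(z_stego):
    return np.sign(Q.T @ z_stego)                 # project back onto the carrier

z = rng.standard_normal(n)
bits = rng.choice([-1.0, 1.0], size=k)
z_stego = embed(z, bits)
assert (extract(z_stego) == bits).all()

Because the message lives in a fixed subspace and the orthogonal complement of the noise is untouched, extraction survives perturbations that mostly leave that subspace intact, which is the intuition behind robustness to lossy compression.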

HyWA: Hypernetwork Weight Adapting Personalized Voice Activity Detection

The paper proposes HyWA, a novel Personalized Voice Activity Detection (PVAD) approach that utilizes a hypernetwork to generate personalized weights for selected layers of a standard VAD model, demonstrating consistent performance improvements and enhanced deployment flexibility compared to existing speaker-conditioning methods.

Mahsa Ghazvini Nejad, Hamed Jafarzadeh Asl, Amin Edraki, Mohammadreza Sadeghi, Masoud Asgharian, Yuanhao Yu, Vahid Partovi Nia [eess]
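A minimal sketch of the hypernetwork idea as described: a small network maps a speaker embedding to the weights of one layer of an otherwise standard VAD model. The GRU front end, dimensions, and choice of adapted layer are hypothetical.

import torch
import torch.nn as nn

class HyperAdaptedVAD(nn.Module):
    def __init__(self, feat=40, hid=64, spk=128):
        super().__init__()
        self.frontend = nn.GRU(feat, hid, batch_first=True)  # generic VAD body
        self.hyper = nn.Linear(spk, hid + 1)   # emits weight vector plus bias

    def forward(self, x, spk_emb):             # x: (B, T, feat), spk_emb: (B, spk)
        h, _ = self.frontend(x)                # (B, T, hid)
        params = self.hyper(spk_emb)           # per-speaker layer parameters
        w, b = params[:, :-1], params[:, -1:]
        logits = torch.einsum("bth,bh->bt", h, w) + b  # personalized output layer
        return torch.sigmoid(logits)           # frame-level target-speaker activity

vad = HyperAdaptedVAD()
probs = vad(torch.randn(2, 50, 40), torch.randn(2, 128))

Generating weights instead of concatenating a speaker embedding to the input leaves the base VAD architecture unchanged, which is presumably what gives the deployment flexibility the summary mentions.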

Robust Audio-Visual Target Speaker Extraction with Emotion-Aware Multiple Enrollment Fusion

This paper proposes a robust Audio-Visual Target Speaker Extraction framework that leverages emotion-aware multiple enrollment fusion, demonstrating that training with high modality missing rates substantially stabilizes performance under real-world signal loss and that fusing single-frame facial images with frame-level lip features yields the best results.

Zhan Jin, Bang Zeng, Peijun Yang, Jiarong Du, Wei Ju, Yao Tian, Juan Liu, Ming Li [eess]
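A minimal sketch of the fusion-with-missing-modalities idea: a static face embedding and frame-level lip features are projected to a shared space and each cue is randomly masked during training. The dimensions, projection layers, additive fusion, and missing rate are assumptions for illustration only.

import torch
import torch.nn as nn

class EnrollmentFusion(nn.Module):
    def __init__(self, face_dim=512, lip_dim=256, out_dim=256, p_missing=0.6):
        super().__init__()
        self.p_missing = p_missing
        self.face_proj = nn.Linear(face_dim, out_dim)
        self.lip_proj = nn.Linear(lip_dim, out_dim)

    def forward(self, face, lips):             # face: (B, face_dim), lips: (B, T, lip_dim)
        f = self.face_proj(face).unsqueeze(1)  # (B, 1, out_dim), broadcast over time
        l = self.lip_proj(lips)                # (B, T, out_dim)
        if self.training:                      # high missing rate at train time
            if torch.rand(()) < self.p_missing:
                f = torch.zeros_like(f)
            if torch.rand(()) < self.p_missing:
                l = torch.zeros_like(l)
        return f + l                           # fused speaker cue for extraction

fusion = EnrollmentFusion().eval()
cue = fusion(torch.randn(2, 512), torch.randn(2, 100, 256))

Zeroing out whole modalities during training forces the extractor to work from whichever cue survives, mirroring cameras dropping frames or faces leaving view at inference time.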

AMB-DSGDN: Adaptive Modality-Balanced Dynamic Semantic Graph Differential Network for Multimodal Emotion Recognition

The paper proposes AMB-DSGDN, a novel network for multimodal emotion recognition that utilizes modality-specific semantic graphs with a differential attention mechanism to filter noise and an adaptive balancing strategy to prevent dominant modalities from suppressing complementary cues, thereby enhancing the accuracy of dynamic emotional state modeling.

Yunsheng Wang, Yuntao Shou, Yilong Tan, Wei Ai, Tao Meng, Keqin Li [cs.AI]
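A minimal sketch of a differential attention step combined with an adaptive per-modality gate, the two mechanisms the summary names; the subtraction-of-two-attention-maps form and the sigmoid gating are assumptions about the general technique, not the paper's exact design.

import torch
import torch.nn as nn
import torch.nn.functional as F

class DiffAttentionGate(nn.Module):
    def __init__(self, dim=128, n_modalities=3):
        super().__init__()
        self.q1, self.k1 = nn.Linear(dim, dim), nn.Linear(dim, dim)
        self.q2, self.k2 = nn.Linear(dim, dim), nn.Linear(dim, dim)
        self.lam = nn.Parameter(torch.tensor(0.5))            # noise-map weight
        self.gates = nn.Parameter(torch.zeros(n_modalities))  # modality balance

    def forward(self, x, modality):            # x: (B, N, dim) node features
        scale = x.shape[-1] ** 0.5
        a1 = F.softmax(self.q1(x) @ self.k1(x).transpose(1, 2) / scale, dim=-1)
        a2 = F.softmax(self.q2(x) @ self.k2(x).transpose(1, 2) / scale, dim=-1)
        attn = a1 - self.lam * a2              # subtract the common-mode "noise" map
        g = torch.sigmoid(self.gates[modality])  # keep one modality from dominating
        return g * (attn @ x)

layer = DiffAttentionGate()
out = layer(torch.randn(2, 10, 128), modality=0)

The subtraction cancels attention mass that both maps assign indiscriminately, while the learned gate rescales each modality's contribution so a strong modality cannot drown out complementary cues.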

Trade-offs between structural richness and communication efficiency in music network representations

This study demonstrates that the choice of musical feature encoding fundamentally reshapes network topology and uncertainty distributions, revealing a critical trade-off: compressed single-feature representations offer high descriptive accuracy with lower model error, while richer multi-feature encodings preserve finer distinctions at the cost of a larger state space and higher model error.

Lluc Bono Rosselló, Robert Jankowski, Hugues Bersini, Marián Boguñá, M. Ángeles Serrano [q-bio]
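A toy illustration of the encoding trade-off described above: the same note stream yields a compact transition network under a pitch-only encoding and a much larger state space when pitch and duration are encoded jointly. The data and both encodings are synthetic stand-ins, not the study's corpus or features.

from collections import Counter
import random

random.seed(0)
notes = [(random.choice("CDEFGAB"), random.choice([0.25, 0.5, 1.0]))
         for _ in range(500)]                  # toy (pitch, duration) stream

def transition_network(states):
    edges = Counter(zip(states, states[1:]))   # directed, weighted transitions
    return set(states), edges

pitch_nodes, pitch_edges = transition_network([p for p, _ in notes])
full_nodes, full_edges = transition_network(notes)

print(len(pitch_nodes), len(pitch_edges))      # compressed encoding: ~7 nodes
print(len(full_nodes), len(full_edges))        # joint encoding: up to 21 nodes

With fewer states each transition is observed more often, so the compressed model's estimates are tighter; the joint encoding distinguishes more musical events but spreads the same data over many more parameters.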