cs.SD papers | Gist.Science

TimberAgent: Gram-Guided Retrieval for Executable Music Effect Control

This paper introduces TimberAgent, a Gram-guided retrieval system using Texture Resonance Retrieval (TRR) to bridge the semantic gap between user intent and low-level audio effect parameters, demonstrating superior performance in predicting editable plugin configurations through rigorous benchmarking and perceptual evaluation.

Shihao He, Yihan Xia, Fang Liu, Taotao Wang, Shengli ZhangWed, 11 Ma🤖 cs.AI

Gender Fairness in Audio Deepfake Detection: Performance and Disparity Analysis

This paper analyzes gender bias in audio deepfake detection using the ASVspoof 5 dataset and a ResNet-18 classifier, demonstrating that while aggregate metrics like Equal Error Rate may suggest low disparity, fairness-aware evaluation reveals significant gender-specific error distributions that necessitate more equitable and robust detection systems.

Aishwarya Fursule, Shruti Kshirsagar, Anderson R. AvilaWed, 11 Ma🤖 cs.AI

VoxEmo: Benchmarking Speech Emotion Recognition with Speech LLMs

The paper introduces VoxEmo, a comprehensive benchmark and toolkit for evaluating Speech Large Language Models on speech emotion recognition across 35 corpora and 15 languages, featuring a distribution-aware soft-label protocol that reveals how these models uniquely align with human subjective emotion distributions despite trailing supervised baselines in hard-label accuracy.

Hezhao Zhang, Huang-Cheng Chou, Shrikanth Narayanan, Thomas HainWed, 11 Ma🤖 cs.AI

Fish Audio S2 Technical Report

This paper introduces Fish Audio S2, an open-source text-to-speech system that leverages a multi-stage training pipeline to enable multi-speaker, multi-turn generation with natural-language instruction following, while providing production-ready weights and an efficient SGLang-based inference engine.

Shijia Liao, Yuxuan Wang, Songting Liu, Yifan Cheng, Ruoyi Zhang, Tianyu Li, Shidong Li, Yisheng Zheng, Xingwei Liu, Qingzheng Wang, Zhizhuo Zhou, Jiahua Liu, Xin Chen, Dawei HanWed, 11 Ma🤖 cs.AI

EDMFormer: Genre-Specific Self-Supervised Learning for Music Structure Segmentation

The paper introduces EDMFormer, a transformer model trained on a newly released dataset of 98 professionally annotated EDM tracks (EDM-98) to address the limitations of existing music segmentation methods by leveraging genre-specific energy, rhythm, and timbre features for improved structure detection in Electronic Dance Music.

Sahal Sajeer, Krish Patel, Oscar Chung, Joel Song BaeWed, 11 Ma🤖 cs.AI

LibriTTS-VI: A Public Corpus and Novel Methods for Efficient Voice Impression Control

This paper introduces LibriTTS-VI, the first public corpus for numerical voice impression control, and proposes novel training methods to mitigate impression leakage, achieving significantly improved controllability over existing prompt-based approaches.

Junki Ohmura, Yuki Ito, Emiru Tsunoo, Toshiyuki Sekiya, Toshiyuki KumakuraTue, 10 Ma💻 cs

SUBARU: A Practical Approach to Power Saving in Hearables Using SUB-Nyquist Audio Resolution Upsampling

The paper proposes SUBARU, a power-efficient framework for hearables that intentionally employs sub-Nyquist sampling and low bit-resolution ADCs to achieve a 3.31x reduction in power consumption while maintaining high-quality multimodal speech enhancement through a novel wideband reconstruction methodology.

Tarikul Islam Tamiti, Sajid Fardin Dipto, Luke Benjamin Baja-Ricketts, David C Vergano, Anomadarshi BaruaTue, 10 Ma💻 cs

WaLi: Can Pressure Sensors in HVAC Systems Capture Human Speech?

This paper introduces WaLi, a novel attack framework that leverages standard HVAC pressure sensors to reconstruct intelligible human speech from noisy, low-resolution data using a complex-valued conformer and Complex Global Attention Block, thereby revealing a significant, previously unaddressed privacy vulnerability in building infrastructure.

Tarikul Islam Tamiti, Biraj Joshi, Rida Hasan, Anomadarshi BaruaTue, 10 Ma💻 cs

ExpGest: Expressive Speaker Generation Using Diffusion Model and Hybrid Audio-Text Guidance

ExpGest is a novel diffusion-based framework that generates expressive, controllable full-body gestures by leveraging synchronized audio and text guidance, along with a specialized noise emotion classifier, to overcome the limitations of existing methods that often produce stiff, upper-body-only movements.

Yongkang Cheng, Mingjiang Liang, Shaoli Huang, Gaoge Han, Jifeng Ning, Wei LiuTue, 10 Ma💻 cs

Multimodal Laryngoscopic Video Analysis for Assisted Diagnosis of Vocal Fold Paralysis

This paper introduces the Multimodal Laryngoscopic Video Analyzing System (MLVAS), a novel framework that integrates audio keyword spotting with diffusion-refined video segmentation to automatically extract key metrics and features for the objective assisted diagnosis of Vocal Fold Paralysis.

Yucong Zhang, Xin Zou, Jinshan Yang, Wenjun Chen, Juan Liu, Faya Liang, Ming LiTue, 10 Ma💻 cs

Scalable Neural Vocoder from Range-Null Space Decomposition

This paper proposes RNDVoC, a scalable and lightweight neural vocoder that bridges classical range-null space decomposition with deep learning to achieve state-of-the-art performance while addressing challenges in model transparency, retraining flexibility, and parameter efficiency.

Andong Li, Tong Lei, Zhihang Sun, Rilin Chen, Xiaodong Li, Dong Yu, Chengshi ZhengTue, 10 Ma💻 cs

Disentangling Reasoning in Large Audio-Language Models for Ambiguous Emotion Prediction

This paper introduces a systematic framework for Large Audio-Language Models that reformulates ambiguous emotion recognition as a distributional reasoning problem, utilizing an ambiguity-aware objective and structured chain-of-thought supervision to significantly improve performance on standard benchmarks.

Xiaofeng Yu, Jiaheng Dong, Jean Honorio, Abhirup Ghosh, Hong Jia, Ting DangTue, 10 Ma💻 cs

Evolution Strategy-Based Calibration for Low-Bit Quantization of Speech Models

This paper introduces ESC, an Evolution Strategy-based calibration method that addresses the unique challenges of audio signal quantization by optimizing activation scaling, thereby achieving near-lossless performance for INT4 and full INT8 quantization across multiple speech tasks.

Lucas RakotoarivonyTue, 10 Ma💻 cs

Soundscapes in Spectrograms: Pioneering Multilabel Classification for South Asian Sounds

This paper proposes a novel spectrogram-based Convolutional Neural Network (CNN) approach for multilabel environmental sound classification that significantly outperforms traditional MFCC-based methods on the South Asian SAS-KIIT and UrbanSound8K datasets, offering a more robust solution for complex, overlapping acoustic environments.

Sudip Chakrabarty, Pappu Bishwas, Rajdeep Chatterjee, Tathagata Bandyopadhyay, Digonto Biswas, Bibek HowladerTue, 10 Ma💻 cs

PathBench: Speech Intelligibility Benchmark for Automatic Pathological Speech Assessment

This paper introduces PathBench, a unified benchmark for pathological speech intelligibility assessment that establishes systematic baselines across six public datasets and three evaluation protocols, while proposing the Dual-ASR Articulatory Precision (DArtP) method as a top-performing reference-free approach.

Bence Mark Halpern, Thomas Tienkamp, Defne Abur, Tomoki TodaTue, 10 Ma💻 cs

WhispEar: A Bi-directional Framework for Scaling Whispered Speech Conversion via Pseudo-Parallel Whisper Generation

This paper introduces WhispEar, a bidirectional framework that leverages a normal-to-whisper model to generate scalable pseudo-parallel data for training a whisper-to-normal conversion system, thereby overcoming data scarcity challenges and achieving superior performance on a newly released bilingual whispered-normal corpus.

Zihao Fang, Yingda Shen, Zifan Guan, Tongtong Song, Zhenyi Liu, Zhizheng WuTue, 10 Ma💻 cs

Not Like Transformers: Drop the Beat Representation for Dance Generation with Mamba-Based Diffusion Model

This paper introduces MambaDance, a novel dance generation framework that replaces Transformers with a Mamba-based diffusion model and employs a Gaussian-based beat representation to effectively capture the sequential, rhythmic, and music-synchronized nature of dance across varying sequence lengths.

Sangjune Park, Inhyeok Choi, Donghyeon Soon, Youngwoo Jeon, Kyungdon JooTue, 10 Ma💻 cs

Unsupervised Domain Adaptation for Audio Deepfake Detection with Modular Statistical Transformations

This paper presents a modular, unsupervised domain adaptation pipeline that combines Wav2Vec 2.0 embeddings with statistical transformations like CORAL alignment and feature selection to significantly improve cross-domain generalization for audio deepfake detection without requiring labeled target data.

Urawee Thani, Gagandeep Singh, Priyanka SinghTue, 10 Ma💻 cs

SoundWeaver: Semantic Warm-Starting for Text-to-Audio Diffusion Serving

SoundWeaver is a training-free, model-agnostic serving system that accelerates text-to-audio diffusion by warm-starting generation from semantically similar cached audio, achieving a 1.8–3.0× latency reduction while preserving perceptual quality.

Ayush Barik, Sofia Stoica, Nikhil Sarda, Arnav Kethana, Abhinav Khanduja, Muchen Xu, Fan LaiTue, 10 Ma💻 cs

VoiceSHIELD-Small: Real-Time Malicious Speech Detection and Transcription

VoiceSHIELD-Small is a lightweight, real-time model built on Whisper-small that simultaneously transcribes speech and detects malicious content with 99.16% accuracy, offering a faster and more secure alternative to traditional text-based filtering for voice AI systems.

Sumit Ranjan, Sugandha Sharma, Ubaid Abbas, Puneeth N AilTue, 10 Ma💻 cs

← Previous Next →