cs.SD papers | Gist.Science

Controllable Dance Generation with Style-Guided Motion Diffusion

This paper proposes Style-Guided Motion Diffusion (SGMD), a novel framework that integrates a Transformer-based architecture with a Style Modulation module and spatial-temporal masking to generate realistic, music-aligned dance sequences that are both stylistically consistent and flexibly controllable for tasks like trajectory generation, in-betweening, and inpainting.

Hongsong Wang, Ying Zhu, Xin Geng + 1 more | 2026-03-11 | ⚡ eess

Relationship between objective and subjective perceptual measures of speech in individuals with head and neck cancer

This study demonstrates that strong correlations exist between subjective perceptual ratings and objective acoustic measures in head and neck cancer patients, suggesting that a single intelligibility measure may be sufficient for clinical monitoring of speech following chemoradiation treatment.

Bence Mark Halpern, Thomas Tienkamp, Teja Rebernik + 4 more | 2026-03-10 | ⚡ eess

Wave-like behaviour in (0,1) binary sequences

This paper presents a quantum-inspired extension of the GenomeBits model that characterizes finite (0,1) binary sequences, such as genome data, by treating them as wavefunctions to reveal sound-wave-like features in their real and imaginary spectra.

E. Canessa | 2026-03-10 | 🔬 physics

ExSampling: a system for the real-time ensemble performance of field-recorded environmental sounds

The paper proposes ExSampling, an integrated system combining a recording application and a Deep Learning environment to enable the real-time ensemble performance of field-recorded environmental sounds through automated sound mapping to Ableton Live tracks.

Atsuya Kobayashi, Reo Anzai, Nao Tokui | 2026-03-10 | ⚡ eess

Building Enterprise Realtime Voice Agents from Scratch: A Technical Tutorial

This paper presents a technical tutorial demonstrating that building enterprise-grade realtime voice agents requires a cascaded streaming pipeline (STT → LLM → TTS) rather than native speech-to-speech models, achieving sub-second latency through the systematic integration of components like Deepgram, vLLM, and ElevenLabs.

Jielin Qiu, Zixiang Chen, Liangwei Yang + 11 more | 2026-03-06 | 💻 cs

Hierarchical Decoding for Discrete Speech Synthesis with Multi-Resolution Spoof Detection

The paper proposes MSpoof-TTS, a training-free inference framework that enhances zero-shot discrete speech synthesis by integrating multi-resolution token-based spoof detection into a hierarchical decoding strategy to prune artifacts and improve perceptual realism without retraining.

Junchuan Zhao, Minh Duc Vu, Ye Wang | 2026-03-06 | 💻 cs

SLICE: Speech Enhancement via Layer-wise Injection of Conditioning Embeddings

The paper proposes SLICE, a speech enhancement method that improves performance on compound real-world degradations by injecting degradation conditioning into the timestep embedding to propagate through all residual blocks, outperforming both unconditioned models and prior input-level conditioning approaches.

Seokhoon Moon, Kyudan Jung, Jaegul Choo | 2026-03-06 | 💻 cs

PolyBench: A Benchmark for Compositional Reasoning in Polyphonic Audio

This paper introduces PolyBench, a new benchmark designed to evaluate compositional reasoning in polyphonic audio across five distinct tasks, revealing that current Large Audio Language Models consistently struggle with the complexity of concurrent sound events.

Yuanjian Chen, Yang Xiao, Han Yin + 3 more | 2026-03-06 | 💻 cs

TW-Sound580K: A Regional Audio-Text Dataset with Verification-Guided Curation for Localized Audio-Language Modeling

This paper introduces TW-Sound580K, a rigorously curated Taiwanese audio-text dataset created via a Verify-Generate-Critique protocol, which significantly enhances localized audio-language modeling performance when used to train the Tai-LALM model with dynamic arbitration strategies.

Hao-Hui Xie, Ho-Lam Chung, Yi-Cheng Lin + 4 more | 2026-03-06 | 💻 cs

Training Dynamics-Aware Multi-Factor Curriculum Learning for Target Speaker Extraction

This paper proposes a training dynamics-aware multi-factor curriculum learning framework for target speaker extraction that jointly schedules multiple difficulty factors and uses the TSE-Datamap visualization tool to analyze training dynamics, enabling data-driven progressive learning that significantly improves performance in complex multi-speaker scenarios.

Yun Liu, Xuechen Liu, Xiaoxiao Miao + 1 more | 2026-03-06 | 💻 cs

The First Environmental Sound Deepfake Detection Challenge: Benchmarking Robustness, Evaluation, and Insights

This paper presents the first Environmental Sound Deepfake Detection (ESDD) challenge, detailing its task formulation, dataset, evaluation protocols, and key insights from 97 participating teams to advance robust detection methods and guide future research in this underexplored field.

Han Yin, Yang Xiao, Rohan Kumar Das + 2 more | 2026-03-06 | 💻 cs

Focus Then Listen: Exploring Plug-and-Play Audio Enhancer for Noise-Robust Large Audio Language Models

This paper proposes Focus-Then-Listen (FTL), a plug-and-play audio enhancer that improves the noise robustness of Large Audio Language Models by separating speech from non-speech sounds and generating task-adaptive enhanced signals without requiring expensive model retraining.

Han Yin, Yang Xiao, Younghoo Kwon + 2 more | 2026-03-06 | 💻 cs

SarcasmMiner: A Dual-Track Post-Training Framework for Robust Audio-Visual Sarcasm Reasoning

SarcasmMiner is a reinforcement learning-based post-training framework that employs a dual-track distillation strategy with a generative reward model and group relative policy optimization to significantly enhance robust audio-visual sarcasm reasoning and reduce hallucinations in foundation models.

Zhu Li, Yongjian Chen, Huiyuan Lai + 3 more | 2026-03-06 | 💬 cs.CL

Latent-Mark: An Audio Watermark Robust to Neural Resynthesis

Latent-Mark is a novel zero-bit audio watermarking framework that achieves robustness against neural resynthesis by embedding watermarks within a codec's invariant latent space through cross-codec optimization, ensuring both semantic resilience and perceptual imperceptibility.

Yen-Shan Chen, Shih-Yu Lai, Ying-Jung Tsou + 5 more | 2026-03-06 | 🤖 cs.AI

WavSLM: Single-Stream Speech Language Modeling via WavLM Distillation

WavSLM is a single-stream speech language model that achieves competitive speech generation and consistency without text supervision by quantizing and distilling WavLM representations into a single codebook for autoregressive next-chunk prediction.

Luca Della Libera, Cem Subakan, Mirco Ravanelli | 2026-03-06 | 🤖 cs.AI

Boosting ASR Robustness via Test-Time Reinforcement Learning with Audio-Text Semantic Rewards

This paper introduces ASR-TRA, a novel test-time reinforcement learning framework that leverages audio-text semantic rewards and causal intervention to overcome confirmation bias in existing adaptation methods, thereby significantly improving ASR robustness and accuracy in noisy and accented environments without ground-truth labels.

Linghan Fang, Tianxin Xie, Li Liu | 2026-03-06 | 🤖 cs.AI

WhisperAlign: Word-Boundary-Aware ASR and WhisperX-Anchored Pyannote Diarization for Long-Form Bengali Speech

This paper presents WhisperAlign, a solution for the DL Sprint 4.0 that combines word-boundary-aware ASR using whisper-timestamped chunking and domain-fine-tuned Pyannote diarization anchored by WhisperX to achieve high-accuracy transcription and speaker separation for long-form Bengali speech.

Aurchi Chowdhury, Rubaiyat -E-Zaman, Sk. Ashrafuzzaman Nafees | 2026-03-06 | 💻 cs

When Denoising Hinders: Revisiting Zero-Shot ASR with SAM-Audio and Whisper

This paper demonstrates that applying the SAM-Audio speech enhancement model as a preprocessing step for zero-shot ASR with Whisper consistently degrades recognition accuracy despite improving perceptual audio quality, revealing a fundamental mismatch between human-perceived signal cleanliness and machine recognition robustness.

Akif Islam, Raufun Nahar, Md. Ekramul Hamid | 2026-03-06 | 💻 cs

Temporal Pooling Strategies for Training-Free Anomalous Sound Detection with Self-Supervised Audio Embeddings

This paper addresses the underexplored role of temporal pooling in training-free anomalous sound detection by proposing and evaluating adaptive strategies, specifically Relative Deviation Pooling (RDP) and a hybrid approach, which achieve state-of-the-art performance across multiple benchmarks and outperform previously reported trained systems.

Kevin Wilkinghoff, Sarthak Yadav, Zheng-Hua Tan | 2026-03-06 | 💻 cs

VoxKnesset: A Large-Scale Longitudinal Hebrew Speech Dataset for Aging Speaker Modeling

This paper introduces VoxKnesset, a large-scale open-access dataset of 2,300 hours of longitudinal Hebrew parliamentary speech spanning 2009–2025, which is used to benchmark and demonstrate the challenges of speaker verification and age prediction over time, revealing significant performance degradation in standard models as speakers age.

Yanir Marmor, Arad Zulti, David Krongauz + 4 more | 2026-03-06 | 💻 cs
