cs.SD papers | Gist.Science

Knowing When to Quit: Probabilistic Early Exits for Speech Separation

This paper introduces a probabilistic early-exit framework for single-channel speech separation and enhancement that dynamically scales computational resources based on uncertainty-aware signal quality estimates, enabling efficient deployment on heterogeneous devices without compromising reconstruction performance.

Kenny Falkær Olsen, Mads Østergaard, Karl Ulbæk + 4 more | 2026-03-05 | cs.LG

Low-Resource Guidance for Controllable Latent Audio Diffusion

This paper introduces a low-resource, guidance-based approach using Latent-Control Heads (LatCHs) that enables efficient, fine-grained control over intensity, pitch, and beats in latent audio diffusion models by operating directly in latent space, thereby avoiding the high computational costs of decoder backpropagation while maintaining audio fidelity.

Zachary Novack, Zack Zukowski, CJ Carr + 6 more | 2026-03-05 | cs.AI

LabelBuddy: An Open Source Music and Audio Language Annotation Tagging Tool Using AI Assistance

This paper introduces LabelBuddy, an open-source collaborative tool that bridges the gap between human intent and machine understanding in Music Information Retrieval by decoupling the annotation interface from containerized AI backends to enable flexible, AI-assisted pre-tagging and multi-user consensus.

Ioannis Prokopiou, Ioannis Sina, Agisilaos Kounelis + 2 more | 2026-03-05 | cs.AI

ZeSTA: Zero-Shot TTS Augmentation with Domain-Conditioned Training for Data-Efficient Personalized Speech Synthesis

The paper proposes ZeSTA, a domain-conditioned training framework that effectively leverages zero-shot TTS synthetic data for low-resource personalized speech synthesis by distinguishing real and synthetic inputs via lightweight embeddings and real-data oversampling, thereby improving speaker similarity without compromising quality.

Youngwon Choi, Jinwoo Oh, Hwayeon Kim + 1 more | 2026-03-05 | cs.AI

ACES: Accent Subspaces for Coupling, Explanations, and Stress-Testing in Automatic Speech Recognition

The paper introduces ACES, a representation-centric audit revealing that accent information in ASR models is concentrated in a low-dimensional early-layer subspace where perturbations strongly correlate with performance degradation, yet simple linear attenuation fails to reduce disparities due to the deep entanglement of accent features with recognition-critical cues.

Swapnil Parekh | 2026-03-05 | cs.AI

CMI-RewardBench: Evaluating Music Reward Models with Compositional Multimodal Instruction

This paper addresses the evaluation gap in music generation by introducing CMI-RewardBench, a comprehensive ecosystem comprising large-scale datasets, a unified benchmark, and efficient reward models to evaluate and improve music generation under complex compositional multimodal instructions.

Yinghao Ma, Haiwen Xia, Hewei Gao + 9 more | 2026-03-05 | cs.AI

LadderSym: A Multimodal Interleaved Transformer for Music Practice Error Detection

This paper introduces LadderSym, a multimodal interleaved Transformer for music practice error detection. It pairs a two-stream encoder with inter-stream alignment and uses symbolic scores as decoder prompts, addressing the limitations of late fusion and frequency ambiguity and significantly outperforming state-of-the-art methods on benchmark datasets.

Benjamin Shiue-Hal Chou, Purvish Jajal, Nick John Eliopoulos + 4 more | 2026-03-05 | cs.AI

MeanFlowSE: one-step generative speech enhancement via conditional mean flow

MeanFlowSE is a novel one-step generative speech enhancement model that learns conditional average velocities over finite intervals to enable efficient, high-fidelity single-step inference without requiring knowledge distillation or multistep solvers.

Duojia Li, Shenghui Lu, Hongchen Pan + 3 more | 2026-03-05 | cs.AI
