ZeSTA: Zero-Shot TTS Augmentation with Domain-Conditioned Training for Data-Efficient Personalized Speech Synthesis

The paper proposes ZeSTA, a domain-conditioned training framework that uses zero-shot TTS synthetic data for low-resource personalized speech synthesis. It distinguishes real from synthetic inputs with lightweight embeddings and oversamples real data, improving speaker similarity without compromising quality.

Youngwon Choi, Jinwoo Oh, Hwayeon Kim + 1 more | 2026-03-05 | cs.AI

ACES: Accent Subspaces for Coupling, Explanations, and Stress-Testing in Automatic Speech Recognition

The paper introduces ACES, a representation-centric audit of ASR models. It finds that accent information is concentrated in a low-dimensional early-layer subspace, where perturbations strongly correlate with performance degradation. Simple linear attenuation nevertheless fails to reduce disparities, because accent features are deeply entangled with recognition-critical cues.

Swapnil Parekh | 2026-03-05 | cs.AI

LadderSym: A Multimodal Interleaved Transformer for Music Practice Error Detection

This paper introduces LadderSym, a multimodal interleaved Transformer for music practice error detection. It employs a two-stream encoder with inter-stream alignment and uses symbolic scores as decoder prompts to overcome the limitations of late fusion and frequency ambiguity, and it significantly outperforms state-of-the-art methods on benchmark datasets.

Benjamin Shiue-Hal Chou, Purvish Jajal, Nick John Eliopoulos + 4 more | 2026-03-05 | cs.AI