cs.SD papers | Gist.Science

When Denoising Hinders: Revisiting Zero-Shot ASR with SAM-Audio and Whisper

This paper demonstrates that applying the SAM-Audio speech enhancement model as a preprocessing step for zero-shot ASR with Whisper consistently degrades recognition accuracy despite improving perceptual audio quality, revealing a fundamental mismatch between human-perceived signal cleanliness and machine recognition robustness.

Akif Islam, Raufun Nahar, Md. Ekramul Hamid | 2026-03-06 | cs

Temporal Pooling Strategies for Training-Free Anomalous Sound Detection with Self-Supervised Audio Embeddings

This paper addresses the underexplored role of temporal pooling in training-free anomalous sound detection by proposing and evaluating adaptive strategies, specifically Relative Deviation Pooling (RDP) and a hybrid approach, which achieve state-of-the-art performance across multiple benchmarks and outperform previously reported trained systems.

Kevin Wilkinghoff, Sarthak Yadav, Zheng-Hua Tan | 2026-03-06 | cs

VoxKnesset: A Large-Scale Longitudinal Hebrew Speech Dataset for Aging Speaker Modeling

This paper introduces VoxKnesset, a large-scale open-access dataset of 2,300 hours of longitudinal Hebrew parliamentary speech spanning 2009–2025, which is used to benchmark and demonstrate the challenges of speaker verification and age prediction over time, revealing significant performance degradation in standard models as speakers age.

Yanir Marmor, Arad Zulti, David Krongauz + 4 more | 2026-03-06 | cs

Fine-grained Soundscape Control for Augmented Hearing

This paper introduces Aurchestra, a novel system for resource-constrained hearables that enables real-time, fine-grained control over up to five overlapping sound sources by combining a dynamic interface with an optimized on-device multi-output extraction network, effectively transforming the acoustic environment into a programmable mix.

Seunghyun Oh, Malek Itani, Aseem Gauri + 1 more | 2026-03-06 | cs

RA-QA: A Benchmarking System for Respiratory Audio Question Answering Under Real-World Heterogeneity

This paper introduces RA-QA, a comprehensive benchmarking system featuring a standardized pipeline, a large-scale dataset of 9 million diverse question-answer pairs, and a unified evaluation protocol to assess and expose the limitations of respiratory audio question-answering models under real-world heterogeneity.

Gaia A. Bertolino, Yuwei Zhang, Tong Xia + 2 more | 2026-03-06 | cs

MultiAPI Spoof: A Multi-API Dataset and Local-Attention Network for Speech Anti-spoofing Detection

This paper introduces MultiAPI Spoof, a large-scale dataset featuring 230 hours of synthetic speech from 30 diverse APIs, and proposes Nes2Net-LA, a local-attention network that achieves state-of-the-art performance in speech anti-spoofing detection and API tracing under real-world conditions.

Xueping Zhang, Zhenshan Zhang, Yechen Wang + 3 more | 2026-03-06 | cs

Multi-Loss Learning for Speech Emotion Recognition with Energy-Adaptive Mixup and Frame-Level Attention

This paper proposes a multi-loss learning framework for speech emotion recognition that integrates energy-adaptive mixup and frame-level attention to address data scarcity and emotional complexity, achieving state-of-the-art performance across four benchmark datasets.

Cong Wang, Yizhong Geng, Yuhua Wen + 7 more | 2026-03-06 | cs

Schrödinger Bridge Mamba for One-Step Speech Enhancement

The paper introduces Schrödinger Bridge Mamba (SBM), a one-step speech enhancement model that combines the Schrödinger Bridge training paradigm with the Mamba architecture to achieve superior denoising and dereverberation performance with real-time feasibility, outperforming strong generative and discriminative baselines.

Jing Yang, Sirui Wang, Chao Wu + 2 more | 2026-03-06 | cs

Noise-to-Notes: Diffusion-based Generation and Refinement for Automatic Drum Transcription

This paper introduces Noise-to-Notes (N2N), a state-of-the-art diffusion-based framework that redefines automatic drum transcription as a conditional generative task, utilizing an Annealed Pseudo-Huber loss for joint optimization and music foundation model features to achieve superior robustness and performance across multiple benchmarks.

Michael Yeung, Keisuke Toyama, Toya Teramoto + 2 more | 2026-03-06 | cs

SAM: A Mamba-2 State-Space Audio-Language Model

The paper introduces SAM, a State-space Audio-language Model leveraging a Mamba-2 backbone that achieves competitive performance with fewer parameters than larger transformer models while establishing key design principles regarding joint encoder finetuning, optimal token representation, and instruction-following supervision.

Taehan Lee, Jaehan Jung, Hyukjun Lee | 2026-03-06 | cs

BabyHuBERT: Multilingual Self-Supervised Learning for Segmenting Speakers in Child-Centered Long-Form Recordings

The paper introduces BabyHuBERT, a multilingual self-supervised speech model trained on 13,000 hours of child-centered recordings that significantly outperforms existing adult-focused models in segmenting speakers within diverse, naturalistic child language datasets.

Théo Charlot, Tarek Kunze, Maxime Poli + 3 more | 2026-03-06 | cs

TSPC: A Two-Stage Phoneme-Centric Architecture for code-switching Vietnamese-English Speech Recognition

This paper proposes TSPC, a novel two-stage phoneme-centric architecture that leverages an extended Vietnamese phoneme set as an intermediate representation to significantly improve Vietnamese-English code-switching speech recognition accuracy while maintaining computational efficiency.

Tran Nguyen Anh, Truong Dinh Dung, Vo Van Nam + 1 more | 2026-03-06 | cs

Vevo2: A Unified and Controllable Framework for Speech and Singing Voice Generation

Vevo2 is a unified framework for controllable speech and singing voice generation that employs novel audio tokenizers and a two-stage modeling approach with specialized training strategies to achieve flexible control over content, prosody, style, and timbre while demonstrating strong generalization across diverse synthesis tasks.

Xueyao Zhang, Junan Zhang, Yuancheng Wang + 5 more | 2026-03-06 | cs

InterActHuman: Multi-Concept Human Animation with Layout-Aligned Audio Conditions

InterActHuman is a novel framework that enables high-quality multi-concept human animation by enforcing strong, region-specific binding of text, image, and audio conditions to individual identities, thereby overcoming the limitations of global-conditioning methods in scenarios involving complex human-human and human-object interactions.

Zhenzhi Wang, Jiaqi Yang, Jianwen Jiang + 7 more | 2026-03-06 | cs

A Large-Scale Probing Analysis of Speaker-Specific Attributes in Self-Supervised Speech Representations

This study conducts a large-scale probing analysis of 11 self-supervised speech models to reveal a hierarchical encoding of speaker attributes, challenging the assumption that final layers are purely linguistic by showing that larger models recover speaker identity in deep layers while intermediate representations better capture dynamic prosody than specialized embeddings.

Aemon Yat Fei Chiu, Kei Ching Fung, Roger Tsz Yeung Li + 2 more | 2026-03-06 | cs

FastWave: Optimized Diffusion Model for Audio Super-Resolution

The paper introduces FastWave, a lightweight and computationally efficient diffusion-based model for audio super-resolution to 48 kHz that achieves state-of-the-art performance with significantly lower resource requirements and faster training than existing heavily parameterized diffusion and flow models.

Nikita Kuznetsov, Maksim Kaledin | 2026-03-05 | cs.LG

Multi-Stage Music Source Restoration with BandSplit-RoFormer Separation and HiFi++ GAN

This paper presents the CP-JKU team's two-stage system for the ICASSP 2025 Music Source Restoration Challenge, which combines a curriculum-trained BandSplit-RoFormer model for separating eight stems and a specialized HiFi++ GAN for restoring instrument-specific waveforms from mastered audio.

Tobias Morocutti, Emmanouil Karystinaios, Jonathan Greif + 1 more | 2026-03-05 | cs.LG

Automated Measurement of Geniohyoid Muscle Thickness During Speech Using Deep Learning and Ultrasound

This paper introduces SMMA, a fully automated deep learning framework that accurately measures geniohyoid muscle thickness during speech, enabling scalable analysis of speech motor control and objective assessment of related disorders by eliminating the need for time-consuming manual annotation.

Alisher Myrgyyassov, Bruce Xiao Wang, Yu Sun + 4 more | 2026-03-05 | cs.LG

OASI: Objective-Aware Surrogate Initialization for Multi-Objective Bayesian Optimization in TinyML Keyword Spotting

This paper proposes Objective-Aware Surrogate Initialization (OASI), a method that seeds multi-objective Bayesian optimization with Pareto-biased solutions to efficiently discover memory-feasible, high-accuracy keyword spotting models for resource-constrained TinyML hardware.

Soumen Garai, Danilo Pau, Suman Samui | 2026-03-05 | cs.LG

Better audio representations are more brain-like: linking model-brain alignment with performance in downstream auditory tasks

This study demonstrates that recent self-supervised audio models with superior performance on diverse downstream tasks exhibit stronger alignment with human auditory cortex activity, suggesting that brain-like representations emerge naturally as a byproduct of learning to reconstruct naturalistic audio data.

Leonardo Pepino, Pablo Riera, Juan Kamienkowski + 1 more | 2026-03-05 | cs.LG
