eess.AS papers | Gist.Science

PolyBench: A Benchmark for Compositional Reasoning in Polyphonic Audio

This paper introduces PolyBench, a new benchmark designed to evaluate compositional reasoning in polyphonic audio across five distinct tasks, revealing that current Large Audio Language Models consistently struggle with the complexity of concurrent sound events.

Yuanjian Chen, Yang Xiao, Han Yin + 3 more | 2026-03-06 | cs

Exploring the potential and limitations of Model Merging for Multi-Domain Adaptation in ASR

This paper investigates model merging as a scalable alternative to full fine-tuning for multi-domain ASR, benchmarking 11 algorithms across 10 European Portuguese domains and introducing a novel "BoostedTSV-M" method that outperforms full fine-tuning while preserving out-of-distribution generalization.

Carlos Carvalho, Francisco Teixeira, Thomas Rolland + 1 more | 2026-03-06 | cs.CL

Visual-Informed Speech Enhancement Using Attention-Based Beamforming

This paper proposes Visual-Informed Neural Beamforming Network (VI-NBFNet), an end-to-end audiovisual framework that integrates lip movement features extracted from a pretrained visual model with microphone array signals via an attention mechanism to significantly enhance speech quality and robustness in challenging, dynamic, and low-SNR environments.

Chihyun Liu, Jiaxuan Fan, Mingtung Sun + 3 more | 2026-03-06 | cs.AI

An Approach to Simultaneous Acquisition of Real-Time MRI Video, EEG, and Surface EMG for Articulatory, Brain, and Muscle Activity During Speech Production

This paper presents a novel framework for the simultaneous acquisition of real-time MRI, EEG, and surface EMG to capture brain, muscle, and articulatory activity during speech, featuring a specialized artifact suppression pipeline to overcome technical challenges and enable unprecedented insights into speech neuroscience.

Jihwan Lee, Parsa Razmara, Kevin Huang + 16 more | 2026-03-06 | cs.AI

Temporal Pooling Strategies for Training-Free Anomalous Sound Detection with Self-Supervised Audio Embeddings

This paper addresses the underexplored role of temporal pooling in training-free anomalous sound detection by proposing and evaluating adaptive strategies, specifically Relative Deviation Pooling (RDP) and a hybrid approach, which achieve state-of-the-art performance across multiple benchmarks and outperform previously reported trained systems.

Kevin Wilkinghoff, Sarthak Yadav, Zheng-Hua Tan | 2026-03-06 | cs

VoxKnesset: A Large-Scale Longitudinal Hebrew Speech Dataset for Aging Speaker Modeling

This paper introduces VoxKnesset, a large-scale open-access dataset of 2,300 hours of longitudinal Hebrew parliamentary speech spanning 2009–2025. The authors use it to benchmark speaker verification and age prediction over time, and find that standard models degrade significantly as speakers age.

Yanir Marmor, Arad Zulti, David Krongauz + 4 more | 2026-03-06 | cs

Fine-grained Soundscape Control for Augmented Hearing

This paper introduces Aurchestra, a novel system for resource-constrained hearables that enables real-time, fine-grained control over up to five overlapping sound sources by combining a dynamic interface with an optimized on-device multi-output extraction network, effectively transforming the acoustic environment into a programmable mix.

Seunghyun Oh, Malek Itani, Aseem Gauri + 1 more | 2026-03-06 | cs

RA-QA: A Benchmarking System for Respiratory Audio Question Answering Under Real-World Heterogeneity

This paper introduces RA-QA, a comprehensive benchmarking system featuring a standardized pipeline, a large-scale dataset of 9 million diverse question-answer pairs, and a unified evaluation protocol to assess and expose the limitations of respiratory audio question-answering models under real-world heterogeneity.

Gaia A. Bertolino, Yuwei Zhang, Tong Xia + 2 more | 2026-03-06 | cs

Multi-Loss Learning for Speech Emotion Recognition with Energy-Adaptive Mixup and Frame-Level Attention

This paper proposes a multi-loss learning framework for speech emotion recognition that integrates energy-adaptive mixup and frame-level attention to address data scarcity and emotional complexity, achieving state-of-the-art performance across four benchmark datasets.

Cong Wang, Yizhong Geng, Yuhua Wen + 7 more | 2026-03-06 | cs

Schrödinger Bridge Mamba for One-Step Speech Enhancement

The paper introduces Schrödinger Bridge Mamba (SBM), a novel one-step speech enhancement model that synergizes the Schrödinger Bridge training paradigm with the Mamba architecture to achieve superior denoising and dereverberation performance with real-time feasibility, outperforming strong generative and discriminative baselines.

Jing Yang, Sirui Wang, Chao Wu + 2 more | 2026-03-06 | cs

Noise-to-Notes: Diffusion-based Generation and Refinement for Automatic Drum Transcription

This paper introduces Noise-to-Notes (N2N), a state-of-the-art diffusion-based framework that redefines automatic drum transcription as a conditional generative task, utilizing an Annealed Pseudo-Huber loss for joint optimization and music foundation model features to achieve superior robustness and performance across multiple benchmarks.

Michael Yeung, Keisuke Toyama, Toya Teramoto + 2 more | 2026-03-06 | cs

Conversational Speech Reveals Structural Robustness Failures in SpeechLLM Backbones

This paper shows that SpeechLLM backbones struggle with conversational disfluencies because they favor semantic abstraction over structural fidelity. Performance varies by architecture, and fine-tuning often compromises generalization even when it achieves state-of-the-art results.

Maria Teleki, Sai Janjur, Haoran Liu + 11 more | 2026-03-06 | cs

SAM: A Mamba-2 State-Space Audio-Language Model

The paper introduces SAM, a State-space Audio-language Model leveraging a Mamba-2 backbone that achieves competitive performance with fewer parameters than larger transformer models while establishing key design principles regarding joint encoder finetuning, optimal token representation, and instruction-following supervision.

Taehan Lee, Jaehan Jung, Hyukjun Lee | 2026-03-06 | cs

BabyHuBERT: Multilingual Self-Supervised Learning for Segmenting Speakers in Child-Centered Long-Form Recordings

The paper introduces BabyHuBERT, a multilingual self-supervised speech model trained on 13,000 hours of child-centered recordings that significantly outperforms existing adult-focused models in segmenting speakers within diverse, naturalistic child language datasets.

Théo Charlot, Tarek Kunze, Maxime Poli + 3 more | 2026-03-06 | cs

TSPC: A Two-Stage Phoneme-Centric Architecture for code-switching Vietnamese-English Speech Recognition

This paper proposes TSPC, a novel two-stage phoneme-centric architecture that leverages an extended Vietnamese phoneme set as an intermediate representation to significantly improve Vietnamese-English code-switching speech recognition accuracy while maintaining computational efficiency.

Tran Nguyen Anh, Truong Dinh Dung, Vo Van Nam + 1 more | 2026-03-06 | cs

A Large-Scale Probing Analysis of Speaker-Specific Attributes in Self-Supervised Speech Representations

This study conducts a large-scale probing analysis of 11 self-supervised speech models to reveal a hierarchical encoding of speaker attributes, challenging the assumption that final layers are purely linguistic by showing that larger models recover speaker identity in deep layers while intermediate representations better capture dynamic prosody than specialized embeddings.

Aemon Yat Fei Chiu, Kei Ching Fung, Roger Tsz Yeung Li + 2 more | 2026-03-06 | cs

Multi-Stage Music Source Restoration with BandSplit-RoFormer Separation and HiFi++ GAN

This paper presents the CP-JKU team's two-stage system for the ICASSP 2025 Music Source Restoration Challenge, which combines a curriculum-trained BandSplit-RoFormer model for separating eight stems and a specialized HiFi++ GAN for restoring instrument-specific waveforms from mastered audio.

Tobias Morocutti, Emmanouil Karystinaios, Jonathan Greif + 1 more | 2026-03-05 | cs.LG

Automated Measurement of Geniohyoid Muscle Thickness During Speech Using Deep Learning and Ultrasound

This paper introduces SMMA, a fully automated deep learning framework that accurately measures geniohyoid muscle thickness during speech, enabling scalable analysis of speech motor control and objective assessment of related disorders by eliminating the need for time-consuming manual annotation.

Alisher Myrgyyassov, Bruce Xiao Wang, Yu Sun + 4 more | 2026-03-05 | cs.LG

Knowing When to Quit: Probabilistic Early Exits for Speech Separation

This paper introduces a probabilistic early-exit framework for single-channel speech separation and enhancement that dynamically scales computational resources based on uncertainty-aware signal quality estimates, enabling efficient deployment on heterogeneous devices without compromising reconstruction performance.

Kenny Falkær Olsen, Mads Østergaard, Karl Ulbæk + 4 more | 2026-03-05 | cs.LG

ZeSTA: Zero-Shot TTS Augmentation with Domain-Conditioned Training for Data-Efficient Personalized Speech Synthesis

The paper proposes ZeSTA, a domain-conditioned training framework that effectively leverages zero-shot TTS synthetic data for low-resource personalized speech synthesis by distinguishing real and synthetic inputs via lightweight embeddings and real-data oversampling, thereby improving speaker similarity without compromising quality.

Youngwon Choi, Jinwoo Oh, Hwayeon Kim + 1 more | 2026-03-05 | cs.AI
