eess.AS papers | Gist.Science

Evaluating Parkinson's Disease Detection in Anonymized Speech: A Performance and Acoustic Analysis

This paper evaluates the trade-off between privacy and Parkinson's disease detection in anonymized speech, demonstrating that while STT-TTS anonymization severely degrades diagnostic performance by erasing prosodic cues, kNN-VC effectively preserves macro-prosodic features to maintain high detection accuracy with only a minor performance drop.

Carlos Franzreb, Francisco Teixeira, Ben Luks, Sebastian Möller, Alberto AbadTue, 10 Ma💻 cs

Seeing the Context: Rich Visual Context-Aware Speech Recognition via Multimodal Reasoning

This paper introduces VASR, a multimodal reasoning framework for Context-Aware Visual Speech Recognition (CAVSR) that leverages an Audio-Visual Chain-of-Thought (AV-CoT) to explicitly ground acoustic signals with rich visual context like scenes and on-screen text, thereby overcoming single-modality dominance and achieving state-of-the-art performance.

Wenjie Tian, Mingchen Shao, Bingshen Mu, Xuelong Geng, Chengyou Wang, Yujie Liao, Zhixian Zhao, Ziyu Zhang, Jingbin Hu, Mengqi Wei, Lei XieTue, 10 Ma💻 cs

Toward Multimodal Industrial Fault Analysis: A Single-Speed Chain Conveyor Dataset with Audio and Vibration Signals

This paper introduces a comprehensive multimodal dataset comprising audio and vibration signals from a single-speed chain conveyor system, designed to benchmark robust industrial fault detection and classification under diverse operating conditions and noise levels through standardized evaluation protocols and baseline models.

Zhang Chen, Yucong Zhang, Xiaoxiao Miao, Ming LiTue, 10 Ma💻 cs

Multi-Domain Audio Question Answering Benchmark Toward Acoustic Content Reasoning

This paper introduces Task 5 of the DCASE 2025 Challenge, a multi-domain Audio Question Answering benchmark designed to evaluate and advance the acoustic reasoning capabilities of audio-language models across diverse scenarios including bioacoustics, temporal soundscapes, and complex real-world clips.

Chao-Han Huck Yang, Sreyan Ghosh, Qing Wang, Jaeyeon Kim, Hengyi Hong, Sonal Kumar, Guirui Zhong, Zhifeng Kong, S Sakshi, Vaibhavi Lokegaonkar, Oriol Nieto, Ramani Duraiswami, Dinesh Manocha, Gunhee Kim, Jun Du, Rafael Valle, Bryan CatanzaroTue, 10 Ma💬 cs.CL

Bootstrapping Audiovisual Speech Recognition in Zero-AV-Resource Scenarios with Synthetic Visual Data

This paper proposes a zero-AV-resource framework for audiovisual speech recognition that generates synthetic talking-head videos by lip-syncing static facial images with real audio, successfully enabling high-performance model training for under-resourced languages like Catalan without the need for labeled video corpora.

Pol Buitrago, Pol Gàlvez, Oriol Pareras, Javier HernandoTue, 10 Ma💬 cs.CL

Quantifying Cross-Lingual Transfer in Paralinguistic Speech Tasks

This paper introduces the Cross-Lingual Transfer Matrix (CLTM) to systematically quantify language-dependent performance variations in paralinguistic tasks like gender identification and speaker verification, revealing that despite their acoustic nature, these tasks exhibit distinct cross-lingual transfer patterns when using multilingual HuBERT-based encoders.

Pol Buitrago, Oriol Pareras, Federico Costa, Javier HernandoTue, 10 Ma💬 cs.CL

DualTurn: Learning Turn-Taking from Dual-Channel Generative Speech Pretraining

DualTurn is a dual-channel generative speech model that learns natural turn-taking dynamics through unsupervised pretraining on conversational audio and fine-tuning to predict agent actions, outperforming existing methods in both action prediction accuracy and turn-boundary anticipation while enabling tool-calling capabilities.

Shangeth RajaaTue, 10 Ma💬 cs.CL

Computational modeling of early language learning from acoustic speech and audiovisual input without linguistic priors

This chapter reviews recent computational models demonstrating that self-supervised and visually grounded learning principles can effectively explain early language acquisition from acoustic and audiovisual speech without relying on strong linguistic priors.

Okko RäsänenTue, 10 Ma💬 cs.CL

Scaling Self-Supervised Speech Models Uncovers Deep Linguistic Relationships: Evidence from the Pacific Cluster

This paper demonstrates that scaling self-supervised speech models from 1,000 to 4,000 languages triggers a non-linear shift that enables the discovery of deep genealogical relationships and complex contact patterns, exemplified by the emergence of a robust Pacific macro-cluster driven by shared acoustic signatures.

Minu Kim, Hoirin Kim, David R. MortensenTue, 10 Ma💬 cs.CL

TCG CREST System Description for the DISPLACE-M Challenge

The TCG CREST system achieved a sixth-place ranking in the DISPLACE-M challenge's speaker diarization track by demonstrating that a hybrid end-to-end Diarizen framework with WavLM embeddings and optimized agglomerative hierarchical clustering significantly outperformed a SpeechBrain baseline, reducing the diarization error rate to 9.21% on the evaluation set.

Nikhil Raghav, Md SahidullahTue, 10 Ma🤖 cs.LG

LongAudio-RAG: Event-Grounded Question Answering over Multi-Hour Long Audio

LongAudio-RAG is a hybrid edge-cloud framework that enables precise, low-hallucination question answering over multi-hour audio streams by converting recordings into timestamped event records for SQL-based retrieval, which then grounds Large Language Model responses in structured evidence rather than raw audio.

Naveen Vakada, Kartik Hegde, Arvind Krishna Sridhar, Yinyi Guo, Erik VisserTue, 10 Ma🤖 cs.LG

ECHO: Frequency-aware Hierarchical Encoding for Variable-length Signals

The paper introduces ECHO, a novel foundation model that leverages band-split architecture and frequency positional embeddings to achieve state-of-the-art performance in anomaly detection and fault classification across variable-length, arbitrary sampling rate machine signals without requiring padding or cropping.

Yucong Zhang, Juan Liu, Ming LiTue, 10 Ma🤖 cs.LG

BemaGANv2: Discriminator Combination Strategies for GAN-based Vocoders in Long-Term Audio Generation

BemaGANv2 is an advanced GAN-based vocoder that enhances long-term audio generation for Text-to-Music and Text-to-Audio applications by integrating Anti-aliased Multi-Periodicity composition modules in the generator and systematically evaluating novel discriminator combination strategies, including the Multi-Envelope Discriminator, to achieve high-fidelity and temporally coherent results.

Taesoo Park, Mungwi Jeong, Mingyu Park, Narae Kim, Junyoung Kim, Mujung Kim, Jisang Yoo, Hoyun Lee, Sanghoon Kim, Soonchul KwonTue, 10 Ma🤖 cs.LG

Benchmarking Language Modeling for Lossless Compression of Full-Fidelity Audio

This paper introduces Trilobyte, a byte-level tokenization schema that enables tractable lossless compression of full-fidelity (up to 24-bit) audio using autoregressive language models, demonstrating that while these models outperform FLAC at lower bit depths, their compression gains diminish as bit depth increases.

Phillip Long, Zachary Novack, Chris DonahueTue, 10 Ma🤖 cs.LG

Foley-Flow: Coordinated Video-to-Audio Generation with Masked Audio-Visual Alignment and Dynamic Conditional Flows

FoleyFlow introduces a novel video-to-audio generation framework that achieves superior semantic and rhythmic synchronization by aligning unimodal encoders through masked audio-visual modeling and employing a dynamic conditional flow that utilizes temporally varying video features to guide audio synthesis.

Shentong Mo, Yibing SongTue, 10 Ma🤖 cs.LG

Analysis-Driven Procedural Generation of an Engine Sound Dataset with Embedded Control Annotations

This paper introduces an analysis-driven framework that generates a publicly available, 19-hour procedural engine sound dataset with sample-accurate RPM and torque annotations by extracting harmonic structures from real recordings to drive a parametric synthesizer, thereby addressing the scarcity of clean, standardized audio data for automotive sound design and machine learning applications.

Robin Doerfler, Lonce WyseTue, 10 Ma🤖 cs.LG

Towards Lightweight Adaptation of Speech Enhancement Models in Real-World Environments

This paper proposes a lightweight, self-supervised framework that augments a frozen speech enhancement backbone with low-rank adapters, enabling efficient on-device adaptation to dynamic real-world noise conditions by updating fewer than 1% of parameters while achieving significant signal quality improvements.

Longbiao Cheng, Shih-Chii LiuTue, 10 Ma🤖 cs.LG

Fast and Flexible Audio Bandwidth Extension via Vocos

This paper proposes a fast and flexible Vocos-based bandwidth extension model that generates missing high-frequency audio content up to 48 kHz using a single neural network and a lightweight refiner, achieving competitive quality with extreme real-time throughput on both GPU and CPU hardware.

Yatharth SharmaTue, 10 Ma🤖 cs.LG

Towards Objective Gastrointestinal Auscultation: Automated Segmentation and Annotation of Bowel Sound Patterns

This study presents an automated pipeline using a wearable SonicGuard sensor and a pretrained Audio Spectrogram Transformer to accurately segment and classify bowel sounds, significantly reducing manual labeling time while providing clinicians with an objective, quantitative tool for assessing gastrointestinal function.

Zahra Mansour, Verena Uslar, Dirk Weyhe, Danilo Hollosi, Nils StrodthoffTue, 10 Ma🤖 cs.LG

Are Deep Speech Denoising Models Robust to Adversarial Noise?

This paper demonstrates that four recent deep speech denoising models are vulnerable to psychoacoustically hidden adversarial noise, which can render their output unintelligible while remaining imperceptible to human listeners, thereby highlighting critical safety concerns for their deployment in high-stakes applications.

Will Schwarzer, Neel Chaudhari, Philip S. Thomas, Andrea Fanelli, Xiaoyu LiuThu, 12 Ma⚡ eess

← Previous Next →