eess.AS papers | Gist.Science

Acoustic and Semantic Modeling of Emotion in Spoken Language

This thesis advances emotion understanding and synthesis in spoken language by proposing methods to jointly model acoustic and semantic information through pre-training strategies, hierarchical conversational architectures, and a textless speech-to-speech framework for controllable emotion style transfer.

Soumya DuttaWed, 11 Ma⚡ eess

SPAR-K: Scheduled Periodic Alternating Early Exit for Spoken Language Models

The paper proposes SPAR-K, a modality-aware early exit framework that accelerates interleaved spoken language model inference by employing a scheduled alternating-depth strategy for speech tokens, achieving significant reductions in decoding depth while preserving question-answering accuracy and perceptual quality without auxiliary overhead.

Hsiao-Ying Huang, Cheng-Han Chiang, Hung-yi LeeWed, 11 Ma💬 cs.CL

Emotion-Aware Prefix: Towards Explicit Emotion Control in Voice Conversion Models

This paper proposes an Emotion-Aware Prefix for a two-stage voice conversion backbone that significantly improves emotion conversion accuracy from 42.40% to 85.50% while preserving speaker identity, linguistic integrity, and speech quality through joint control of sequence modulation and acoustic realization.

Haoyuan Yang, Mu Yang, Jiamin Xie, Szu-Jui Chen, John H. L. HansenWed, 11 Ma⚡ eess

Trade-offs Between Capacity and Robustness in Neural Audio Codecs for Adversarially Robust Speech Recognition

This paper demonstrates that neural audio codecs achieve optimal adversarial robustness in speech recognition at intermediate residual vector quantization depths, which effectively balance the suppression of adversarial perturbations with the preservation of speech content, outperforming traditional compression defenses.

Jordan Prescott, Thanathai Lertpetchpun, Shrikanth NarayananWed, 11 Ma⚡ eess

Universal Speech Content Factorization

The paper proposes Universal Speech Content Factorization (USCF), a simple and invertible linear method that extracts low-rank, speaker-independent speech representations to enable competitive zero-shot voice conversion and efficient training of timbre-prompted text-to-speech models using minimal target speaker data.

Henry Li Xinyuan, Zexin Cai, Lin Zhang, Leibny Paola García-Perera, Berrak Sisman, Sanjeev Khudanpur, Nicholas Andrews, Matthew WiesnerWed, 11 Ma⚡ eess

Bottleneck Transformer-Based Approach for Improved Automatic STOI Score Prediction

This paper proposes a novel bottleneck transformer architecture that integrates convolutional blocks for frame-level feature extraction and multi-head self-attention for information aggregation to achieve improved non-intrusive prediction of the Short-Time Objective Intelligibility (STOI) metric, outperforming state-of-the-art self-supervised learning models in both seen and unseen scenarios.

Amartyaveer, Murali Kadambi, Chandra Mohan Sharma, Anupam Mondal, Prasanta Kumar GhoshWed, 11 Ma🤖 cs.LG

Latent Speech-Text Transformer

The Latent Speech-Text Transformer (LST) improves the efficiency and performance of auto-regressive speech-text models by aggregating speech tokens into latent patches, which aligns sequence granularity with text, reduces computational costs, and achieves significant accuracy gains across speech and text benchmarks.

Yen-Ju Lu, Yashesh Gaur, Wei Zhou, Benjamin Muller, Jesus Villalba, Najim Dehak, Luke Zettlemoyer, Gargi Ghosh, Mike Lewis, Srinivasan Iyer, Duc LeWed, 11 Ma🤖 cs.AI

VoiceBridge: General Speech Restoration with One-step Latent Bridge Models

VoiceBridge is a novel one-step latent bridge model that leverages an energy-preserving variational autoencoder and a joint neural prior to efficiently reconstruct high-quality 48 kHz fullband speech from diverse distortions across various in-domain and out-of-domain general speech restoration tasks without requiring distillation.

Chi Zhang, Kaiwen Zheng, Zehua Chen, Jun ZhuWed, 11 Ma🤖 cs.AI

VSSFlow: Unifying Video-conditioned Sound and Speech Generation via Joint Learning

VSSFlow introduces a unified flow-matching framework that seamlessly integrates Video-to-Sound and Visual Text-to-Speech generation through a disentangled condition aggregation mechanism, demonstrating that joint learning can surpass specialized state-of-the-art baselines without performance degradation.

Xin Cheng, Yuyue Wang, Xihua Wang, Yihan Wu, Kaisi Guan, Yijing Chen, Peng Zhang, Xiaojiang Liu, Meng Cao, Ruihua SongWed, 11 Ma🤖 cs.AI

MUGEN: Evaluating and Improving Multi-audio Understanding of Large Audio-Language Models

This paper introduces MUGEN, a comprehensive benchmark revealing that Large Audio-Language Models struggle with multi-audio understanding as input scaling increases, and demonstrates that combining training-free strategies like Audio-Permutational Self-Consistency with Chain-of-Thought can significantly improve performance.

Chih-Kai Yang, Yun-Shao Tsai, Yu-Kai Guo, Ping-Le Tsai, Yen-Ting Piao, Hung-Wei Chen, Ting-Lin Hsiao, Yun-Man Hsu, Ke-Han Lu, Hung-yi LeeWed, 11 Ma🤖 cs.AI

Physics-Informed Neural Engine Sound Modeling with Differentiable Pulse-Train Synthesis

This paper introduces the Pulse-Train-Resonator (PTR), a differentiable synthesis model that improves engine sound generation by directly modeling sequential exhaust pressure pulses and physical resonances rather than approximating spectral characteristics, achieving superior reconstruction accuracy and interpretability across diverse engine types.

Robin Doerfler, Lonce WyseWed, 11 Ma🤖 cs.AI

VoxEmo: Benchmarking Speech Emotion Recognition with Speech LLMs

The paper introduces VoxEmo, a comprehensive benchmark and toolkit for evaluating Speech Large Language Models on speech emotion recognition across 35 corpora and 15 languages, featuring a distribution-aware soft-label protocol that reveals how these models uniquely align with human subjective emotion distributions despite trailing supervised baselines in hard-label accuracy.

Hezhao Zhang, Huang-Cheng Chou, Shrikanth Narayanan, Thomas HainWed, 11 Ma🤖 cs.AI

LibriTTS-VI: A Public Corpus and Novel Methods for Efficient Voice Impression Control

This paper introduces LibriTTS-VI, the first public corpus for numerical voice impression control, and proposes novel training methods to mitigate impression leakage, achieving significantly improved controllability over existing prompt-based approaches.

Junki Ohmura, Yuki Ito, Emiru Tsunoo, Toshiyuki Sekiya, Toshiyuki KumakuraTue, 10 Ma💻 cs

SUBARU: A Practical Approach to Power Saving in Hearables Using SUB-Nyquist Audio Resolution Upsampling

The paper proposes SUBARU, a power-efficient framework for hearables that intentionally employs sub-Nyquist sampling and low bit-resolution ADCs to achieve a 3.31x reduction in power consumption while maintaining high-quality multimodal speech enhancement through a novel wideband reconstruction methodology.

Tarikul Islam Tamiti, Sajid Fardin Dipto, Luke Benjamin Baja-Ricketts, David C Vergano, Anomadarshi BaruaTue, 10 Ma💻 cs

WaLi: Can Pressure Sensors in HVAC Systems Capture Human Speech?

This paper introduces WaLi, a novel attack framework that leverages standard HVAC pressure sensors to reconstruct intelligible human speech from noisy, low-resolution data using a complex-valued conformer and Complex Global Attention Block, thereby revealing a significant, previously unaddressed privacy vulnerability in building infrastructure.

Tarikul Islam Tamiti, Biraj Joshi, Rida Hasan, Anomadarshi BaruaTue, 10 Ma💻 cs

ExpGest: Expressive Speaker Generation Using Diffusion Model and Hybrid Audio-Text Guidance

ExpGest is a novel diffusion-based framework that generates expressive, controllable full-body gestures by leveraging synchronized audio and text guidance, along with a specialized noise emotion classifier, to overcome the limitations of existing methods that often produce stiff, upper-body-only movements.

Yongkang Cheng, Mingjiang Liang, Shaoli Huang, Gaoge Han, Jifeng Ning, Wei LiuTue, 10 Ma💻 cs

Multimodal Laryngoscopic Video Analysis for Assisted Diagnosis of Vocal Fold Paralysis

This paper introduces the Multimodal Laryngoscopic Video Analyzing System (MLVAS), a novel framework that integrates audio keyword spotting with diffusion-refined video segmentation to automatically extract key metrics and features for the objective assisted diagnosis of Vocal Fold Paralysis.

Yucong Zhang, Xin Zou, Jinshan Yang, Wenjun Chen, Juan Liu, Faya Liang, Ming LiTue, 10 Ma💻 cs

Disentangling Reasoning in Large Audio-Language Models for Ambiguous Emotion Prediction

This paper introduces a systematic framework for Large Audio-Language Models that reformulates ambiguous emotion recognition as a distributional reasoning problem, utilizing an ambiguity-aware objective and structured chain-of-thought supervision to significantly improve performance on standard benchmarks.

Xiaofeng Yu, Jiaheng Dong, Jean Honorio, Abhirup Ghosh, Hong Jia, Ting DangTue, 10 Ma💻 cs

WhispEar: A Bi-directional Framework for Scaling Whispered Speech Conversion via Pseudo-Parallel Whisper Generation

This paper introduces WhispEar, a bidirectional framework that leverages a normal-to-whisper model to generate scalable pseudo-parallel data for training a whisper-to-normal conversion system, thereby overcoming data scarcity challenges and achieving superior performance on a newly released bilingual whispered-normal corpus.

Zihao Fang, Yingda Shen, Zifan Guan, Tongtong Song, Zhenyi Liu, Zhizheng WuTue, 10 Ma💻 cs

SoundWeaver: Semantic Warm-Starting for Text-to-Audio Diffusion Serving

SoundWeaver is a training-free, model-agnostic serving system that accelerates text-to-audio diffusion by warm-starting generation from semantically similar cached audio, achieving a 1.8–3.0× latency reduction while preserving perceptual quality.

Ayush Barik, Sofia Stoica, Nikhil Sarda, Arnav Kethana, Abhinav Khanduja, Muchen Xu, Fan LaiTue, 10 Ma💻 cs

← Previous Next →