When Denoising Hinders: Revisiting Zero-Shot ASR with SAM-Audio and Whisper
This paper shows that using the SAM-Audio speech enhancement model as a preprocessing step for zero-shot ASR with Whisper consistently degrades recognition accuracy, even though it improves perceptual audio quality. The result points to a fundamental mismatch between the signal cleanliness that humans perceive and the robustness of machine recognition.
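To make the comparison concrete, the following is a minimal sketch of how a with-and-without-enhancement evaluation could be scored. It assumes recognition accuracy is measured as word error rate (WER), which the summary above does not state explicitly. The function names `word_error_rate` and `wer_delta` are hypothetical, and the transcripts are expected to come from whatever ASR pipeline is under test; no SAM-Audio or Whisper API is invoked here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference prefix processed
    # so far and the first j words of the hypothesis.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1] / len(ref)


def wer_delta(reference: str, hyp_raw: str, hyp_enhanced: str) -> float:
    """WER on enhanced audio minus WER on raw audio (hypothetical helper).

    A positive value means the enhancement step made recognition worse.
    """
    return word_error_rate(reference, hyp_enhanced) - word_error_rate(reference, hyp_raw)
```

In use, `hyp_raw` would be the transcript of the original noisy audio and `hyp_enhanced` the transcript of the same utterance after enhancement. Averaging `wer_delta` over a test set gives a single number whose sign indicates whether preprocessing helped or hurt.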