Imagine you are trying to have a serious conversation with a friend.
Scenario A: You are in a quiet, soundproof recording studio. Your friend speaks clearly, and you hear every word perfectly. This is how most computer "speech-to-text" systems (like Siri or Google Assistant) are currently trained. They are like students who only study in a silent library.
Scenario B: You are in a large, empty cathedral with high ceilings and hard stone walls. Your friend speaks, but their voice bounces off the walls, creating an echo that mixes with the original sound. This is reverberation. In the real world, this happens in kitchens, gyms, and offices.
The Problem
The paper introduces a new tool called Whisper-RIR-Mega. Think of this as an "Echo Training Gym" for speech computers.
Previously, researchers didn't have a good way to test how well these computers handle echoes. Some tests used fake echoes; others didn't compare the "clean" voice to the "echoey" voice side-by-side. It was like testing a runner's speed on a track, but never seeing how they perform on a muddy field.
The Solution: A Perfect Match
The authors created a dataset where every single sentence has a twin:
- The Clean Twin: The original sentence recorded in a quiet studio (from a famous dataset called LibriSpeech).
- The Echo Twin: The exact same sentence, but mathematically "shouted" into a virtual room by combining the clean recording with a room impulse response (RIR) — a fingerprint of how a specific space echoes (like a long, booming hall or a small, tinny bathroom).
They created 1,600 of these pairs, carefully balancing them so the test includes rooms with short echoes, long echoes, and everything in between.
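The mathematical "shouting" above is, under the hood, a convolution: every sample of the clean recording is replaced by a set of delayed, fading copies described by the room impulse response. Here is a minimal pure-Python sketch of the idea; the tiny signals and the three-tap toy RIR are invented for illustration and are not taken from the actual dataset:

```python
def apply_rir(clean, rir):
    """Convolve a clean signal with a room impulse response (RIR).

    Each output sample is a sum of delayed, scaled copies of the
    input -- exactly what a room's echoes do to a voice.
    """
    out = [0.0] * (len(clean) + len(rir) - 1)
    for n, x in enumerate(clean):
        for k, h in enumerate(rir):
            out[n + k] += x * h
    return out

# Toy example: a single "clap" (impulse) in a room whose RIR has a
# direct path (1.0) followed by two fading echoes (0.5, 0.25).
clean = [1.0, 0.0, 0.0]
rir = [1.0, 0.5, 0.25]
echoey = apply_rir(clean, rir)
print(echoey)  # [1.0, 0.5, 0.25, 0.0, 0.0] -- echoes trail the sound
```

Real pipelines do the same thing on audio arrays with a fast FFT-based convolution, but the effect is identical: the longer the RIR, the longer the echoes smear each word into the next.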
The Experiment: The Whisper Models
The researchers tested five different versions of a popular speech AI called Whisper. You can think of these models as students of different sizes and intelligence levels:
- Whisper-tiny: A very small, fast student (good for phones, but maybe not the smartest).
- Whisper-large-v3: A giant, highly educated scholar (very smart, but takes more energy to run).
They asked each student to transcribe the sentences in both the Quiet Studio and the Echoey Hall.
The Results: Size Matters
Here is what they found, using a simple analogy:
- The Small Student (Whisper-tiny): In the quiet studio, they got about half the words right. But in the echoey hall, they got completely confused. Their score dropped by 15.5 points. They were like a person trying to read a book while someone was shouting over a drum solo.
- The Big Scholar (Whisper-large-v3): In the quiet studio, they were already excellent. In the echoey hall, they stumbled a little, but only lost 2.3 points. They were like a wise old professor who could ignore the background noise and still hear the main point.
The Big Takeaway: The bigger and smarter the AI model, the better it is at ignoring echoes. However, every model got worse when echoes were present.
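The "points" in these scores are almost certainly word error rate (WER) percentage points, the standard yardstick for speech-to-text: roughly, how many words out of 100 the model gets wrong. A self-contained sketch of how such a score is computed (the example sentences here are invented, not from LibriSpeech):

```python
def wer(reference, hypothesis):
    """Word error rate: word-level edit distance / reference length, in %."""
    ref, hyp = reference.split(), hypothesis.split()
    # Classic dynamic-programming edit distance, counted over words
    # (substitutions, insertions, and deletions each cost 1).
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return 100.0 * d[len(ref)][len(hyp)] / len(ref)

clean_score = wer("the cat sat on the mat", "the cat sat on the mat")
echo_score = wer("the cat sat on the mat", "the cat sat on a map")
print(round(clean_score, 1), round(echo_score, 1))  # 0.0 33.3
```

Under that reading, a 15.5-point drop means the small model misrecognizes roughly 15 extra words out of every 100 once echoes are added, while the large model loses only about 2.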
Why This Matters
This paper is important because it gives researchers a standard ruler to measure how "echo-proof" their new AI models are.
- Before: Developers might build a model that works great in a studio but fails in a real kitchen.
- Now: They can use this "Echo Training Gym" to see exactly how much their model struggles with room acoustics and fix it.
The Bottom Line
The authors have released this dataset, the code, and a "leaderboard" (like a high-score list) for free. They want the whole world of AI researchers to use this tool to build speech assistants that don't just work in perfect silence, but can actually understand us when we're shouting in a noisy, echoey room.
In short: They built a simulator to teach computers how to listen in a cave, and they proved that the "bigger brains" handle the cave much better than the "small brains."