Measuring the Redundancy of Decoder Layers in SpeechLLMs

This study shows that SpeechLLMs inherit significant redundancy from their pretrained LLM decoders: up to 40% of decoder layers can be pruned with little loss on speech recognition and translation tasks, enabling a single, efficient backbone for multi-task deployment.

Adel Moumen, Guangzhi Sun, Philip C Woodland

Published 2026-03-06
📖 4 min read · ☕ Coffee break read

Imagine you've just bought a massive, state-of-the-art super-kitchen (a Speech Large Language Model) to make delicious meals (speech recognition and translation). This kitchen is incredible, but there's a catch: 90% of the kitchen is just a giant, empty dining hall (the LLM decoder) where the food is plated and served. The actual cooking happens in a tiny prep area (the speech encoder).

The big question the researchers asked was: "Do we really need a dining hall that big? Or is most of it just empty space?"

Here is the story of their discovery, broken down into simple concepts:

1. The "Ghost" in the Machine

The researchers found that the giant dining hall wasn't built specifically for cooking speech. It was inherited from a text-only chef (a standard Large Language Model) that was already trained to write essays and chat.

  • The Analogy: Imagine you hire a famous novelist to write a cookbook. You realize the novelist is great at describing food, but they also brought their entire library of 50,000 books with them. The researchers discovered that the "extra books" (redundant layers) the novelist brought were the same whether they were writing a novel or a recipe. The structure of the "extra space" didn't change just because the input changed from text to speech.

2. The Great Pruning Experiment

To test how much space was actually wasted, the researchers started removing rooms from the dining hall. They didn't just knock down walls randomly; they used a special "similarity detector" (angular distance) to find which rooms were doing the exact same job as their neighbors.

  • The Result: They found that for the biggest kitchens (7–8 billion parameters), they could knock down nearly 40% of the rooms and the food still tasted just as good.
    • Big Kitchens: Could lose ~40% of their size.
    • Medium Kitchens: Could lose ~30%.
    • Small Kitchens: Could only lose ~6% before the food started to taste bad.
    • Lesson: The bigger the model, the more "fluff" it has.
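The "similarity detector" behind this is angular distance between a block's input and output hidden states: if a run of layers barely rotates the representation, it is a candidate for removal. Here is a minimal sketch of that selection rule (function names and array shapes are my own illustration, not the authors' code):

```python
import numpy as np

def angular_distance(h_in, h_out):
    """Angular distance between hidden states before and after a block of layers.
    Values near 0 mean the block barely changes the representation (redundant)."""
    cos = np.sum(h_in * h_out, axis=-1) / (
        np.linalg.norm(h_in, axis=-1) * np.linalg.norm(h_out, axis=-1)
    )
    cos = np.clip(cos, -1.0, 1.0)          # guard against rounding error
    return np.mean(np.arccos(cos) / np.pi)  # normalised to [0, 1]

def most_redundant_block(hidden_states, n_prune):
    """hidden_states: list of per-layer activations, each of shape (tokens, dim).
    Returns the start index of the n_prune consecutive layers whose removal
    would change the representation the least."""
    distances = [
        angular_distance(hidden_states[l], hidden_states[l + n_prune])
        for l in range(len(hidden_states) - n_prune)
    ]
    return int(np.argmin(distances))
```

In practice you would collect these hidden states by running calibration data through the model once, then delete the block starting at the returned index.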

3. The "Healing" Process (Crucial Step)

Here is the tricky part. When you remove a room, the hallway gets weird. The people walking in (the data) expect to meet the person in the next room, but now they are meeting someone three rooms down. If you just cut the wall, the service collapses.

  • The Fix: The researchers had to perform "surgery" to reconnect the hallway. They added small, flexible bridges (LoRA adapters) to the receiving rooms and adjusted the entrance ramp (the projector).
  • The Insight: If they only fixed the hallway, the food still tasted off. They had to fix both the hallway and the entrance ramp together. It's like realizing that if you move the kitchen counter, you also have to move the sink to keep the workflow smooth.
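In concrete terms, the "flexible bridge" is a LoRA adapter: the pretrained weight stays frozen and a small low-rank update is trained to absorb the mismatch, while the speech-to-LLM projector is fine-tuned alongside it. A bare-bones sketch of the adapter idea (plain numpy, illustrative only; a real system would use a deep-learning framework):

```python
import numpy as np

class LoRALinear:
    """A frozen pretrained weight W plus a trainable low-rank update.

    Output: x @ W.T + scale * (x @ A.T) @ B.T, where A and B are tiny
    compared to W. Only A and B (and, here, the projector) are updated
    during "healing", so the repair is cheap.
    """

    def __init__(self, W, rank=8, alpha=16.0, seed=0):
        rng = np.random.default_rng(seed)
        self.W = W                                  # frozen weight, shape (out_dim, in_dim)
        self.A = rng.normal(scale=0.01, size=(rank, W.shape[1]))
        self.B = np.zeros((W.shape[0], rank))       # zero-init: no change at step 0
        self.scale = alpha / rank

    def __call__(self, x):
        return x @ self.W.T + self.scale * (x @ self.A.T) @ self.B.T
```

Because B starts at zero, the adapted layer initially behaves exactly like the frozen original; training then nudges only the low-rank path to re-align the "hallway" after layers are removed.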

4. One Size Fits All (Speech to Translation)

The researchers then asked: "If we prune this kitchen for transcribing English speech into English text (ASR), will it still work for translating English speech into French text (AST)?"

  • The Surprise: Yes! The exact same rooms that were useless for English speech were also useless for French translation.
  • The Analogy: It's like realizing that the "extra storage closet" in your house is useless whether you are cooking dinner or baking a cake. You don't need two different houses; you can just have one smaller, efficient house that does both jobs perfectly.

Why Does This Matter?

This discovery is a game-changer for two reasons:

  1. Speed and Cost: By removing the "empty rooms," these AI models become 35% faster and use 35% less memory. This means they can run on cheaper, smaller computers (like laptops or phones) instead of massive supercomputers.
  2. Universal Design: We don't need to build a unique, bloated AI for every single task (speech, translation, summarization). We can build one streamlined, "pruned" backbone that handles all of them efficiently.
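As a back-of-envelope check on the speed-and-cost claim: if the decoder dominates inference, removing a fraction of its layers cuts compute and weights roughly in proportion. The depths below are illustrative, not the paper's measurements:

```python
# Hypothetical numbers: a 32-layer decoder with ~40% of layers pruned.
total_layers = 32
pruned = 12
saving = pruned / total_layers
print(f"approx. decoder compute/memory saving: {saving:.0%}")  # → 38%
```

That rough proportion lands in the same ballpark as the ~35% speed and memory savings reported above.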

The Bottom Line

The paper proves that our current AI models are like oversized suits: they look impressive, but they are full of extra fabric we don't need. By carefully cutting away the excess and stitching the seams back together, we can create lighter, faster, and cheaper AI that works just as well for listening to speech and translating languages.