Measuring the Redundancy of Decoder Layers in SpeechLLMs

This study shows that SpeechLLMs inherit significant redundancy from their pretrained LLM decoders: up to 40% of decoder layers can be pruned with little loss on speech recognition and translation tasks, enabling a single, efficient backbone for multi-task deployment.

Adel Moumen, Guangzhi Sun, Philip C Woodland

Published 2026-03-06
📖 4 min read · ☕ Coffee break read

Imagine you've just bought a massive, state-of-the-art super-kitchen (a Speech Large Language Model) to make delicious meals (speech recognition and translation). This kitchen is incredible, but there's a catch: 90% of the kitchen is just a giant, empty dining hall (the LLM decoder) where the food is plated and served. The actual cooking happens in a tiny prep area (the speech encoder).

The big question the researchers asked was: "Do we really need a dining hall that big? Or is most of it just empty space?"

Here is the story of their discovery, broken down into simple concepts:

1. The "Ghost" in the Machine

The researchers found that the giant dining hall wasn't built specifically for cooking speech. It was inherited from a text-only chef (a standard Large Language Model) that was already trained to write essays and chat.

  • The Analogy: Imagine you hire a famous novelist to write a cookbook. You realize the novelist is great at describing food, but they also brought their entire library of 50,000 books with them. The researchers discovered that the "extra books" (redundant layers) the novelist brought were the same whether they were writing a novel or a recipe. The structure of the "extra space" didn't change just because the input changed from text to speech.

2. The Great Pruning Experiment

To test how much space was actually wasted, the researchers started removing rooms from the dining hall. They didn't just knock down walls randomly; they used a special "similarity detector" (angular distance) to find which rooms were doing the exact same job as their neighbors.

  • The Result: They found that for the biggest kitchens (7–8 billion parameters), they could knock down nearly 40% of the rooms and the food still tasted just as good.
    • Big Kitchens: Could lose ~40% of their size.
    • Medium Kitchens: Could lose ~30%.
    • Small Kitchens: Could only lose ~6% before the food started to taste bad.
    • Lesson: The bigger the model, the more "fluff" it has.
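The "similarity detector" behind this is angular distance between a block's input and output hidden states: if a run of layers barely rotates the representation, it is a candidate for removal. Here is a minimal sketch of that selection rule (function names and array shapes are my own illustration, not the authors' code):

```python
import numpy as np

def angular_distance(h_in, h_out):
    """Angular distance between hidden states before and after a block of layers.
    Values near 0 mean the block barely changes the representation (redundant)."""
    cos = np.sum(h_in * h_out, axis=-1) / (
        np.linalg.norm(h_in, axis=-1) * np.linalg.norm(h_out, axis=-1)
    )
    cos = np.clip(cos, -1.0, 1.0)          # guard against rounding error
    return np.mean(np.arccos(cos) / np.pi)  # normalised to [0, 1]

def most_redundant_block(hidden_states, n_prune):
    """hidden_states: list of per-layer activations, each of shape (tokens, dim).
    Returns the start index of the n_prune consecutive layers whose removal
    would change the representation the least."""
    distances = [
        angular_distance(hidden_states[l], hidden_states[l + n_prune])
        for l in range(len(hidden_states) - n_prune)
    ]
    return int(np.argmin(distances))
```

In practice you would collect these hidden states by running calibration data through the model once, then delete the block starting at the returned index.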

3. The "Healing" Process (Crucial Step)

Here is the tricky part. When you remove a room, the hallway gets weird. The people walking in (the data) expect to meet the person in the next room, but now they are meeting someone three rooms down. If you just cut the wall, the service collapses.

  • The Fix: The researchers had to perform "surgery" to reconnect the hallway. They added small, flexible bridges (LoRA adapters) to the receiving rooms and adjusted the entrance ramp (the projector).
  • The Insight: If they only fixed the hallway, the food still tasted off. They had to fix both the hallway and the entrance ramp together. It's like realizing that if you move the kitchen counter, you also have to move the sink to keep the workflow smooth.
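In concrete terms, the "flexible bridge" is a LoRA adapter: the pretrained weight stays frozen and a small low-rank update is trained to absorb the mismatch, while the speech-to-LLM projector is fine-tuned alongside it. A bare-bones sketch of the adapter idea (plain numpy, illustrative only; a real system would use a deep-learning framework):

```python
import numpy as np

class LoRALinear:
    """A frozen pretrained weight W plus a trainable low-rank update.

    Output: x @ W.T + scale * (x @ A.T) @ B.T, where A and B are tiny
    compared to W. Only A and B (and, here, the projector) are updated
    during "healing", so the repair is cheap.
    """

    def __init__(self, W, rank=8, alpha=16.0, seed=0):
        rng = np.random.default_rng(seed)
        self.W = W                                  # frozen weight, shape (out_dim, in_dim)
        self.A = rng.normal(scale=0.01, size=(rank, W.shape[1]))
        self.B = np.zeros((W.shape[0], rank))       # zero-init: no change at step 0
        self.scale = alpha / rank

    def __call__(self, x):
        return x @ self.W.T + self.scale * (x @ self.A.T) @ self.B.T
```

Because B starts at zero, the adapted layer initially behaves exactly like the frozen original; training then nudges only the low-rank path to re-align the "hallway" after layers are removed.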

4. One Size Fits All (Speech to Translation)

The researchers then asked: "If we prune this kitchen for transcribing English speech into English text (ASR), will it still work for translating English speech into French text (AST)?"

  • The Surprise: Yes! The exact same rooms that were useless for English speech were also useless for French translation.
  • The Analogy: It's like realizing that the "extra storage closet" in your house is useless whether you are cooking dinner or baking a cake. You don't need two different houses; you can just have one smaller, efficient house that does both jobs perfectly.

Why Does This Matter?

This discovery is a game-changer for two reasons:

  1. Speed and Cost: By removing the "empty rooms," these AI models become 35% faster and use 35% less memory. This means they can run on cheaper, smaller computers (like laptops or phones) instead of massive supercomputers.
  2. Universal Design: We don't need to build a unique, bloated AI for every single task (speech, translation, summarization). We can build one streamlined, "pruned" backbone that handles all of them efficiently.
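As a back-of-envelope check on the speed-and-cost claim: if the decoder dominates inference, removing a fraction of its layers cuts compute and weights roughly in proportion. The depths below are illustrative, not the paper's measurements:

```python
# Hypothetical numbers: a 32-layer decoder with ~40% of layers pruned.
total_layers = 32
pruned = 12
saving = pruned / total_layers
print(f"approx. decoder compute/memory saving: {saving:.0%}")  # → 38%
```

That rough proportion lands in the same ballpark as the ~35% speed and memory savings reported above.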

The Bottom Line

The paper proves that our current AI models are like oversized suits: they look impressive, but they are full of extra fabric we don't need. By carefully cutting away the excess and stitching the seams back together, we can create lighter, faster, and cheaper AI that works just as well for listening to speech and translating languages.