Demystifying When Pruning Works via Representation Hierarchies

This paper explains why network pruning succeeds on non-generative tasks but fails in generative settings. The authors show that while embedding and logit representations remain robust to pruning, the nonlinear mapping from logits to probabilities amplifies perturbations, and those amplified errors accumulate during autoregressive generation.

Shwai He, Guoheng Sun, Haichao Zhang, Yun Fu, Ang Li

Published 2026-03-27

Imagine a Large Language Model (LLM) as a super-smart, multi-story factory that turns a question (input) into an answer (output). The workers on each floor process the information, passing it up to the next floor until the final product is ready.

Network Pruning is like trying to make this factory cheaper and faster by firing some workers or closing down entire floors. The goal is to keep the factory running efficiently without ruining the quality of the products.
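In weight terms, "firing workers" usually means zeroing out parameters. Here is a minimal magnitude-pruning sketch, a generic technique for illustration, not necessarily the exact pruning method evaluated in the paper:

```python
import numpy as np

def magnitude_prune(weights: np.ndarray, sparsity: float) -> np.ndarray:
    """Zero out the smallest-magnitude weights ("fire the least useful workers").

    Illustrative sketch only: real pruning methods may score weights
    differently (e.g. per-layer, activation-aware) and often fine-tune after.
    """
    flat = np.abs(weights).ravel()
    k = int(len(flat) * sparsity)              # how many weights to remove
    if k == 0:
        return weights.copy()
    threshold = np.partition(flat, k - 1)[k - 1]  # k-th smallest magnitude
    mask = np.abs(weights) > threshold         # keep only the larger weights
    return weights * mask

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 4))                    # a toy 16-weight "floor"
W_pruned = magnitude_prune(W, sparsity=0.3)    # ~30% of weights set to zero
```

The pruned matrix is identical to the original except for the zeroed entries, which is why the early "floors" of the network can look almost unchanged.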

This paper asks a very specific question: Why does this "firing" strategy work great for some jobs but cause the factory to completely crash for others?

The Two Types of Jobs

The authors discovered that pruning works differently depending on what the factory is making:

  1. The "Multiple Choice" Job (Non-Generative): Imagine a quiz show where the factory just has to pick the right answer from a list (A, B, C, or D).
    • Result: Pruning works perfectly! Even if you fire 30% of the workers, the factory still picks the right answer.
  2. The "Storytelling" Job (Generative): Imagine the factory has to write a novel, one word at a time, forever.
    • Result: Pruning is a disaster. If you fire the same 30% of workers, the factory starts writing gibberish, repeating itself, or going off the rails after just a few sentences.

The Secret: Three Floors of the Factory

To understand why, the authors broke the factory down into three distinct "spaces" or floors where the information travels:

  1. Floor 1: The Embedding Floor (The Raw Materials)
    • Here, words are turned into numbers (vectors).
    • The Finding: This floor is tough. Even if you remove workers, the raw materials still look almost exactly the same. The factory is very resilient here.
  2. Floor 2: The Logit Floor (The Drafting Table)
    • Here, the factory makes a rough guess about what comes next. It's like a "pre-score" before the final decision.
    • The Finding: This floor is also resilient. The linear math used here actually smooths out the errors caused by firing workers. The rough drafts still look very similar to the original.
  3. Floor 3: The Probability Floor (The Final Decision)
    • Here, the rough guesses are converted into a final percentage chance (e.g., "There is a 90% chance the next word is 'cat'"). This uses a special, non-linear math trick called Softmax.
    • The Finding: This is where the magic turns to disaster. The Softmax function acts like a magnifying glass, or an amplifier with the volume turned up to 11.
    • A tiny, almost invisible error on Floor 2 gets blown up into a massive, catastrophic error on Floor 3.
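This amplification is easy to see numerically: two logit vectors can be nearly identical under a linear similarity measure like cosine similarity, yet Softmax turns their tiny difference into a different final decision. A small sketch with made-up numbers:

```python
import numpy as np

def softmax(z: np.ndarray) -> np.ndarray:
    """Convert raw logits ("Floor 2") into probabilities ("Floor 3")."""
    e = np.exp(z - z.max())  # subtract max for numerical stability
    return e / e.sum()

# Original logits vs. a tiny pruning-induced perturbation (a 0.1 shift
# between the top two scores). These values are illustrative, not measured.
logits = np.array([5.0, 4.9, 1.0])
perturbed = np.array([4.9, 5.0, 1.0])

# On the logit floor, the two vectors are almost indistinguishable:
cos = logits @ perturbed / (np.linalg.norm(logits) * np.linalg.norm(perturbed))

# But after Softmax, greedy decoding picks a *different* next token:
p, q = softmax(logits), softmax(perturbed)
```

Here `cos` is above 0.999, so any linear-similarity check would call the pruned logits "robust", yet `np.argmax(p)` and `np.argmax(q)` disagree: the almost invisible whisper on Floor 2 has flipped the decision on Floor 3.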

The "Whisper vs. Shout" Analogy

Think of the error introduced by pruning as a whisper.

  • On the Embedding and Logit floors, that whisper is barely heard. It doesn't change the outcome.
  • But when that whisper hits the Probability floor (the Softmax), it gets amplified into a shout. Suddenly, the factory thinks "cat" is 99% likely, when it should have been 50%, or it thinks "dog" is impossible when it should be likely.

Why Generative Tasks Crash (The Domino Effect)

This is the most critical part of the paper.

  • In a Quiz (Non-Generative): The factory makes one decision at the end. It looks at the final shout, picks the loudest option, and says "Answer B." Even if the shout was slightly distorted, it's usually still loud enough to pick the right letter. The job is done in one step.
  • In a Story (Generative): The factory makes a decision, writes a word, and then feeds that word back into the machine to write the next one.
    • Step 1: The factory makes a tiny mistake because of the "whisper" amplification. It writes the wrong word.
    • Step 2: Because it wrote the wrong word, the next input is wrong. The factory is now working with bad data.
    • Step 3: The error gets amplified again. The next word is even more wrong.
    • Result: Within a few sentences, the story collapses into nonsense. It's like a game of "Telephone": the message gets distorted at every turn, and because the factory's own design amplifies each distortion, the collapse happens incredibly fast.
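The domino effect can be caricatured with a toy recurrence (the `gain` and `eps` values are illustrative assumptions, not measurements from the paper): each generation step amplifies the accumulated error and injects a fresh perturbation, so a one-shot task stays near the per-step error while multi-step generation quickly saturates.

```python
def cascaded_error(steps: int, eps: float = 0.01, gain: float = 1.5) -> float:
    """Toy model of error after `steps` autoregressive steps.

    Each step amplifies the existing error by `gain` (Softmax amplification
    feeding back through the generated text) and adds a fresh perturbation
    `eps` (the pruning "whisper"). Error is capped at 1.0 (total collapse).
    """
    err = 0.0
    for _ in range(steps):
        err = min(1.0, gain * err + eps)
    return err

one_shot = cascaded_error(1)   # quiz-style task: just the per-step error
story = cascaded_error(20)     # storytelling: error saturates at the cap
```

With these (assumed) parameters the one-shot error stays at 0.01, while the 20-step run hits the 1.0 cap well before the final step, mirroring why the same pruning budget is harmless on a quiz but catastrophic in a story.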

The Takeaway

The paper concludes that pruning is safe for "one-shot" tasks (like answering a multiple-choice question or retrieving a document) because the errors don't have time to grow.

However, pruning is dangerous for "storytelling" tasks (like writing code, stories, or chat) because the factory's own mechanism for making decisions (Softmax) turns tiny mistakes into huge disasters, and the loop of generating text one word at a time lets those disasters compound instantly.

In short: You can fire workers to save money if the factory only has to make one quick choice. But if the factory has to build a long, complex tower brick by brick, firing workers will cause the whole tower to crumble.