Demystifying When Pruning Works via Representation Hierarchies

This paper explains why network pruning succeeds on non-generative tasks but fails in generative settings. The authors show that while embedding and logit representations remain robust to pruning, the nonlinear mapping from logits to probabilities amplifies perturbations, and those amplified errors accumulate during autoregressive generation.

Shwai He, Guoheng Sun, Haichao Zhang, Yun Fu, Ang Li

Published 2026-03-27

Imagine a Large Language Model (LLM) as a super-smart, multi-story factory that turns a question (input) into an answer (output). The workers on each floor process the information, passing it up to the next floor until the final product is ready.

Network Pruning is like trying to make this factory cheaper and faster by firing some workers or closing down entire floors. The goal is to keep the factory running efficiently without ruining the quality of the products.
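In weight terms, "firing workers" usually means zeroing out parameters. Here is a minimal magnitude-pruning sketch, a generic technique for illustration, not necessarily the exact pruning method evaluated in the paper:

```python
import numpy as np

def magnitude_prune(weights: np.ndarray, sparsity: float) -> np.ndarray:
    """Zero out the smallest-magnitude weights ("fire the least useful workers").

    Illustrative sketch only: real pruning methods may score weights
    differently (e.g. per-layer, activation-aware) and often fine-tune after.
    """
    flat = np.abs(weights).ravel()
    k = int(len(flat) * sparsity)              # how many weights to remove
    if k == 0:
        return weights.copy()
    threshold = np.partition(flat, k - 1)[k - 1]  # k-th smallest magnitude
    mask = np.abs(weights) > threshold         # keep only the larger weights
    return weights * mask

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 4))                    # a toy 16-weight "floor"
W_pruned = magnitude_prune(W, sparsity=0.3)    # ~30% of weights set to zero
```

The pruned matrix is identical to the original except for the zeroed entries, which is why the early "floors" of the network can look almost unchanged.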

This paper asks a very specific question: Why does this "firing" strategy work great for some jobs but cause the factory to completely crash for others?

The Two Types of Jobs

The authors discovered that pruning works differently depending on what the factory is making:

  1. The "Multiple Choice" Job (Non-Generative): Imagine a quiz show where the factory just has to pick the right answer from a list (A, B, C, or D).
    • Result: Pruning works perfectly! Even if you fire 30% of the workers, the factory still picks the right answer.
  2. The "Storytelling" Job (Generative): Imagine the factory has to write a novel, one word at a time, forever.
    • Result: Pruning is a disaster. If you fire the same 30% of workers, the factory starts writing gibberish, repeating itself, or going off the rails after just a few sentences.

The Secret: Three Floors of the Factory

To understand why, the authors broke the factory down into three distinct "spaces" or floors where the information travels:

  1. Floor 1: The Embedding Floor (The Raw Materials)
    • Here, words are turned into numbers (vectors).
    • The Finding: This floor is tough. Even if you remove workers, the raw materials still look almost exactly the same. The factory is very resilient here.
  2. Floor 2: The Logit Floor (The Drafting Table)
    • Here, the factory makes a rough guess about what comes next. It's like a "pre-score" before the final decision.
    • The Finding: This floor is also resilient. The linear math used here actually smooths out the errors caused by firing workers. The rough drafts still look very similar to the original.
  3. Floor 3: The Probability Floor (The Final Decision)
    • Here, the rough guesses are converted into a final percentage chance (e.g., "There is a 90% chance the next word is 'cat'"). This uses a special, non-linear math trick called Softmax.
    • The Finding: This is where the magic turns to disaster. The Softmax function acts like a magnifying glass, or an amplifier with the volume turned up to 11.
    • A tiny, almost invisible error on Floor 2 gets blown up into a massive, catastrophic error on Floor 3.
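This amplification is easy to see numerically: two logit vectors can be nearly identical under a linear similarity measure like cosine similarity, yet Softmax turns their tiny difference into a different final decision. A small sketch with made-up numbers:

```python
import numpy as np

def softmax(z: np.ndarray) -> np.ndarray:
    """Convert raw logits ("Floor 2") into probabilities ("Floor 3")."""
    e = np.exp(z - z.max())  # subtract max for numerical stability
    return e / e.sum()

# Original logits vs. a tiny pruning-induced perturbation (a 0.1 shift
# between the top two scores). These values are illustrative, not measured.
logits = np.array([5.0, 4.9, 1.0])
perturbed = np.array([4.9, 5.0, 1.0])

# On the logit floor, the two vectors are almost indistinguishable:
cos = logits @ perturbed / (np.linalg.norm(logits) * np.linalg.norm(perturbed))

# But after Softmax, greedy decoding picks a *different* next token:
p, q = softmax(logits), softmax(perturbed)
```

Here `cos` is above 0.999, so any linear-similarity check would call the pruned logits "robust", yet `np.argmax(p)` and `np.argmax(q)` disagree: the almost invisible whisper on Floor 2 has flipped the decision on Floor 3.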

The "Whisper vs. Shout" Analogy

Think of the error introduced by pruning as a whisper.

  • On the Embedding and Logit floors, that whisper is barely heard. It doesn't change the outcome.
  • But when that whisper hits the Probability floor (the Softmax), it gets amplified into a shout. Suddenly, the factory thinks "cat" is 99% likely, when it should have been 50%, or it thinks "dog" is impossible when it should be likely.

Why Generative Tasks Crash (The Domino Effect)

This is the most critical part of the paper.

  • In a Quiz (Non-Generative): The factory makes one decision at the end. It looks at the final shout, picks the loudest option, and says "Answer B." Even if the shout was slightly distorted, it's usually still loud enough to pick the right letter. The job is done in one step.
  • In a Story (Generative): The factory makes a decision, writes a word, and then feeds that word back into the machine to write the next one.
    • Step 1: The factory makes a tiny mistake because of the "whisper" amplification. It writes the wrong word.
    • Step 2: Because it wrote the wrong word, the next input is wrong. The factory is now working with bad data.
    • Step 3: The error gets amplified again. The next word is even more wrong.
    • Result: Within a few sentences, the story collapses into nonsense. It's like a game of "Telephone": the message gets distorted at every turn, and because the factory's own design amplifies each distortion, the collapse happens incredibly fast.
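The domino effect can be caricatured with a toy recurrence (the `gain` and `eps` values are illustrative assumptions, not measurements from the paper): each generation step amplifies the accumulated error and injects a fresh perturbation, so a one-shot task stays near the per-step error while multi-step generation quickly saturates.

```python
def cascaded_error(steps: int, eps: float = 0.01, gain: float = 1.5) -> float:
    """Toy model of error after `steps` autoregressive steps.

    Each step amplifies the existing error by `gain` (Softmax amplification
    feeding back through the generated text) and adds a fresh perturbation
    `eps` (the pruning "whisper"). Error is capped at 1.0 (total collapse).
    """
    err = 0.0
    for _ in range(steps):
        err = min(1.0, gain * err + eps)
    return err

one_shot = cascaded_error(1)   # quiz-style task: just the per-step error
story = cascaded_error(20)     # storytelling: error saturates at the cap
```

With these (assumed) parameters the one-shot error stays at 0.01, while the 20-step run hits the 1.0 cap well before the final step, mirroring why the same pruning budget is harmless on a quiz but catastrophic in a story.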

The Takeaway

The paper concludes that pruning is safe for "one-shot" tasks (like answering a multiple-choice question or retrieving a document) because the errors don't have time to grow.

However, pruning is dangerous for "storytelling" tasks (like writing code, stories, or chat) because the factory's own mechanism for making decisions (Softmax) turns tiny mistakes into huge disasters, and the loop of generating text one word at a time lets those disasters compound instantly.

In short: You can fire workers to save money if the factory only has to make one quick choice. But if the factory has to build a long, complex tower brick by brick, firing workers will cause the whole tower to crumble.