Imagine you have a brilliant, overworked chef (a Large Language Model) who can cook almost any dish in the world. This chef is incredibly talented but requires a massive kitchen, a huge team of assistants, and mountains of ingredients to operate. You want to shrink this kitchen down to fit in a tiny apartment (your phone or laptop) without losing the chef's ability to cook delicious meals.
This is the problem of pruning: cutting down the size of AI models to make them faster and cheaper to run.
The paper you shared introduces a new, smarter way to do this cutting, called HFPrune (High-Fidelity Pruning). Here is how it works, explained through simple analogies.
1. The Problem: The "One-Note" Critic
For a long time, scientists tried to figure out which parts of the AI to cut by using a method called Taylor Pruning. Think of this method as having a very strict, narrow-minded food critic.
- The Old Way (Cross-Entropy): This critic only cares whether the chef correctly makes the one specific dish the customer ordered. If the customer asked for "Spaghetti," the critic only checks the Spaghetti.
- The Flaw: If the chef was also thinking about making "Lasagna" or "Ravioli" as backup options, this critic ignores those thoughts entirely. When you cut an assistant based on this critic's advice, you might accidentally remove the person who was great at making Lasagna, just because the customer didn't happen to order it this time. The result? The chef loses their versatility and creativity.
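To make the flaw concrete, here is a toy sketch in plain Python (the names and numbers are illustrative, not from the paper). Cross-entropy loss only reads the probability assigned to the single correct answer, so a pruning step that scrambles every backup option can look completely harmless to it:

```python
import math

def cross_entropy(probs, target):
    # Cross-entropy looks at ONE entry: the probability of the dish
    # the customer actually ordered. Everything else is invisible to it.
    return -math.log(probs[target])

# The chef's "mental menu" over (Spaghetti, Lasagna, Ravioli),
# before and after firing an assistant.
before = [0.40, 0.30, 0.30]
after  = [0.40, 0.55, 0.05]  # top dish intact, backups scrambled

target = 0  # the customer ordered Spaghetti

# The one-note critic sees no change at all:
assert cross_entropy(before, target) == cross_entropy(after, target)
```

Even though the model's beliefs about Lasagna and Ravioli changed drastically, the loss on the ordered dish is identical, so a purely cross-entropy-based importance score can miss the damage.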
2. The Solution: The "Whole Menu" Critic
The authors of this paper say, "Let's stop looking at just one dish. Let's look at the entire menu the chef is thinking about."
They propose a new method based on Information Entropy.
- The New Way (Information Entropy): Instead of a narrow critic, imagine a wise mentor who looks at the chef's entire mental state. They ask: "If we remove this assistant, how much does the chef's entire list of possible dishes change?"
- The Goal: The goal isn't just to keep the "Spaghetti" prediction perfect. The goal is to keep the shape of the chef's entire thought process intact. If the chef was 40% sure of Spaghetti, 30% of Lasagna, and 30% of Ravioli, we want to make sure that after we fire some assistants, those percentages stay roughly the same.
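The paper's actual entropy-based criterion is defined in the paper itself; as a toy illustration of the general idea, here is the same menu example scored with KL divergence, a standard way to measure how much an entire probability distribution has shifted (my choice of measure here is an assumption, used only to show the contrast with the single-dish view):

```python
import math

def kl_divergence(p, q):
    # Compares the WHOLE menu: every dish's probability contributes,
    # not just the one the customer ordered.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

before     = [0.40, 0.30, 0.30]  # Spaghetti, Lasagna, Ravioli
after_good = [0.41, 0.29, 0.30]  # menu roughly preserved
after_bad  = [0.40, 0.55, 0.05]  # top dish intact, backups destroyed

# A whole-menu critic can tell these two outcomes apart,
# even though both leave the Spaghetti probability unchanged:
assert kl_divergence(before, after_good) < kl_divergence(before, after_bad)
```

This is exactly the property the "wise mentor" needs: an assistant whose removal barely moves the whole distribution is safe to fire; one whose removal reshapes it is not.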
3. Why This is a Big Deal
The paper highlights two main advantages of this "Whole Menu" approach:
- No Extra Teacher Needed: Some other methods try to fix the "One-Note" problem by hiring a second, super-expensive "Teacher Chef" to supervise the cutting process. This is slow and expensive. The new method (HFPrune) is like the chef teaching themselves; it uses the chef's own internal logic to decide who to keep, saving time and money.
- Better Results: Because it respects the chef's full range of thoughts, the final "small kitchen" version of the AI is much more accurate. In their tests, they cut 20% to 30% of the AI's brain (specifically the "MLP" layers, the feed-forward parts that act like the chef's memory and reasoning centers) and the AI actually performed better than the original after a short fine-tuning run, a tiny bit of practice.
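Once every unit has an importance score, the pruning step itself is simple: rank the units and drop the least important fraction. Here is a minimal sketch of that ranking step in plain Python (the scores and the helper name are hypothetical, not from the paper):

```python
def prune_lowest(importance, ratio):
    # Keep the (1 - ratio) fraction of units with the highest
    # importance scores; return the indices of the survivors.
    k = int(len(importance) * (1 - ratio))
    ranked = sorted(range(len(importance)),
                    key=lambda i: importance[i], reverse=True)
    return sorted(ranked[:k])

# Hypothetical per-neuron importance scores for five MLP units:
scores = [0.9, 0.1, 0.5, 0.05, 0.7]

# Pruning 40% keeps the three strongest units (indices 0, 2, 4):
assert prune_lowest(scores, 0.4) == [0, 2, 4]
```

The whole debate in the paper is about how to compute `importance` well; once the scores reflect the model's full distribution rather than a single answer, this final cut removes the units the model will miss least.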
4. The Analogy of the Orchestra
Think of a Large Language Model as a massive orchestra.
- The Old Method: The conductor asks, "Who is playing the violin solo right now?" If a violinist stops playing, the conductor only checks if the solo is still perfect. They might fire a cellist who was playing a background note, not realizing that the cellist was actually holding the whole song together.
- The New Method (HFPrune): The conductor listens to the entire symphony. They ask, "If we remove this musician, does the harmony of the whole song change?" They only fire the musicians whose absence changes the music the least. The result is a smaller orchestra that still sounds just as rich and complex as the big one.
The Bottom Line
The authors created a tool called HFPrune that allows us to shrink massive AI models significantly (making them faster and cheaper) without making them "dumb." By looking at the AI's entire world of possibilities rather than just a single answer, they preserve the AI's "fidelity" (its true nature and intelligence).
It's like shrinking a giant library down to backpack size, but ensuring that every book inside still tells the whole story, not just the first sentence.