Imagine you have a giant, super-intelligent robot (a Large Vision Transformer) that can look at a picture and tell you exactly what's in it. This robot is incredibly smart, but it's also huge, heavy, and expensive to run. It eats up a lot of electricity and memory, making it hard to put inside a phone or a small computer.
The problem is that this robot is built with thousands of tiny workers (neurons). The paper discovers that about 80% of these workers are actually sitting around doing nothing or repeating the same tasks. They are "redundant."
The authors of this paper, Chengchao Shen and his team, came up with a clever way to fire the lazy workers without making the robot forget how to do its job. They call this method Adaptive MLP Pruning (AMP).
Here is how it works, broken down into simple steps:
1. The Problem with Old Methods: The "One-Question" Test
Imagine you want to find out which workers in a factory are essential.
- Old Method: You ask each worker, "Can you identify this specific red apple?" If they say "Yes," they stay. If they say "No," you fire them.
- The Flaw: This is too narrow! A worker might be terrible at spotting red apples but amazing at spotting blue cars or green trees. By only asking about the apple, you accidentally fire the expert on blue cars. This is what the paper calls using "one-hot cross entropy"—it ignores all the other possibilities.
2. The New Method: The "Full Picture" Test (Information Entropy)
The authors propose a better way to test the workers. Instead of asking about just one thing, they ask the robot to look at a picture and describe everything it sees, including how sure it is about each possibility.
- The Analogy: Imagine a weather forecaster. The old method only checks if they correctly predicted "Rain." The new method checks their entire forecast: "30% chance of rain, 20% chance of sun, 50% chance of clouds."
- Why it's better: This "Information Entropy" test looks at the whole picture. It captures the robot's full understanding of the world. This allows them to accurately identify which workers are truly essential and which ones are just copying others.
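The contrast between the two "tests" can be sketched in a few lines of plain Python. This is not the paper's exact scoring formula, just a minimal illustration of the idea: a one-hot cross-entropy check only reads off the probability of a single target class, while Shannon entropy sums over the model's entire predicted distribution. The `forecast` values below are the weather-forecaster numbers from the analogy, not data from the paper.

```python
import math

def one_hot_loss(probs, true_idx):
    # Old test: only the probability assigned to one target class matters.
    # Everything the model "knows" about the other classes is ignored.
    return -math.log(probs[true_idx])

def entropy(probs):
    # New test: Shannon entropy uses the entire predicted distribution,
    # so information spread across *all* classes is taken into account.
    return -sum(p * math.log(p) for p in probs if p > 0)

# The forecaster's full prediction: 30% rain, 20% sun, 50% clouds.
forecast = [0.3, 0.2, 0.5]
print(round(one_hot_loss(forecast, 0), 3))  # only sees the 30% "rain" entry
print(round(entropy(forecast), 3))          # sees the whole forecast
```

Notice that `one_hot_loss` would give the same value for any two forecasts that agree on the "rain" entry, no matter how differently they spread the rest of the probability; the entropy score changes whenever the overall distribution changes.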
3. The "Goldilocks" Search (Adaptive Pruning)
Once they know which workers are the "stars" and which are "lazy," they need to decide how many to fire.
- Old Method: "Let's fire exactly 40% of the workers, no matter what." This is risky. Maybe one department needs to keep 90% of its staff, while another can lose 90%. A flat rule hurts performance.
- The New Method (Binary Search): Imagine you are trying to find the perfect amount of salt for a soup. You don't just guess. You taste, add a little, taste again, and adjust.
- The robot tries removing a few workers.
- It checks: "Is the soup still tasty?" (Is the robot still confident and accurate enough?)
- If yes, it removes a few more workers. If no, it puts some back.
- It keeps doing this "taste test" until it finds the perfect balance where the robot is as small as possible but still just as smart. This is called "Adaptive" because it adjusts to the specific needs of each part of the robot.
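The "taste test" loop above is essentially a binary search. Here is a toy sketch of that idea; the function name `find_keep_ratio`, the `quality` callback, and the square-root quality curve are all hypothetical stand-ins, not the paper's actual criterion. It searches for the smallest fraction of workers to keep while the quality score stays above a chosen threshold.

```python
def find_keep_ratio(quality, min_quality, steps=20):
    """Binary-search the smallest fraction of neurons to keep such that
    the layer's quality score stays above min_quality (the 'taste test')."""
    lo, hi = 0.0, 1.0
    for _ in range(steps):
        mid = (lo + hi) / 2
        if quality(mid) >= min_quality:
            hi = mid   # soup is still tasty: try removing more workers
        else:
            lo = mid   # too bland: put some workers back
    return hi

# Toy quality curve: quality rises with the fraction of neurons kept.
toy_quality = lambda keep: keep ** 0.5
ratio = find_keep_ratio(toy_quality, min_quality=0.9)
print(round(ratio, 3))  # ~0.81, since 0.81 ** 0.5 == 0.9
```

Because the threshold is checked per layer ("department"), each part of the model can settle on its own keep ratio — that per-layer adjustment is what makes the pruning "adaptive" rather than a flat 40%-everywhere rule.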
4. The "Mentor" System (Knowledge Distillation)
After firing the lazy workers, the robot might feel a little confused or shaky. It's like a student who just lost their study group.
- The Solution: The original, giant robot stays around to act as a Mentor (Teacher).
- The smaller, trimmed robot (Student) tries to do the job, and the Mentor whispers, "No, look here, that's actually a cat, not a dog."
- The Student learns from the Mentor's answers and quickly recovers almost all of its original performance, even though it has far fewer workers.
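The mentor-student step is standard knowledge distillation: the student is trained to match the teacher's full soft "forecast", not just the single hard label. A common way to measure the mismatch is the KL divergence between the two distributions; the sketch below assumes that formulation (the example numbers are invented for illustration, not taken from the paper).

```python
import math

def soft_targets_loss(teacher_probs, student_probs):
    """KL divergence between the mentor's full 'forecast' and the student's.
    It is zero when the student matches the teacher exactly, and grows as
    the student's distribution drifts away from the teacher's."""
    return sum(t * math.log(t / s)
               for t, s in zip(teacher_probs, student_probs) if t > 0)

teacher = [0.7, 0.2, 0.1]   # mentor: "mostly cat, a bit dog, a bit fox"
student = [0.4, 0.4, 0.2]   # the pruned student is still unsure
print(round(soft_targets_loss(teacher, student), 3))
```

Minimizing this loss during fine-tuning pulls the trimmed model's whole output distribution back toward the original model's, which is why recovery is fast: the student inherits the mentor's "30%-rain-style" nuance instead of relearning everything from hard labels alone.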
The Result: A Super-Compact Robot
By using this method, the authors managed to:
- Cut the robot's size by 40% (fewer parameters).
- Cut the energy cost by 40% (fewer calculations).
- Keep the intelligence almost exactly the same (near lossless).
In fact, when they tested this on famous robots like CLIP and DINOv2, the trimmed-down versions were just as good at recognizing images as the original giants. In some cases, they were even slightly better!
In short: They found a way to trim the fat off giant AI models without cutting into the muscle, using a smart "taste test" to find the perfect size and a "mentor" to help the smaller model learn how to be just as smart as the big one.