HiPP-Prune: Hierarchical Preference-Conditioned Structured Pruning for Vision-Language Models

HiPP-Prune is a hierarchical preference-conditioned structured pruning framework for vision-language models that leverages visual sensitivity signals and multi-objective Group Relative Policy Optimization to generate controllable pruning plans, effectively balancing task utility, compression, and hallucination robustness.

Lincen Bai, Hedi Tabia, Raul Santos-Rodriguez

Published 2026-03-09

Imagine you have a brilliant, overworked assistant named VLM (Vision-Language Model). This assistant is incredibly smart: it can look at a photo and write a story about it, or answer complex questions based on what it sees. But there's a catch: this assistant is huge. It takes up a massive amount of computer memory and runs very slowly, making it hard to put on a regular phone or laptop.

To fix this, people usually try to "prune" the assistant—basically, they fire some of its neurons (its brain cells) to make it smaller and faster.

The Problem:
The old way of firing neurons was like a random firing squad. You'd just say, "Fire 20% of the staff!" without thinking about who you were firing.

  • The Result: The assistant might still be fast, but it starts hallucinating. It might look at a picture of a dog and confidently say, "I see a cat!" because it fired the specific neurons that were good at recognizing animals. It became efficient but unreliable.

The Solution: HiPP-Prune
The authors of this paper created a new system called HiPP-Prune. Think of it as a smart, strategic HR manager who doesn't just fire people randomly, but carefully reorganizes the team based on what the company needs right now.

Here is how it works, using simple analogies:

1. The "Menu" of Priorities (Preference-Conditioned)

Imagine you are ordering a meal, but instead of picking one dish, you tell the chef: "I want 70% taste, 20% health, and 10% speed."

  • Old Way: The chef just shrinks the whole meal by 20% randomly. You lose flavor and nutrition.
  • HiPP-Prune: The chef (the AI policy) looks at your specific "preference menu."
    • If you say, "I care most about not lying (Robustness)," the chef keeps the neurons that are good at checking facts, even if it means the model is slightly slower.
    • If you say, "I care most about speed (Compression)," the chef cuts the heavy parts but tries to keep the core logic intact.
    • The Magic: You only need to train one "Chef." You can ask for different menus later without retraining the whole kitchen.
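The "one chef, many menus" idea can be sketched as a scoring function that blends per-objective importance scores with a preference vector, then keeps the top-scoring neurons. This is an illustrative toy, not the paper's actual policy; the function names and the simple linear blend are assumptions.

```python
def preference_score(neuron_scores, prefs):
    """Blend per-objective importance scores with a preference vector.

    neuron_scores: dict mapping objective name -> list of per-neuron scores
    prefs: dict mapping objective name -> weight (weights sum to 1)
    Returns one blended importance score per neuron.
    """
    n = len(next(iter(neuron_scores.values())))
    blended = [0.0] * n
    for objective, weight in prefs.items():
        for i, score in enumerate(neuron_scores[objective]):
            blended[i] += weight * score
    return blended


def prune_plan(blended, keep_ratio):
    """Keep the top keep_ratio fraction of neurons by blended score."""
    k = max(1, int(round(keep_ratio * len(blended))))
    order = sorted(range(len(blended)), key=lambda i: blended[i], reverse=True)
    return sorted(order[:k])
```

Changing `prefs` changes which neurons survive, with no retraining: the same scoring machinery serves every "menu."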

2. The "Visual Radar" (Visual Sensitivity)

This is the most important part. In a normal assistant, all neurons are treated the same. But in a Vision-Language Model, some neurons are the "eyes" and some are the "mouth."

  • The Analogy: Imagine a detective (the model) looking at a crime scene photo.
    • The "Mouth" neurons write the report.
    • The "Eye" neurons actually look at the photo to see the gun, the blood, or the suspect.
  • The Mistake: Old pruning methods might accidentally fire the "Eye" neurons to save space. The detective then writes a report based on nothing but guesses, leading to hallucinations.
  • HiPP-Prune's Fix: The system has a Visual Radar. It knows exactly which neurons are looking at the image. When it needs to cut staff, it puts a "Do Not Fire" sticker on the "Eye" neurons. It protects the visual grounding so the model doesn't start making things up.
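One simple way to picture the "Visual Radar" (purely illustrative; the paper's actual sensitivity signal is more involved): probe each neuron with and without the image present, and put the "Do Not Fire" sticker on the neurons whose activations change the most.

```python
def visual_sensitivity(act_with_image, act_without_image):
    """Per-neuron visual sensitivity: how much a neuron's activation
    changes when the image is present (absolute difference, a toy proxy)."""
    return [abs(a - b) for a, b in zip(act_with_image, act_without_image)]


def protect_mask(sensitivity, protect_frac):
    """'Do Not Fire' mask for the most visually sensitive neurons:
    True means the neuron is protected from pruning."""
    k = int(round(protect_frac * len(sensitivity)))
    top = set(sorted(range(len(sensitivity)),
                     key=lambda i: sensitivity[i], reverse=True)[:k])
    return [i in top for i in range(len(sensitivity))]
```

Neurons whose activations barely move when the image appears (the "mouth") remain fair game for pruning; the "eyes" are shielded.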

3. The "Architect's Blueprint" (Hierarchical Planning)

Instead of firing one neuron at a time (which is slow and chaotic), HiPP-Prune draws a blueprint in one go.

  • It decides: "We need to cut 30% of the total staff." (Global Budget).
  • Then it decides: "Cut 50% from the math department, but only 10% from the art department." (Layer Allocation).
  • This happens in a single pass, producing one coherent blueprint that balances the cut across layers.
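A toy version of the two-level blueprint, assuming a simple inverse-importance allocation (the paper's learned allocator is more sophisticated): a global budget is fixed first, then split across layers so that less important layers absorb more of the cut.

```python
def allocate_layer_budgets(layer_importance, global_prune_frac):
    """Split a global pruning budget across layers.

    Each layer's prune fraction is proportional to its inverse importance,
    scaled so the average across layers equals the global budget.
    A real allocator would also clamp each fraction into [0, 1].
    """
    inverse = [1.0 / imp for imp in layer_importance]
    total = sum(inverse)
    n_layers = len(layer_importance)
    return [global_prune_frac * n_layers * w / total for w in inverse]
```

With equal importances every layer takes the same cut; a layer three times as important as its neighbor takes a third of the neighbor's cut, while the overall budget is preserved.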

4. The "Safety Net" (SynFlow Stability)

Sometimes, when you try to cut too much, the building collapses.

  • The Analogy: If you remove too many support beams from a house, it falls down, even if you kept the "Eye" neurons.
  • HiPP-Prune's Fix: It uses a "Stability Gate" (inspired by a concept called SynFlow). Before finalizing a plan, it asks: "If we cut this much, will the house still stand?" If the plan would cause a collapse (the model stops working entirely), the system rejects it and tries a different plan. This stops the AI from wasting time on impossible solutions.
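The stability check can be sketched with the core SynFlow idea: measure total "synaptic flow" by pushing an all-ones input through the absolute-valued weights, then reject any plan that collapses that flow. A minimal, framework-free sketch; the threshold and data layout are assumptions, not the paper's exact gate.

```python
def total_synflow(layers):
    """Total synaptic flow of a chain of linear layers.

    layers: list of weight matrices, each a list of rows (outputs x inputs).
    Forwards an all-ones vector through the absolute-valued weights and
    sums the output, which totals |weight| products over all paths.
    """
    x = [1.0] * len(layers[0][0])
    for W in layers:
        x = [sum(abs(w) * xi for w, xi in zip(row, x)) for row in W]
    return sum(x)


def stability_gate(dense_layers, pruned_layers, min_frac=0.1):
    """'Will the house still stand?' Accept the pruned plan only if it
    keeps at least min_frac of the dense network's total flow."""
    return total_synflow(pruned_layers) >= min_frac * total_synflow(dense_layers)
```

A plan that zeroes out every path through some layer sends the flow to zero and is rejected before any expensive evaluation.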

5. The "Tune-Up" (Post-Pruning Recovery)

Even with the best HR manager, firing people causes a little chaos. The remaining team needs a quick meeting to get back in sync.

  • The paper uses a lightweight "tune-up" (fine-tuning) after pruning. This is like a quick workshop where the remaining neurons relearn how to work together.
  • The Result: Because HiPP-Prune fired the right people and kept the right people, this tune-up is very effective. The model comes back faster, smarter, and much less likely to hallucinate than models pruned by other methods.
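The "tune-up" amounts to ordinary fine-tuning with the pruning mask held fixed, so cut neurons stay cut. A one-line sketch of a masked update step; the names and the plain SGD rule are illustrative, not the paper's training recipe.

```python
def recovery_step(weights, keep_mask, grads, lr=0.1):
    """One recovery fine-tuning step: gradient-update surviving weights
    only; pruned weights (keep_mask False) are pinned to zero."""
    return [w - lr * g if keep else 0.0
            for w, g, keep in zip(weights, grads, keep_mask)]
```

Repeating this step over a small calibration set lets the surviving neurons "get back in sync" without ever resurrecting the pruned ones.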

The Bottom Line

HiPP-Prune is like a smart, customizable shrink-ray for AI models.

  • It doesn't just make the model smaller; it makes it smarter for the specific job you need.
  • It protects the "eyes" so the model doesn't lie about what it sees.
  • It lets you dial in the perfect balance between Speed, Accuracy, and Honesty with a single click, without needing to rebuild the model from scratch.

In experiments, the method showed that by being strategic about where you cut, you can get a small, fast model that stays reliable and doesn't start making up facts.