High-Fidelity Pruning for Large Language Models
This paper proposes High-Fidelity Pruning (HFPrune), a method that scores neuron importance for Taylor-based pruning using the information entropy of the model's own output distribution. This criterion overcomes the limitations of standard cross-entropy objectives and avoids the computational overhead of self-distillation, since no additional teacher model is required. HFPrune achieves superior performance on LLaMA and Qwen models.
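The core criterion is straightforward to sketch: take the entropy H(p) of the model's predictive distribution as the objective and score each weight by the first-order Taylor term |w · ∂H/∂w|. The snippet below is a minimal illustration, assuming a Hugging Face-style causal LM whose forward pass returns `.logits`; the helper names (`entropy_objective`, `taylor_importance`) and the per-weight aggregation are ours for illustration, not taken from the paper.

```python
import torch
import torch.nn.functional as F


def entropy_objective(logits: torch.Tensor) -> torch.Tensor:
    """Information entropy of the model's output distribution.

    Unlike a cross-entropy criterion, this needs no ground-truth labels,
    and unlike self-distillation it needs no teacher forward pass: the
    model's own predictive distribution supplies the signal.
    """
    log_probs = F.log_softmax(logits, dim=-1)
    probs = log_probs.exp()
    # H(p) = -sum_v p(v) log p(v), averaged over batch and token positions
    return -(probs * log_probs).sum(dim=-1).mean()


def taylor_importance(model, input_ids, weight_names):
    """First-order Taylor importance |w * dH/dw| for the named weights.

    A sketch only: the paper's exact grouping of weights into neurons
    and its calibration-set averaging are not reproduced here.
    """
    logits = model(input_ids).logits
    objective = entropy_objective(logits)
    params = dict(model.named_parameters())
    grads = torch.autograd.grad(objective, [params[n] for n in weight_names])
    # Neurons (rows/columns of these tensors) with the lowest aggregated
    # scores are the candidates for removal.
    return {n: (params[n] * g).abs() for n, g in zip(weight_names, grads)}
```

Because the objective depends only on the model's own logits, importance estimation can run over unlabeled calibration text, which is what lets the method drop both labels and the teacher model.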