SoLA: Leveraging Soft Activation Sparsity and Low-Rank Decomposition for Large Language Model Compression

SoLA is a novel, training-free compression method for large language models. It leverages soft activation sparsity and adaptive low-rank decomposition to significantly reduce model size while improving perplexity and downstream task accuracy, with no post-training and no special hardware required.

Xinhao Huang, You-Liang Huang, Zeyi Wen

Published 2026-04-07

Imagine you have a massive, world-class library (a Large Language Model, or LLM) that contains billions of books. This library is incredibly smart and can answer any question, but it's so huge that it requires a warehouse the size of a city to store it and a fleet of trucks just to move the books around. Most people can't afford to build or run such a library.

The goal of this paper is to shrink this library down to fit in a backpack without losing its ability to tell stories or solve problems. The authors call their new method SoLA.

Here is how SoLA works, explained through simple analogies:

1. The Problem: The "One-Size-Fits-All" Mistake

Previous attempts to shrink these libraries were like trying to cut a giant cake into equal slices for everyone.

  • Pruning (Cutting): Some people tried to simply throw away the books that seemed least useful. But if you cut out the wrong books, the library stops making sense. Worse, modern libraries rarely have truly "empty" shelves (exact-zero activations) that are easy to skip, so the resulting sparse models are hard to speed up on standard computers.
  • Quantization (Compressing): Others tried to rewrite the books in a smaller font. This saves space, but you often have to hire a team of editors (expensive training) to fix the typos that appear.
  • Low-Rank Decomposition (Summarizing): This is like replacing a thick novel with a short summary. It saves space, but if you summarize everything equally, you lose the important details.

2. The SoLA Solution: The "Star Performer" Strategy

The authors of SoLA looked closely at how these libraries actually work and found a secret pattern: Not all books are created equal.

Step A: Finding the "Prime Neurons" (The Superstars)

Imagine the library has millions of "librarians" (neurons) who help process information.

  • The researchers discovered that only about 15% of these librarians do 95% of the heavy lifting. These are the "Prime Neurons." They are the superstars who know the most important facts.
  • The other 85% are "Marginal Neurons." They are still working, but they are mostly doing minor tasks or repeating things the superstars already said.

The SoLA Move: Instead of treating everyone the same, SoLA says, "Let's keep the 15% superstars exactly as they are. We won't touch them." This ensures the library keeps its core intelligence.
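For the curious, the "find the superstars" idea can be sketched in a few lines of code. Everything here is illustrative, not from the paper: the neuron count, the 15% cutoff, and the scoring rule (mean absolute activation over some calibration data) are stand-ins for whatever criterion the authors actually use.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "calibration" activations: 1000 tokens x 64 neurons (made-up sizes).
# Each neuron gets its own scale, so some naturally dominate.
activations = rng.standard_normal((1000, 64)) * rng.uniform(0.1, 2.0, size=64)

# Score each neuron by its average activation magnitude.
importance = np.abs(activations).mean(axis=0)

# Keep the top ~15% untouched ("prime"); the rest are "marginal".
k = int(0.15 * importance.size)
prime = np.argsort(importance)[-k:]
marginal = np.setdiff1d(np.arange(importance.size), prime)

print(len(prime), len(marginal))  # 9 prime neurons, 55 marginal neurons
```

The prime neurons' weights stay exactly as they are; only the marginal group moves on to Step B.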

Step B: Compressing the Rest (The "Summary" Move)

Now, what about the other 85% of the librarians?

  • SoLA takes this group and compresses them using a mathematical trick called Low-Rank Decomposition. Think of this as taking a 500-page novel and turning it into a highly efficient 50-page summary.
  • Because these librarians weren't doing the heavy lifting anyway, summarizing them doesn't hurt the library's overall quality much.
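The standard recipe for this "summary" move is a truncated singular value decomposition (SVD): one dense matrix becomes two thin ones. Here is a minimal sketch with purely illustrative sizes and rank; the paper's actual decomposition and rank choices may differ.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical weight sub-matrix covering the "marginal" neurons.
W = rng.standard_normal((256, 64))

# Truncated SVD: keep only the top-r singular directions.
r = 16
U, s, Vt = np.linalg.svd(W, full_matrices=False)
A = U[:, :r] * s[:r]   # 256 x 16 factor
B = Vt[:r, :]          # 16 x 64 factor

# Two thin factors (A @ B) stand in for the original dense matrix.
W_approx = A @ B
params_before = W.size            # 256 * 64 = 16384
params_after = A.size + B.size    # 256*16 + 16*64 = 5120

print(params_before, params_after)
```

At rank 16, the replacement stores fewer than a third of the original parameters, which is exactly the "500-page novel to 50-page summary" trade: some detail is lost, but the bulk shrinks dramatically.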

Step C: The "Smart Tailor" (Adaptive Allocation)

Here is the final clever twist. Not all parts of the library need the same amount of compression.

  • Some sections of the library are very sensitive; if you summarize them too much, the story falls apart.
  • Other sections are robust; you can summarize them heavily, and they'll still make sense.

SoLA acts like a smart tailor. It measures exactly how sensitive each part of the library is.

  • For the sensitive parts, it gives them a "light summary" (keeps more details).
  • For the robust parts, it gives them a "heavy summary" (compresses them more).
  • This ensures that the total size is small, but the quality remains high.
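The tailoring step can be sketched as a rank-budgeting problem: measure how fragile each layer is, then hand out more rank (a lighter summary) to the fragile ones. The sensitivity scores, budget, and allocation rule below are invented for illustration; the paper's actual sensitivity metric is not reproduced here.

```python
import numpy as np

# Hypothetical per-layer sensitivity scores (higher = more fragile).
sensitivity = np.array([0.9, 0.2, 0.5, 0.1])

# Total rank budget to spread across the layers (made-up number).
budget = 64

# Sensitive layers get proportionally more rank, i.e. lighter compression;
# every layer keeps at least rank 1.
ranks = np.maximum(1, np.round(budget * sensitivity / sensitivity.sum())).astype(int)

print(ranks)  # most sensitive layer gets the largest rank
```

The most fragile layer ends up with the largest rank and the most robust one with the smallest, so the total stays near the budget while quality is protected where it matters.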

3. The Results: A Backpack-Sized Genius

The authors tested SoLA on some of the biggest and most famous AI models (like LLaMA-2 and Mistral).

  • No Extra Training: Unlike other methods, SoLA doesn't need to re-teach the model. It just reorganizes the existing knowledge. It's like rearranging the books on the shelves rather than rewriting them.
  • Better than the Competition: When they compressed the massive 70-billion-parameter model by 30%, SoLA didn't just shrink it; it actually performed better than other compression methods.
    • The Analogy: Imagine shrinking a 100-page encyclopedia down to 70 pages. Other methods made it hard to read (confusing). SoLA made it a 70-page version that was still easy to read and actually answered questions more accurately than the other compressed versions.

Summary

SoLA is a "training-free" compression tool that works by:

  1. Identifying the VIPs: Keeping the most important parts of the AI untouched.
  2. Summarizing the Rest: Compressing the less important parts efficiently.
  3. Customizing the Fit: Adjusting how much to compress each part based on how sensitive it is.

The result is a smaller, faster, and cheaper AI that fits on your phone or laptop but still acts like a giant supercomputer.
