Imagine you have a massive, incredibly detailed library of knowledge (a Large Language Model, or LLM). This library is so big that it takes up an entire warehouse, and moving books around inside it is slow and expensive. You want to shrink this library down so it fits in a backpack and can be read quickly, but you don't want to lose the stories inside.
This paper introduces a new way to shrink these libraries called Double Binary Factorization (DBF). Here is how it works, explained through simple analogies.
The Problem: The "Heavy" Library
Current AI models are like giant encyclopedias written in high-definition, full-color ink. Every word (or "weight" in the model) is a complex number that requires a lot of energy to read and store.
- The old way to shrink them: People tried to turn the ink black and white (binary) or reduce the number of colors (quantization). But if you just turn everything black and white, the pictures look blurry and the stories make less sense.
- The hardware issue: In a computer chip, multiplying two numbers takes far more circuitry, time, and energy than adding them. It's like asking a chef to chop 1,000 onions with a diamond knife: precise, but slow and exhausting.
The Solution: The "Double-Deck" Blueprint
The authors propose a clever trick. Instead of trying to shrink the whole library at once, they break every single book (weight matrix) into two smaller, simpler blueprints that, when put together, recreate the original book.
Think of it like this:
- The Old Binary Method (OneBit): Imagine trying to describe a complex painting using only a single sheet of paper with black and white dots. It's fast, but the picture is very blocky and loses detail.
- The New DBF Method: Imagine you have two sheets of paper.
- Sheet A is a grid of black and white dots (Binary).
- Sheet B is another grid of black and white dots.
- The Magic Glue: You also have two small rulers (scaling vectors) that tell you how "dark" or "light" to make the dots on each sheet.
When you stack Sheet A and Sheet B together and apply the rulers, they magically reconstruct the original high-definition painting.
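The stacking idea can be sketched in a few lines of NumPy. This is a hypothetical reading of the factorization, not the paper's exact formulation: assume the weight matrix W (m by n) is approximated as a row scale, times Sheet A, times Sheet B, times a column scale, where the "middle dimension" k between the sheets is the knob that tunes the bit budget. All variable names here are illustrative.

```python
import numpy as np

# Hypothetical sketch of the DBF idea: a weight matrix is rebuilt from
# two sign matrices (the "sheets") and two scaling vectors (the
# "rulers"). The exact placement of the scales in the paper may differ.
m, k, n = 4, 6, 4          # k is the tunable "middle dimension"
rng = np.random.default_rng(0)

B1 = rng.choice([-1.0, 1.0], size=(m, k))   # Sheet A: +/-1 entries
B2 = rng.choice([-1.0, 1.0], size=(k, n))   # Sheet B: +/-1 entries
s1 = rng.random(m)                          # per-row scale (ruler 1)
s2 = rng.random(n)                          # per-column scale (ruler 2)

# Reconstruction: scale the rows, stack the sheets, scale the columns.
W_approx = (s1[:, None] * B1) @ B2 * s2[None, :]

# Each sign costs 1 bit, so the effective bits per original weight is
# roughly k*(m+n) / (m*n), ignoring the small scaling vectors. Sliding
# k up or down moves this rate smoothly (e.g. to 1.5 or 2.3 bits).
bits_per_weight = k * (m + n) / (m * n)
print(W_approx.shape, bits_per_weight)
```

Because k can be any integer, the achievable compression rates form a fine-grained ladder rather than a few fixed "shoe sizes."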
Why is this "Double" better?
- It's Smarter: Because you have two sheets instead of one, you can capture more detail. It's like having a stereoscopic 3D view instead of a flat 2D drawing.
- It's Flexible: Most compression methods are like buying shoes that only come in whole sizes (Size 8, Size 9). If you need a Size 8.5, you're out of luck. DBF is like a stretchy, custom-fit shoe. You can adjust the "middle dimension" (the size of the gap between the two sheets) to get exactly the size you need, whether that's 1.5 bits or 2.3 bits.
- It's Fast: The best part? Because the sheets are just black and white dots (+1 or -1), the computer doesn't need to do complex math. It just needs to add and subtract numbers.
- Analogy: Multiplying is like doing a complex dance routine. Adding is just walking in a straight line. DBF turns the dance into a walk, making the computer run 2 to 3.5 times faster while using much less battery power.
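The "walk instead of dance" point can be made concrete. In the toy function below (an illustrative sketch, not the paper's kernel), multiplying a ±1 matrix by a vector is done with no multiplications at all: entries of the input are simply added where the sheet says +1 and subtracted where it says -1.

```python
import numpy as np

# Sketch: a +/-1 matrix times a vector needs no real multiplications.
def binary_matvec(B, x):
    """Compute B @ x where B has entries in {+1, -1}, using only
    additions and subtractions selected by the sign pattern."""
    out = np.empty(B.shape[0])
    for i, row in enumerate(B):
        plus = x[row > 0].sum()     # positions marked +1: add
        minus = x[row < 0].sum()    # positions marked -1: subtract
        out[i] = plus - minus
    return out

rng = np.random.default_rng(1)
B = rng.choice([-1.0, 1.0], size=(3, 5))
x = rng.random(5)

# Sanity check: matches the ordinary multiply-and-accumulate result.
assert np.allclose(binary_matvec(B, x), B @ x)
```

On real hardware the same trick lets dedicated kernels replace multiply-accumulate units with cheap adders, which is where the speed and energy savings come from.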
How they did it (The "Heuristic" Algorithm)
Finding the perfect two sheets to recreate the painting is a math nightmare (it's an "NP-hard" problem). The authors didn't solve the impossible; they used a smart "guess and check" method (a heuristic).
- They started with a random guess.
- They adjusted Sheet A, then Sheet B, then Sheet A again, over and over, getting closer to the perfect picture each time.
- They also used an "importance map." If a part of the story is very important (like the climax of a book), they made sure the blueprints for that part were extra precise. If a part was less important, they compressed it more aggressively.
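A crude stand-in for this "guess and check" loop can be sketched as follows. This is a generic greedy alternating heuristic under assumed names, not the authors' actual algorithm: start from random sheets, then sweep over Sheet A and Sheet B, flipping one sign at a time and keeping only flips that lower an importance-weighted reconstruction error.

```python
import numpy as np

# Toy alternating heuristic: flip individual signs in Sheet A, then
# Sheet B, keeping a flip only if it reduces the weighted error.
# Illustrative only; scales are omitted for simplicity.
rng = np.random.default_rng(2)
m, k, n = 6, 4, 6
W = rng.standard_normal((m, n))            # the "painting" to recreate
H = np.abs(rng.standard_normal((m, n)))    # assumed importance map

B1 = rng.choice([-1.0, 1.0], size=(m, k))  # Sheet A (random guess)
B2 = rng.choice([-1.0, 1.0], size=(k, n))  # Sheet B (random guess)

def err(B1, B2):
    # Importance-weighted squared reconstruction error.
    return float((H * (W - B1 @ B2) ** 2).sum())

e0 = err(B1, B2)                           # error of the random guess
e = e0
for _ in range(5):                         # a few alternating sweeps
    for sheet in (B1, B2):                 # adjust Sheet A, then Sheet B
        for idx in np.ndindex(sheet.shape):
            sheet[idx] *= -1               # try flipping one sign
            e_new = err(B1, B2)
            if e_new < e:
                e = e_new                  # flip helped: keep it
            else:
                sheet[idx] *= -1           # flip hurt: undo it
```

Because a flip is kept only when it helps, the error never increases; like the paper's heuristic, this finds a good pair of sheets without solving the NP-hard problem exactly.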
The Results
They tested this on famous AI models (Llama 2 and Llama 3).
- Accuracy: At 2 bits per number (very small), DBF was just as good as the best existing methods, and sometimes better. At 1 bit (extremely small), it was significantly better than anything else.
- Speed: On a standard high-end computer chip (RTX 4090), the model ran 2x to 3x faster than the original, uncompressed version.
- Energy: Since it replaces heavy multiplication with simple addition, it saves a massive amount of energy, which is great for running AI on phones or laptops.
The Bottom Line
This paper says: "You don't need to keep the heavy, complex math to get smart AI."
By breaking the model's brain into two simple, binary layers and using a little bit of "glue" (scaling vectors), we can shrink AI models to fit in our pockets, make them run twice as fast, and save energy, all without losing the ability to write good stories or solve problems. It's like turning a heavy stone statue into a lightweight, foldable paper sculpture that looks exactly the same.