Imagine you are a master chef (the Teacher) who has spent years perfecting a massive cookbook containing millions of recipes. You want to teach a young apprentice (the Student) how to cook, but you can't hand over the entire cookbook. It's too heavy to carry, and the apprentice's kitchen is too small to store it.
Dataset Distillation is the art of shrinking that massive cookbook down to a tiny, super-condensed "cheat sheet" of just a few dozen perfect recipes that still teaches the apprentice everything they need to know.
The Problem: The "Secret Sauce" is Too Heavy
In modern machine learning, the "cheat sheet" isn't just the pictures of the food (the data); it's also the soft labels.
Think of a soft label as a detailed, multi-page critique from the Master Chef. Instead of just saying, "This is a burger," the Chef writes: "This is 80% burger, 15% sandwich, and 5% pizza, with a hint of nostalgia."
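In code, a soft label is just a probability distribution over classes. A minimal sketch (the class names and values here are illustrative, not from the paper):

```python
# A hard label picks exactly one class.
hard_label = "burger"

# A soft label is a probability distribution over all classes,
# capturing how much the image resembles each one (illustrative values).
soft_label = {"burger": 0.80, "sandwich": 0.15, "pizza": 0.05}

# Like any probability distribution, the values sum to 1.
assert abs(sum(soft_label.values()) - 1.0) < 1e-9
```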
To make the apprentice learn really well, the Master Chef doesn't just write one critique per photo. They write hundreds of critiques for every single photo, imagining the food in different lights, angles, and weather conditions (these are called augmentations).
Here's the catch: while the photos (the data) are small, these hundreds of detailed critiques (the soft labels) take up more space than the photos themselves. It's like carrying a suitcase of photos where the real weight is the thick commentary stapled to each one. On huge datasets (like ImageNet-1K), these commentaries become so heavy that they crush the whole system, making the "cheat sheet" impractical to share or store.
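A quick back-of-envelope calculation shows why the labels, not the images, dominate storage. All the numbers below are illustrative assumptions, not the paper's exact figures:

```python
# Illustrative sizes (assumptions for this sketch, not the paper's figures).
num_classes = 1000            # e.g., ImageNet-1K has 1,000 classes
augmentations_per_image = 300 # one soft label per augmented view
bytes_per_value = 2           # half-precision floats

# A small distilled image, e.g., 32x32 RGB stored as raw bytes.
image_bytes = 3 * 32 * 32     # 3,072 bytes

# Soft labels: one full class distribution per augmentation.
label_bytes = augmentations_per_image * num_classes * bytes_per_value  # 600,000 bytes

# Under these assumptions, the labels outweigh the image by roughly 200x.
print(label_bytes / image_bytes)
```

The exact ratio depends on image size, augmentation count, and precision, but the direction is clear: the "commentary" dwarfs the "photo."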
The Solution: The "Codebook" Compression
The authors of this paper, Ali Abbasi and his team, realized they were carrying too much weight. They asked: "Do we really need to write out every single word of the Chef's critique? Can we just give the apprentice a code?"
They invented a Vector-Quantized Autoencoder (VQAE). Here is how it works using a simple analogy:
- The Dictionary (Codebook): Imagine the Master Chef creates a small, special dictionary of "Standard Critique Templates." Instead of writing a unique 10-page essay for every photo, the Chef just says, "This photo matches Template #42."
- The Encoder: A smart assistant looks at the Chef's massive, detailed critique and finds the closest match in the dictionary. It doesn't save the whole essay; it just saves the number 42.
- The Decoder: When the apprentice is ready to learn, they look up Template #42 in their small dictionary. The template is a simplified version of the original critique, but it's close enough to teach the apprentice effectively.
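The dictionary analogy maps directly onto vector quantization. The sketch below quantizes directly in label space to show the codebook idea; the paper's VQAE additionally learns encoder and decoder networks around this step, and all sizes here are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# The codebook: K "standard critique templates", each a vector
# in label space (sizes are illustrative assumptions).
K, dim = 256, 1000
codebook = rng.standard_normal((K, dim)).astype(np.float32)

def encode(soft_label_vec):
    """Encoder: find the nearest codebook entry and keep only its index."""
    dists = np.linalg.norm(codebook - soft_label_vec, axis=1)
    return int(np.argmin(dists))  # store one small integer, not `dim` floats

def decode(index):
    """Decoder: look the template back up in the codebook."""
    return codebook[index]

label = rng.standard_normal(dim).astype(np.float32)
idx = encode(label)      # e.g., "this matches Template #42"
approx = decode(idx)     # a close-enough critique to train on
```

Storing an index into a 256-entry codebook costs one byte, versus thousands of bytes for the full distribution, which is where the compression comes from.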
Why This is a Big Deal
- Massive Savings: Instead of storing a 10-page essay for every photo, they only store a 3-digit number. This shrinks the storage needs by 30 to 40 times compared to previous methods.
- Almost No Loss of Quality: Even though they threw away the "fluff" and kept only the "code," the apprentice still learns nearly as well as with the full, heavy library, retaining over 90% of the original performance.
- Works Everywhere: They tested this on images (like recognizing cats and dogs) and on language models (teaching AI to write text). In the language world, where the vocabulary of possible words is huge (50,000+ tokens), this compression turned a storage need of 112 Gigabytes down to just 200 Megabytes. That's like shrinking a whole library down to a single smartphone!
The Takeaway
This paper solves a hidden bottleneck in AI training. For a long time, researchers focused on making the "photos" smaller, ignoring the fact that the "comments" were the real heavy lifters. By compressing those comments into efficient codes, they made it possible to share and train AI models on massive datasets without needing supercomputers or massive hard drives.
In short: They figured out how to send a "text message" instead of a "novel" to teach an AI, saving massive amounts of space while keeping the lessons just as powerful.