From Fewer Samples to Fewer Bits: Reframing Dataset Distillation as Joint Optimization of Precision and Compactness

This paper introduces QuADD, a unified framework that jointly optimizes dataset compactness and precision through differentiable quantization, demonstrating that balancing sample count and bit allocation significantly enhances information efficiency in dataset distillation.

My H. Dinh, Aditya Sant, Akshay Malhotra, Keya Patani, Shahab Hamidi-Rad

Published 2026-03-04

The Big Problem: The "Too Much Data" Traffic Jam

Imagine you are a teacher trying to teach a student (an AI model) how to recognize animals. You have a massive library of 50,000 photos of cats, dogs, and birds.

The Old Way (Dataset Distillation):
Previously, researchers tried to solve the problem of "too much data" by picking a tiny, perfect handful of photos (say, 10 photos per animal) that represented the whole library. They called this Dataset Distillation.

  • Analogy: It's like trying to summarize a 1,000-page novel by picking just 10 sentences. If you pick the right sentences, the student learns the story perfectly. If you pick the wrong ones, they get confused.

The Flaw:
The old method only cared about how many photos you kept. It assumed every photo was a high-definition, 32-bit masterpiece. But in the real world (like on a smartphone or a sensor in a forest), sending high-definition photos takes a lot of bandwidth and storage. It's like trying to send a 4K movie over a dial-up internet connection.

The New Idea: "From Fewer Samples to Fewer Bits"

The authors of QuADD say: "Stop worrying just about the number of photos. Let's worry about the total size of the data."

They propose a new way to think about efficiency: The Bit Budget.
Imagine you have a strict limit on how much "digital space" you can use to send your lesson.

  • Old Strategy: Send 10 high-definition photos (Huge size).
  • New Strategy: Send 50 low-resolution, sketch-like photos (Same total size, but more variety).

The paper argues that more variety at lower quality is often better than less variety at high quality.
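This trade-off is simple arithmetic: total storage is samples times values per sample times bits per value. A quick sketch (the image size and bit-widths below are illustrative, not figures from the paper):

```python
# Bit-budget arithmetic: total storage = samples x values-per-sample x bits-per-value.
# The 32x32 RGB image size and the bit-widths here are illustrative.

def total_bits(num_samples, values_per_sample, bits_per_value):
    """Storage cost of a distilled dataset, in bits."""
    return num_samples * values_per_sample * bits_per_value

PIXELS = 32 * 32 * 3  # one CIFAR-10-sized RGB image

old = total_bits(10, PIXELS, 32)  # 10 full-precision images
new = total_bits(80, PIXELS, 4)   # 80 heavily quantized images

print(old, new)  # identical budgets (983040 bits each), but 8x the variety
```

Dropping from 32-bit to 4-bit values buys eight times as many samples for the same bit budget, which is exactly the "more variety at lower quality" trade the paper advocates.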

How It Works: The "Smart Sketch" Factory

To make this work, they built a system called QuADD (Quantization-aware Dataset Distillation). Here is how it works, step-by-step:

1. The "Smart Sketch" (Differentiable Quantization)

Usually, if you take a high-quality photo and shrink it to a sketch (lower precision), you lose details, and the AI gets confused.

  • The Innovation: QuADD doesn't just shrink the photo at the end. It teaches the AI to draw the sketch while it's learning.
  • Analogy: Imagine a chef training a student who will only ever own a dull knife. Instead of teaching every recipe with perfect, expensive tools and then handing over the dull knife at the end (which ruins the meal), the chef teaches with the dull knife from the very first lesson. The student learns exactly which ingredients and techniques work best with that specific tool.
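A standard trick for training through a rounding step is the straight-through estimator (STE). Whether QuADD uses exactly this estimator is not stated in this summary, so treat the sketch below as one common way to make quantization differentiable; this NumPy version shows only the forward pass:

```python
import numpy as np

def quantize_ste(x, num_bits=4):
    """Uniform quantization to 2**num_bits levels in [0, 1].

    Forward pass only. In an autodiff framework, the straight-through
    estimator writes the result as
        q = x + stop_gradient(q - x)
    so the gradient of q w.r.t. x is 1, even though round() itself has
    zero gradient almost everywhere. (Illustrative sketch, not QuADD's
    exact quantizer.)
    """
    levels = 2 ** num_bits - 1
    x = np.clip(x, 0.0, 1.0)           # keep values in the quantizer's range
    return np.round(x * levels) / levels

x = np.array([0.03, 0.47, 0.92])
print(quantize_ste(x, num_bits=2))     # snaps to the 4 levels {0, 1/3, 2/3, 1}
```

The key point matches the chef analogy: because the rounding is applied during training (with the STE letting gradients pass through), the distilled data is optimized for how it will actually be stored, not quantized as an afterthought.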

2. The "Adaptive Palette" (Non-Uniform Quantization)

The system uses a clever trick called Adaptive Non-Uniform Quantization.

  • Analogy: Think of a painter's palette.
    • Uniform (Old Way): The painter uses the same-sized blob of paint for everything. A tiny speck of dust gets as much paint as a giant mountain, which wastes paint on the dust and leaves the mountain looking muddy.
    • Adaptive (QuADD Way): The painter looks at the picture. They use tiny, precise dots for the detailed parts (like a cat's whiskers) and big, broad strokes for the simple parts (like the sky).
    • Result: QuADD learns to put the "digital bits" exactly where the information is most important, saving space elsewhere.
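One classic way to place quantization levels where the data is dense is Lloyd's algorithm (1-D k-means). QuADD's actual learned scheme may differ; this sketch is just an illustration of the "adaptive palette" idea, with made-up data:

```python
import numpy as np

def lloyd_levels(values, num_levels=4, iters=20):
    """Place quantization levels with 1-D k-means (Lloyd's algorithm).

    Levels drift toward where the data is dense, instead of being
    evenly spaced. Illustrative only -- not QuADD's actual scheme.
    """
    # Start from quantiles so every level begins near some data.
    levels = np.quantile(values, np.linspace(0.0, 1.0, num_levels))
    for _ in range(iters):
        # Assign each value to its nearest level...
        idx = np.abs(values[:, None] - levels[None, :]).argmin(axis=1)
        # ...then move each level to the mean of its assigned values.
        for k in range(num_levels):
            if np.any(idx == k):
                levels[k] = values[idx == k].mean()
    return np.sort(levels)

rng = np.random.default_rng(0)
# 95% of the values cluster tightly near 0.1; a few outliers sit near 0.9.
data = np.concatenate([rng.normal(0.1, 0.02, 950), rng.normal(0.9, 0.02, 50)])
levels = lloyd_levels(data, num_levels=4)
print(levels)  # three of the four levels land inside the dense cluster
```

Like the painter's adaptive palette, the levels cluster where the detail is: most of them end up near 0.1 (the "whiskers"), with a single level covering the sparse region near 0.9 (the "sky").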

3. The "Sweet Spot" Discovery

The researchers tested this by playing a game: "How many photos vs. how much detail?"

  • They found a Sweet Spot: It is often better to have many low-quality samples than a few high-quality ones.
  • Why? Because AI learns better from seeing many different examples (variety) than from seeing the same perfect example a few times. Even if the examples are "grainy," the sheer number of them helps the AI understand the concept better.
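Under a fixed bit budget, the "photos vs. detail" game is a one-line sweep: halving the bits per value doubles the number of samples you can afford. (Sizes are illustrative; the paper locates the sweet spot empirically, not from this arithmetic.)

```python
PIXELS = 32 * 32 * 3        # one CIFAR-10-sized RGB image
BUDGET = 10 * PIXELS * 32   # budget equal to 10 full-precision images

# Each halving of precision doubles how many samples fit in the budget.
for bits in (32, 16, 8, 4, 2):
    samples = BUDGET // (PIXELS * bits)
    print(f"{bits:2d}-bit values -> {samples:3d} samples under the same budget")
```

The paper's finding is that, along this curve, accuracy often peaks somewhere in the many-samples/low-bits region rather than at the 10-sample, full-precision end.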

The Results: Saving Space Without Losing Smarts

They tested this on two very different things:

  1. Images: Recognizing cats and dogs (CIFAR-10).
  2. Wireless Signals: Helping cell towers find the best signal beam (3GPP data).

The Outcome:

  • Massive Savings: They compressed the data by 10x to 180x (depending on the task).
  • No Loss in Smarts: Despite the data being "grainy" and tiny, the AI models trained on this data performed almost exactly as well as models trained on the massive, high-definition original data.

The Takeaway

This paper changes the goal of AI data compression.

  • Before: "Let's find the fewest number of perfect photos."
  • Now: "Let's find the most efficient way to send information, even if it means sending more 'rough drafts' instead of 'masterpieces'."

It's like realizing that to teach someone a language, you don't need a library of perfect dictionaries; you just need a pocket-sized phrasebook with enough words to get the job done. QuADD gives us that pocket-sized phrasebook for AI.