Dataset Distillation via Committee Voting

This paper proposes Committee Voting for Dataset Distillation (CV-DD), an approach that leverages the collective knowledge and soft labels of multiple models to generate higher-quality, more robust distilled datasets. Across various benchmarks and transfer tasks, these datasets outperform those produced by existing single- and multi-model methods.

Jiacheng Cui, Zhaoyi Li, Xiaochen Ma, Xinyue Bi, Yaxin Luo, Zhiqiang Shen

Published 2026-02-17

Imagine you have a massive library containing millions of books (the original dataset). You want to teach a student (an AI model) everything important from this library, but you don't have the time, money, or space to let them read every single book.

Dataset Distillation is the art of creating a tiny, "super-summary" book that contains all the essential knowledge of the library, allowing the student to learn just as well but in a fraction of the time.

However, there's a catch: if you ask just one librarian to write this summary, they might miss important details, focus too much on their favorite topics, or get confused by the sheer volume of information. Their summary might be biased or incomplete.

This is where the paper "Dataset Distillation via Committee Voting" (CV-DD) comes in.

The Core Idea: The "Committee" Approach

Instead of relying on a single librarian, the authors propose hiring a Committee of Experts.

Think of it like a panel of judges on a talent show. If you have only one judge, their personal taste might skew the results. But if you have five judges with different backgrounds (one loves rock, one loves jazz, one is a technical expert, etc.), and you combine their opinions, you get a much fairer, more accurate, and more robust decision.

In this paper:

  1. The Experts: They use several different AI models (like ResNet, MobileNet, DenseNet) as the "committee members." Each model "sees" the data slightly differently.
  2. The Voting: Instead of letting one model dictate the summary, the committee votes on what the "perfect" summary image should look like.
  3. The Smart Weighting: Not all votes are equal. If one expert has a history of being very accurate (high "prior performance"), their vote counts more. If an expert is struggling, their vote counts less. This ensures the best ideas drive the creation of the summary.
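The weighted voting described above can be sketched in a few lines. This is a toy illustration, not the paper's actual implementation: the models, logits, and accuracy values below are all hypothetical, and the weighting scheme (normalizing prior accuracies into vote weights) is one simple reading of "smart weighting."

```python
import numpy as np

def softmax(logits):
    """Turn raw model scores into a probability distribution."""
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def committee_soft_label(logits_per_model, prior_accuracies):
    """Blend each committee member's prediction, weighting members
    by their prior performance (here, a validation accuracy)."""
    weights = np.asarray(prior_accuracies, dtype=float)
    weights = weights / weights.sum()            # normalize the votes
    probs = np.stack([softmax(l) for l in logits_per_model])
    return np.tensordot(weights, probs, axes=1)  # weighted average

# Three hypothetical committee members scoring one image over 3 classes
logits = [np.array([2.0, 0.5, 0.1]),   # e.g. a ResNet's view
          np.array([1.5, 1.0, 0.2]),   # e.g. a MobileNet's view
          np.array([0.3, 2.2, 0.4])]   # e.g. a DenseNet's view
accs = [0.76, 0.71, 0.68]              # assumed prior accuracies

label = committee_soft_label(logits, accs)
print(label)  # a single soft label: a probability vector summing to 1
```

Because the more accurate members lean toward the first class, the blended soft label does too, even though one member disagrees.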

The Secret Sauce: Two New Tricks

The authors didn't just bring in a committee; they also fixed two major problems that usually happen when trying to summarize data.

1. The "Ghost" Problem (Batch-Specific Soft Labeling)

Imagine you are trying to teach a student using a summary book. The teacher (the AI) gives the student a "soft label"—a hint like, "This picture is 80% likely a cat, 20% a dog."

Usually, the teacher looks at the real library to give these hints. But the summary book (synthetic data) looks slightly different from the real library. It's like the teacher is wearing glasses that make the summary book look blurry compared to the real thing. This causes the hints to be wrong.

The Fix: The authors invented a trick called Batch-Specific Soft Labeling. Instead of looking at the real library through their glasses, the teacher looks directly at the summary book page they are currently teaching from. They adjust their glasses to match the specific page they are holding. This makes the hints much more accurate, helping the student learn better.
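The "adjust the glasses to the page" idea can be made concrete with a toy teacher. This is a minimal sketch, not the paper's network: it assumes the mismatch lives in normalization statistics, and compares labeling a synthetic batch with statistics memorized from real data versus statistics recomputed on that very batch. All names and numbers here are made up for illustration.

```python
import numpy as np

def soft_labels(features, W, b, mean, std):
    """A toy 'teacher': normalize features, then a linear head + softmax."""
    z = (features - mean) / (std + 1e-5)
    logits = z @ W + b
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
W, b = rng.normal(size=(4, 3)), np.zeros(3)

# Statistics memorized from the *real* library (the teacher's "glasses")
real_mean, real_std = np.zeros(4), np.ones(4)

# A synthetic batch whose statistics have drifted from the real ones
synthetic = rng.normal(loc=0.8, scale=1.6, size=(8, 4))

# Global labeling: reuse real-data statistics (mismatched glasses)
global_labels = soft_labels(synthetic, W, b, real_mean, real_std)

# Batch-specific labeling: recompute statistics on this exact batch
batch_labels = soft_labels(synthetic, W, b,
                           synthetic.mean(axis=0), synthetic.std(axis=0))
```

The two labelings disagree precisely because the synthetic batch's statistics differ from the real data's; batch-specific labeling removes that mismatch.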

2. The "Smooth" Learning (Smoothed Learning Rate)

When the committee is writing the summary, they are constantly tweaking the images. If they make big, jerky changes, the summary becomes messy. If they move too slowly, it takes forever.

The Fix: They use a "Smoothed Learning Rate." Think of this like a car approaching a stop sign. Instead of slamming the brakes or coasting too slowly, the car gently slows down in a perfect curve. This helps the committee settle on the perfect summary without overshooting or getting stuck in a bad spot.
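One common way to get this gentle "braking curve" is a cosine-shaped schedule, sketched below. The paper's exact schedule may differ; this is just an illustration of what a smoothed learning rate looks like compared to abrupt step changes.

```python
import math

def smoothed_lr(step, total_steps, lr_max, lr_min=0.0):
    """Cosine-style schedule: the rate decays along a gentle curve
    instead of in sharp steps -- like easing a car to a stop."""
    progress = step / max(1, total_steps)
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * progress))

# Starts at lr_max, glides smoothly down to lr_min, never jumps
schedule = [smoothed_lr(s, total_steps=100, lr_max=0.1) for s in range(101)]
```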

Why Does This Matter?

  • Less Bias: By listening to a diverse group of models, the summary doesn't lean too heavily on one specific way of seeing the world.
  • Better Generalization: The resulting "summary book" works great even if you use a different student (a different AI model) to read it later.
  • Efficiency: It saves massive amounts of computing power and time. You can train powerful AI models on a tiny dataset that was distilled using this method, rather than needing the whole massive library.

The Bottom Line

In plain terms, the paper's advice is: don't ask one person to summarize a million books. Ask a diverse team of experts, let them vote with weight given to whoever is best at what, and make sure they adjust their teaching style to match the specific page they are working on.

The result is a tiny, high-quality dataset that trains AI models faster and more cheaply than using the full dataset, with better accuracy than previous distillation methods.
