The Big Problem: The "Echo Chamber" Effect
Imagine you are trying to teach a student (an AI) how to recognize different animals. Instead of showing them thousands of real photos, you want to create a tiny, perfect "cheat sheet" of just a few synthetic images that contain all the necessary knowledge. This is called Dataset Distillation.
The problem with current methods (like the popular SRe2L) is that they rely on one single teacher to create this cheat sheet.
Think of this like asking one specific art critic to describe a "Dog" to you.
- If that critic only likes Golden Retrievers, they will describe every dog as having golden fur and floppy ears.
- If they only like Chihuahuas, they will describe every dog as tiny and yappy.
Because the synthetic images reflect only one perspective, they end up nearly identical (homogeneous). An AI trained on these boring images then gets confused when it sees a real dog that looks different (like a Great Dane), and it fails to generalize.
The Solution: PRISM (The "Panel of Experts")
The authors of this paper propose PRISM (PRIors from diverse Source Models).
Instead of asking one art critic to describe the dog, PRISM asks a panel of diverse experts to describe it simultaneously.
- Expert A (a Logit Teacher) focuses on the shape and identity (Is it a dog?).
- Expert B (a BN Teacher) focuses on the texture, lighting, and natural feel (Does this look like a real photo?).
- Expert C might be a different type of expert entirely (e.g., a different neural network architecture).
The Magic Trick: Decoupling
In the old way, the same expert had to do both jobs (describe the shape and the texture). In PRISM, they decouple (separate) these jobs.
- They use Teacher A to guide the meaning of the image.
- They use Teacher B (who might be a completely different type of AI) to guide the visual style.
By mixing these different "views" of the world, the resulting synthetic images are much more diverse. You get a Golden Retriever, a Chihuahua, a puppy, and an old dog all in the same "cheat sheet," rather than just ten identical Golden Retrievers.
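The decoupled objective can be sketched as a weighted sum of two terms: a classification loss against the logit teacher's target class, and a penalty pulling the synthetic batch's feature statistics toward the BN teacher's stored running mean and variance. This is a minimal NumPy sketch of the idea, not the paper's actual implementation; the function names and the weighting factor `alpha` are illustrative.

```python
import numpy as np

def cross_entropy(logits, label):
    # semantic term: softmax cross-entropy against the logit
    # teacher's target class ("is this a dog?")
    z = logits - logits.max()                    # for numerical stability
    log_probs = z - np.log(np.exp(z).sum())
    return -log_probs[label]

def bn_alignment(features, running_mean, running_var):
    # style term: match the per-channel mean/variance of the synthetic
    # batch to the BN teacher's stored running statistics
    mu = features.mean(axis=0)
    var = features.var(axis=0)
    return np.sum((mu - running_mean) ** 2) + np.sum((var - running_var) ** 2)

def prism_style_loss(logits, label, features, bn_mean, bn_var, alpha=1.0):
    # the two teachers can come from entirely different architectures;
    # alpha trades off "correct identity" against "natural look"
    return cross_entropy(logits, label) + alpha * bn_alignment(features, bn_mean, bn_var)
```

Because the two terms are independent, swapping in a different BN teacher changes the visual style of the synthesized images without touching the semantic target.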
How It Works in Practice
- The Setup: Imagine you are making a collage of 100 images for the "Dog" category.
- The Old Way (SRe2L): You ask one AI to generate all 100 images. They all end up looking suspiciously similar because the AI has a "bias" toward a specific look.
- The PRISM Way:
- You ask AI Model X to tell you what features make a dog recognizable (the "Logits").
- You ask AI Model Y (which is built differently) to tell you what makes a dog look natural and not like a glitchy computer graphic (the "Batch Normalization" or texture).
- You combine their advice. The result is an image that is both correctly identified as a dog and visually diverse.
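In practice, the synthetic images themselves are the thing being optimized: you start from noise and repeatedly nudge the pixels to reduce the combined objective. The toy loop below is a hedged sketch with made-up stand-in objectives (a 4-pixel "image", a quadratic stand-in for the logit term, simple mean/variance targets for the BN term) and uses numerical gradients; real implementations backpropagate through the frozen teacher networks instead.

```python
import numpy as np

def numerical_grad(f, x, eps=1e-4):
    # central-difference gradient; fine for a 4-dimensional toy
    g = np.zeros_like(x)
    for i in range(x.size):
        d = np.zeros_like(x)
        d.flat[i] = eps
        g.flat[i] = (f(x + d) - f(x - d)) / (2 * eps)
    return g

def loss(x):
    # stand-in for the logit teacher: wants the pixel sum near 2
    logit_term = (x.sum() - 2.0) ** 2
    # stand-in for the BN teacher: wants mean 0.5 and variance 0.25
    bn_term = (x.mean() - 0.5) ** 2 + (x.var() - 0.25) ** 2
    return logit_term + bn_term

# start the "synthetic image" from random noise and descend
rng = np.random.default_rng(0)
x = rng.normal(0.5, 0.3, 4)
for _ in range(800):
    x -= 0.05 * numerical_grad(loss, x)
```

After the loop, the pixels simultaneously satisfy both teachers' demands, which is exactly the "combine their advice" step above.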
Why This Matters (The Results)
The paper tested this on ImageNet-1K, a massive dataset with 1,000 categories.
- Better Accuracy: When they trained new AIs on these diverse PRISM images, the AIs got much higher scores (up to 70.4% accuracy) compared to the old methods.
- More Diversity: They measured how similar the synthetic images were to one another. The old methods produced images that were roughly 90% similar (nearly redundant). PRISM produced images that were much more varied, which helps the AI learn to handle real-world variation.
- Scalability: They managed to do this efficiently on a huge dataset, proving that you don't need to sacrifice speed for quality.
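The homogeneity claim can be quantified with a simple metric: the average pairwise cosine similarity between the synthetic images (or their feature embeddings). A score near 1.0 means the set is nearly redundant; lower scores mean more diversity. A small sketch, with a made-up helper name and no claim to match the paper's exact measurement:

```python
import numpy as np

def mean_pairwise_cosine(images):
    # flatten each image to a vector and L2-normalize
    # (assumes no image is all zeros)
    flat = images.reshape(len(images), -1).astype(float)
    flat = flat / np.linalg.norm(flat, axis=1, keepdims=True)
    sims = flat @ flat.T          # cosine similarity matrix
    n = len(images)
    # average over the off-diagonal pairs only
    return (sims.sum() - n) / (n * (n - 1))
```

Three identical images score 1.0; three mutually orthogonal ones score 0.0, so the metric directly tracks the "ten identical Golden Retrievers" problem described earlier.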
The Takeaway
PRISM is like realizing that to understand the world, you shouldn't listen to just one person. By letting different AI models with different "personalities" and "architectures" teach the synthetic data generation process, the result is a much richer, more robust, and more useful dataset.
It solves the "homogeneity" problem by ensuring the synthetic data isn't just a mirror of one AI's bias, but a mosaic of many different perspectives.