A Dataset is Worth 1 MB

This paper proposes PLADA, a method that cuts dataset transmission costs to under 1 MB. Instead of sending raw pixels, the sender transmits only class labels that point into a reference dataset the receiver already holds, using a pruning mechanism to keep only semantically relevant images while preserving high classification accuracy.

Elad Kimchi Shoshani, Leeyam Gabay, Yedid Hoshen

Published 2026-02-27
📖 4 min read · ☕ Coffee break read

Imagine you are a teacher trying to teach a class of students who are scattered across the globe. Some are in high-tech classrooms with super-fast internet; others are in remote villages with slow, crackling radio connections.

Usually, to teach them a new subject (like "How to identify different types of birds"), you would have to send them a massive textbook full of thousands of photos. But if a student only has a tiny, slow radio, sending that whole book would take weeks, or the signal might break before it arrives.

The Problem:
Sending the actual photos (the "pixels") is too heavy and expensive. Sending just the teacher's notes (a pre-trained model) doesn't work either because every student has a different type of notebook (different computer hardware) and needs to learn in their own way.

The Solution: "Pseudo-Labels as Data" (PLADA)
This paper proposes a brilliant shortcut. Instead of sending the photos, the teacher sends only the answers, assuming the students already have a giant, generic photo album in their pockets.

Here is how it works, broken down into simple steps:

1. The Shared Photo Album (The Reference Dataset)

Imagine every student already has a copy of a massive, famous photo encyclopedia (like ImageNet) sitting on their hard drive. It has 14 million pictures of everything: cats, cars, flowers, mountains, and clouds. They don't need to download this; it's already there.

2. The "Cheat Sheet" (The Payload)

The teacher wants the students to learn about Birds.
Instead of sending 1,000 new bird photos, the teacher opens the students' existing encyclopedia, finds the 1,000 pictures that look most like birds, and writes down a tiny list of labels:

  • Photo #4,502: "Robin"
  • Photo #12,100: "Eagle"
  • Photo #8,999: "Sparrow"

This list of labels is incredibly small. It's like sending a text message instead of a DVD. The whole message might be less than 1 Megabyte (smaller than a single low-res photo).
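To see why the payload stays so small, here is a minimal sketch of what such a "cheat sheet" could look like when serialized. The photo numbers and bird names come from the example above; the JSON encoding is an illustrative assumption, not necessarily the format the paper uses.

```python
import json

# Hypothetical payload: for each selected image, store its index in the
# shared reference encyclopedia plus a class label. These entries mirror
# the example list above and are purely illustrative.
payload = [
    (4502, "Robin"),
    (12100, "Eagle"),
    (8999, "Sparrow"),
]

# Serialize to a compact text message.
message = json.dumps(payload)
print(len(message.encode("utf-8")), "bytes")

# Even 100,000 entries at roughly 10 bytes each is only about 1 MB --
# smaller than a single high-resolution photo.
```

Each entry is just a number and a short string, which is why even a list covering a hundred thousand photos fits in a text-message-sized file.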

3. The "Smart Filter" (Pruning)

Here is the tricky part: The encyclopedia has 14 million photos, but most of them aren't birds. If the teacher just sent labels for every photo, the list would still be too long. Plus, labeling a picture of a toaster as "Bird" would confuse the student.

So, the teacher uses a Smart Filter:

  • They look at the encyclopedia and ask, "Which of these 14 million photos does my 'Bird Expert' AI think looks most like a bird?"
  • They keep only the top 1% (the best matches) and throw away the rest.
  • The Result: The teacher only sends labels for the 140,000 photos that actually look like birds. The rest are ignored.
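The filtering step above can be sketched as a simple top-k selection over model confidences. The random scores below stand in for the "Bird Expert" model's predictions; in the real method they would come from a trained classifier, so treat this as a shape-of-the-idea sketch rather than the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for the teacher model's confidence that each reference image
# matches the target domain (e.g. "looks like a bird"). Random numbers
# here, purely for illustration.
n_reference = 1_000_000
confidences = rng.random(n_reference)

# Keep only the top 1% most confident matches, discard the rest.
keep_fraction = 0.01
k = int(n_reference * keep_fraction)
top_indices = np.argsort(confidences)[-k:]  # indices of the best matches

print(f"kept {len(top_indices)} of {n_reference} images")
```

Only the surviving indices (and their labels) ever need to be transmitted; the other 99% of the encyclopedia is simply never mentioned.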

4. The "Safety Net" (Fixing Imbalance)

Sometimes, the filter gets lazy. It might pick 100 photos of "Robins" but zero photos of "Ostriches" because Robins are easier to spot. This would make the student bad at recognizing Ostriches.

To fix this, the teacher uses a Safety Net: They force the filter to pick at least a few photos for every type of bird, even the rare or hard-to-find ones. This ensures the student learns a balanced lesson.
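One simple way to implement such a safety net is a per-class quota: rank images within each predicted class by confidence and guarantee every class a minimum number of picks. The quota size and the random data below are illustrative assumptions, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: a predicted class and a confidence score for each
# reference image (random stand-ins for a real classifier's outputs).
n_images, n_classes = 100_000, 10
pred_class = rng.integers(0, n_classes, n_images)
confidence = rng.random(n_images)

min_per_class = 50  # force at least this many picks for every class

selected = []
for c in range(n_classes):
    idx = np.where(pred_class == c)[0]                     # images labelled class c
    best = idx[np.argsort(confidence[idx])[::-1]]          # most confident first
    selected.extend(best[:min_per_class].tolist())         # guarantee the quota

# Every class now contributes at least `min_per_class` examples.
counts = np.bincount(pred_class[selected], minlength=n_classes)
print(counts.min())
```

Without the quota, a plain global top-k could return a hundred Robins and zero Ostriches; with it, even hard-to-spot classes are represented in the lesson.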

5. The Result

The student receives a tiny text file (less than 1 MB). They open their local encyclopedia, look up the specific photos mentioned in the text file, and use the labels to train their own "Bird Expert" model.
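The student's side of the protocol is just a lookup: decode the tiny file, then pair each referenced index with the matching image in the local encyclopedia. The dictionary of placeholder strings below stands in for real preloaded images, and the JSON format is the same illustrative assumption as before.

```python
import json

# Hypothetical 1 MB "cheat sheet" received by the student:
# indices into the local reference dataset, plus labels.
message = '[[4502, "Robin"], [12100, "Eagle"], [8999, "Sparrow"]]'
payload = json.loads(message)

# Stand-in for the student's preloaded reference dataset (e.g. ImageNet).
# A real implementation would load actual image files from disk.
reference_dataset = {i: f"<pixels of image {i}>" for i in range(20_000)}

# Rebuild a labelled training set locally -- no pixels were transmitted.
train_set = [(reference_dataset[idx], label) for idx, label in payload]
print(train_set[0][1])  # -> Robin
```

From here the student trains their own "Bird Expert" model on `train_set` exactly as if the photos had been shipped to them.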

Why is this amazing?

  • Speed: Sending 1 MB takes seconds, even on a slow underwater cable or a rover on Mars.
  • Quality: Surprisingly, the students learn almost as well as if they had the original photos. The "best" photos from the generic encyclopedia are often good enough to teach the specific task.
  • Efficiency: It turns a massive data transfer problem into a tiny text message problem.

The Analogy in a Nutshell

Think of it like a Scavenger Hunt.

  • Old Way: The organizer sends you a box of 1,000 specific items you need to find. (Heavy, slow).
  • New Way (PLADA): You already have a giant warehouse of random junk. The organizer just sends you a tiny note saying: "Go to shelf 4, bin 12, and call that item 'Gold'. Go to shelf 9, bin 5, and call that item 'Silver'."

You use your own warehouse and the tiny note to learn the game. The note is so small it fits in a pocket, but it teaches you everything you need to know.

The Bottom Line:
This paper proves that for many tasks, you don't need to send the images to teach a computer. You just need to send the names of the images, provided the computer already has a huge library of images to choose from. It's a massive win for saving bandwidth and energy.
