A Dataset is Worth 1 MB

This paper proposes PLADA, a method that cuts dataset transmission costs to under 1 MB. Instead of sending raw pixels, the sender transmits only class labels that point into a reference dataset the receiver already holds, using a pruning mechanism to keep only semantically relevant images while preserving high classification accuracy.

Elad Kimchi Shoshani, Leeyam Gabay, Yedid Hoshen

Published 2026-02-27
📖 4 min read · ☕ Coffee break read

Imagine you are a teacher trying to teach a class of students who are scattered across the globe. Some are in high-tech classrooms with super-fast internet; others are in remote villages with slow, crackling radio connections.

Usually, to teach them a new subject (like "How to identify different types of birds"), you would have to send them a massive textbook full of thousands of photos. But if a student only has a tiny, slow radio, sending that whole book would take weeks, or the signal might break before it arrives.

The Problem:
Sending the actual photos (the "pixels") is too heavy and expensive. Sending just the teacher's notes (a pre-trained model) doesn't work either because every student has a different type of notebook (different computer hardware) and needs to learn in their own way.

The Solution: "Pseudo-Labels as Data" (PLADA)
This paper proposes a brilliant shortcut. Instead of sending the photos, the teacher sends only the answers, assuming the students already have a giant, generic photo album in their pockets.

Here is how it works, broken down into simple steps:

1. The Shared Photo Album (The Reference Dataset)

Imagine every student already has a copy of a massive, famous photo encyclopedia (like ImageNet) sitting on their hard drive. It has 14 million pictures of everything: cats, cars, flowers, mountains, and clouds. They don't need to download this; it's already there.

2. The "Cheat Sheet" (The Payload)

The teacher wants the students to learn about Birds.
Instead of sending 1,000 new bird photos, the teacher opens the students' existing encyclopedia, finds the 1,000 pictures that look most like birds, and writes down a tiny list of labels:

  • Photo #4,502: "Robin"
  • Photo #12,100: "Eagle"
  • Photo #8,999: "Sparrow"

This list of labels is incredibly small. It's like sending a text message instead of a DVD. The whole message might be less than 1 Megabyte (smaller than a single low-res photo).
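To see why the payload stays so small, here is a minimal sketch of what such a "cheat sheet" could look like when serialized. The photo numbers and bird names come from the example above; the JSON encoding is an illustrative assumption, not necessarily the format the paper uses.

```python
import json

# Hypothetical payload: for each selected image, store its index in the
# shared reference encyclopedia plus a class label. These entries mirror
# the example list above and are purely illustrative.
payload = [
    (4502, "Robin"),
    (12100, "Eagle"),
    (8999, "Sparrow"),
]

# Serialize to a compact text message.
message = json.dumps(payload)
print(len(message.encode("utf-8")), "bytes")

# Even 100,000 entries at roughly 10 bytes each is only about 1 MB --
# smaller than a single high-resolution photo.
```

Each entry is just a number and a short string, which is why even a list covering a hundred thousand photos fits in a text-message-sized file.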

3. The "Smart Filter" (Pruning)

Here is the tricky part: The encyclopedia has 14 million photos, but most of them aren't birds. If the teacher just sent labels for every photo, the list would still be too long. Plus, labeling a picture of a toaster as "Bird" would confuse the student.

So, the teacher uses a Smart Filter:

  • They look at the encyclopedia and ask, "Which of these 14 million photos does my 'Bird Expert' AI think looks most like a bird?"
  • They keep only the top 1% (the best matches) and throw away the rest.
  • The Result: The teacher only sends labels for the 140,000 photos that actually look like birds. The rest are ignored.
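The filtering step above can be sketched as a simple top-k selection over model confidences. The random scores below stand in for the "Bird Expert" model's predictions; in the real method they would come from a trained classifier, so treat this as a shape-of-the-idea sketch rather than the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for the teacher model's confidence that each reference image
# matches the target domain (e.g. "looks like a bird"). Random numbers
# here, purely for illustration.
n_reference = 1_000_000
confidences = rng.random(n_reference)

# Keep only the top 1% most confident matches, discard the rest.
keep_fraction = 0.01
k = int(n_reference * keep_fraction)
top_indices = np.argsort(confidences)[-k:]  # indices of the best matches

print(f"kept {len(top_indices)} of {n_reference} images")
```

Only the surviving indices (and their labels) ever need to be transmitted; the other 99% of the encyclopedia is simply never mentioned.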

4. The "Safety Net" (Fixing Imbalance)

Sometimes, the filter gets lazy. It might pick 100 photos of "Robins" but zero photos of "Ostriches" because Robins are easier to spot. This would make the student bad at recognizing Ostriches.

To fix this, the teacher uses a Safety Net: They force the filter to pick at least a few photos for every type of bird, even the rare or hard-to-find ones. This ensures the student learns a balanced lesson.
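One simple way to implement such a safety net is a per-class quota: rank images within each predicted class by confidence and guarantee every class a minimum number of picks. The quota size and the random data below are illustrative assumptions, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: a predicted class and a confidence score for each
# reference image (random stand-ins for a real classifier's outputs).
n_images, n_classes = 100_000, 10
pred_class = rng.integers(0, n_classes, n_images)
confidence = rng.random(n_images)

min_per_class = 50  # force at least this many picks for every class

selected = []
for c in range(n_classes):
    idx = np.where(pred_class == c)[0]                     # images labelled class c
    best = idx[np.argsort(confidence[idx])[::-1]]          # most confident first
    selected.extend(best[:min_per_class].tolist())         # guarantee the quota

# Every class now contributes at least `min_per_class` examples.
counts = np.bincount(pred_class[selected], minlength=n_classes)
print(counts.min())
```

Without the quota, a plain global top-k could return a hundred Robins and zero Ostriches; with it, even hard-to-spot classes are represented in the lesson.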

5. The Result

The student receives a tiny text file (less than 1 MB). They open their local encyclopedia, look up the specific photos mentioned in the text file, and use the labels to train their own "Bird Expert" model.
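The student's side of the protocol is just a lookup: decode the tiny file, then pair each referenced index with the matching image in the local encyclopedia. The dictionary of placeholder strings below stands in for real preloaded images, and the JSON format is the same illustrative assumption as before.

```python
import json

# Hypothetical 1 MB "cheat sheet" received by the student:
# indices into the local reference dataset, plus labels.
message = '[[4502, "Robin"], [12100, "Eagle"], [8999, "Sparrow"]]'
payload = json.loads(message)

# Stand-in for the student's preloaded reference dataset (e.g. ImageNet).
# A real implementation would load actual image files from disk.
reference_dataset = {i: f"<pixels of image {i}>" for i in range(20_000)}

# Rebuild a labelled training set locally -- no pixels were transmitted.
train_set = [(reference_dataset[idx], label) for idx, label in payload]
print(train_set[0][1])  # -> Robin
```

From here the student trains their own "Bird Expert" model on `train_set` exactly as if the photos had been shipped to them.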

Why is this amazing?

  • Speed: Sending 1 MB takes seconds, even on a slow underwater cable or a rover on Mars.
  • Quality: Surprisingly, the students learn almost as well as if they had the original photos. The "best" photos from the generic encyclopedia are often good enough to teach the specific task.
  • Efficiency: It turns a massive data transfer problem into a tiny text message problem.

The Analogy in a Nutshell

Think of it like a Scavenger Hunt.

  • Old Way: The organizer sends you a box of 1,000 specific items you need to find. (Heavy, slow).
  • New Way (PLADA): You already have a giant warehouse of random junk. The organizer just sends you a tiny note saying: "Go to shelf 4, bin 12, and call that item 'Gold'. Go to shelf 9, bin 5, and call that item 'Silver'."

You use your own warehouse and the tiny note to learn the game. The note is so small it fits in a pocket, but it teaches you everything you need to know.

The Bottom Line:
This paper proves that for many tasks, you don't need to send the images to teach a computer. You just need to send the names of the images, provided the computer already has a huge library of images to choose from. It's a massive win for saving bandwidth and energy.
