Rethinking Representativeness and Diversity in Dynamic Data Selection

This paper proposes a dynamic data selection framework that redefines representativeness as dataset-level feature coverage and diversity as process-level exploration of rare factors. Using sparse autoencoders and a usage-frequency penalty, it achieves over 2x training acceleration while matching or exceeding full-data accuracy across vision and text tasks.

Yuzhe Zhou, Zhenglin Hua, Haiyun Guo, Yuheng Jia

Published 2026-03-06

Imagine you are a chef trying to teach a new apprentice how to cook a perfect meal. You have a massive library of 10,000 recipes (the dataset). If you make the apprentice read every single recipe from start to finish, it will take forever, and they might get bored or confused by the sheer volume.

The goal of Dynamic Data Selection is to pick the best recipes to teach the apprentice, so they learn faster without losing quality.

However, previous methods had two main problems:

  1. The "Geography" Problem: They picked recipes based on how "average" they looked compared to others (like picking the recipe that is right in the middle of the library). This missed important but unique details.
  2. The "Favorite Student" Problem: They kept picking the same few "hard" or "interesting" recipes over and over again because the apprentice struggled with them. This meant the apprentice never learned the other 9,000 recipes, leading to a biased, one-sided understanding.

This paper proposes a new way to pick recipes by rethinking two concepts: Representativeness and Diversity.

1. Rethinking "Representativeness": The "Common Ingredients" Rule

Old Way: "Pick the recipe that looks most like the average recipe in the library."
New Way: "Pick the recipes that contain the most common, essential ingredients found in all good meals."

  • The Analogy: Imagine you are teaching someone to bake. Instead of picking the "average" cake, you want to make sure they master the basics: flour, sugar, and eggs. These are the "high-frequency factors."
  • How they do it: The authors use a special tool called a Sparse Autoencoder (think of it as a super-smart librarian). This librarian breaks every recipe down into tiny "ingredients" (features). It then counts which ingredients appear most often across the whole library.
  • The Result: The system prioritizes recipes that cover these common, essential ingredients first. This ensures the apprentice builds a solid foundation of what "cooking" actually means, rather than just memorizing the middle of the library.
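The "common ingredients" idea above can be sketched in a few lines. This is a minimal illustration, not the authors' exact formulation: it assumes we already have a matrix of sparse autoencoder activations (one row per sample, one column per learned feature), counts how often each feature fires across the dataset, and scores each sample by how many high-frequency features it covers.

```python
import numpy as np

def representativeness_scores(sae_codes: np.ndarray) -> np.ndarray:
    """Score each sample by its coverage of high-frequency SAE features.

    sae_codes: (n_samples, n_features) activation matrix from a sparse
    autoencoder. Shape and scoring rule are illustrative assumptions;
    the paper's exact method may differ.
    """
    active = sae_codes > 0                # which features fire per sample
    feature_freq = active.mean(axis=0)    # dataset-level frequency of each feature
    # A sample is "representative" if the features it activates are the
    # common ones: weight each active feature by its dataset frequency.
    return (active * feature_freq).sum(axis=1)

# Toy example: 4 samples, 3 latent "ingredients".
codes = np.array([
    [1.0, 0.0, 0.0],   # uses only the most common feature
    [1.0, 2.0, 0.0],   # covers two features, including the common one
    [0.0, 0.0, 3.0],   # uses only a rare feature
    [1.0, 0.0, 0.0],
])
scores = representativeness_scores(codes)
```

On this toy data, the sample that covers the common feature plus another scores highest, while the one activating only a rare feature scores lowest, matching the intuition that "flour, sugar, and eggs" come first.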

2. Rethinking "Diversity": The "Rotating Menu" Rule

Old Way: "Pick the most different recipes from the ones you just picked." (This often leads to picking the same few weird recipes repeatedly).
New Way: "Over the course of the training, make sure the apprentice sees every type of ingredient, even the rare ones, by rotating the menu."

  • The Analogy: If you only teach the apprentice how to make "Spicy Tacos" because they are hard to get right, they will never learn to make "Sushi" or "Pasta."
  • The Solution: The authors introduce a "Usage-Frequency Penalty."
    • Imagine a scoreboard. Every time a recipe is picked, its score goes down slightly.
    • If a recipe has been picked too many times, it becomes "tired" and gets pushed to the back of the line.
    • This forces the system to rotate through the library, ensuring that even the rare, weird, or difficult recipes get a turn. This prevents the "Favorite Student" problem where the same few examples dominate the training.
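The scoreboard mechanic above can be sketched as a penalized top-k selection. The penalty weight `lam` and the linear penalty form are illustrative assumptions, not the paper's exact formula; the point is that repeatedly picked samples get "tired" and the selection rotates through the whole pool.

```python
import numpy as np

def select_with_penalty(base_scores, usage_counts, k, lam=0.5):
    """Pick k samples by score minus a usage-frequency penalty.

    lam is a hypothetical penalty strength; each pick lowers a sample's
    future effective score, pushing it toward the back of the line.
    """
    penalized = base_scores - lam * usage_counts
    chosen = np.argsort(penalized)[-k:]   # top-k penalized scores
    usage_counts = usage_counts.copy()
    usage_counts[chosen] += 1             # picked samples get "tired"
    return chosen, usage_counts

# Four samples, two "favorites" with high base scores.
scores = np.array([0.9, 0.8, 0.3, 0.2])
counts = np.zeros(4)
history = []
for _ in range(4):                        # four selection rounds of k=2
    chosen, counts = select_with_penalty(scores, counts, k=2)
    history.append(set(chosen.tolist()))
```

Without the penalty, rounds would pick samples 0 and 1 forever; with it, every sample gets a turn within a few rounds.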

3. The "Curriculum": The Smart Schedule

You don't teach a beginner chef the same way you teach an expert.

  • Early Training (The Foundation): The system focuses heavily on Representativeness. It picks the "common ingredient" recipes to build a strong base.
  • Late Training (The Polish): As the apprentice gets better, the system shifts focus to Diversity. It starts rotating in the rare, tricky recipes to refine the skills and fill in the gaps.

This shift is handled by a smooth Scheduler, like a dimmer switch that slowly turns down the "common" light and turns up the "rare" light over time.
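One plausible way to realize that dimmer switch is a cosine schedule that blends the two scores; the curve and the linear combination below are assumptions for illustration, not the paper's stated scheduler.

```python
import math

def schedule(t: float, total: float) -> tuple:
    """Smoothly shift weight from representativeness to diversity.

    Returns (w_repr, w_div): w_repr decays 1 -> 0 over training while
    w_div rises 0 -> 1, using a cosine 'dimmer switch'.
    """
    progress = min(max(t / total, 0.0), 1.0)
    w_repr = 0.5 * (1.0 + math.cos(math.pi * progress))
    w_div = 1.0 - w_repr
    return w_repr, w_div

def combined_score(repr_score, div_score, t, total):
    """Hypothetical per-sample selection score at training step t."""
    w_r, w_d = schedule(t, total)
    return w_r * repr_score + w_d * div_score

start = schedule(0, 100)     # all weight on common "ingredients"
middle = schedule(50, 100)   # even split
end = schedule(100, 100)     # all weight on rare "recipes"
```

Early in training the combined score is dominated by representativeness; by the end it is dominated by diversity, exactly the beginner-to-expert curriculum described above.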

The Big Picture: Why This Matters

By combining these ideas, the authors created a system that:

  1. Speeds up training: It skips the boring or redundant parts of the library.
  2. Keeps accuracy high: It doesn't miss the important stuff because it focuses on "common factors" first.
  3. Avoids bias: It forces the system to look at the whole library, not just the same few pages.

The Result: In their experiments, this method trained models more than twice as fast as standard methods while achieving the same (or even better) accuracy. It's like getting a master chef's education in half the time by using a smarter, rotating curriculum that covers all the bases without getting stuck on the same few lessons.
