Rethinking Representativeness and Diversity in Dynamic Data Selection

This paper proposes a dynamic data selection framework that redefines representativeness as dataset-level feature coverage and diversity as process-level exploration of rare factors. Using sparse autoencoders and a usage-frequency penalty, it achieves over 2x training acceleration while matching or exceeding full-data accuracy across vision and text tasks.

Yuzhe Zhou, Zhenglin Hua, Haiyun Guo, Yuheng Jia

Published 2026-03-06

Imagine you are a chef trying to teach a new apprentice how to cook a perfect meal. You have a massive library of 10,000 recipes (the dataset). If you make the apprentice read every single recipe from start to finish, it will take forever, and they might get bored or confused by the sheer volume.

The goal of Dynamic Data Selection is to pick the best recipes to teach the apprentice, so they learn faster without losing quality.

However, previous methods had two main problems:

  1. The "Geography" Problem: They picked recipes based on how "average" they looked compared to others (like picking the recipe that is right in the middle of the library). This missed important but unique details.
  2. The "Favorite Student" Problem: They kept picking the same few "hard" or "interesting" recipes over and over again because the apprentice struggled with them. This meant the apprentice never learned the other 9,000 recipes, leading to a biased, one-sided understanding.

This paper proposes a new way to pick recipes by rethinking two concepts: Representativeness and Diversity.

1. Rethinking "Representativeness": The "Common Ingredients" Rule

Old Way: "Pick the recipe that looks most like the average recipe in the library."
New Way: "Pick the recipes that contain the most common, essential ingredients found in all good meals."

  • The Analogy: Imagine you are teaching someone to bake. Instead of picking the "average" cake, you want to make sure they master the basics: flour, sugar, and eggs. These are the "high-frequency factors."
  • How they do it: The authors use a special tool called a Sparse Autoencoder (think of it as a super-smart librarian). This librarian breaks every recipe down into tiny "ingredients" (features). It then counts which ingredients appear most often across the whole library.
  • The Result: The system prioritizes recipes that cover these common, essential ingredients first. This ensures the apprentice builds a solid foundation of what "cooking" actually means, rather than just memorizing the middle of the library.
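The "common ingredients" idea above can be sketched in a few lines. This is a minimal illustration, not the authors' exact formulation: it assumes we already have a matrix of sparse autoencoder activations (one row per sample, one column per learned feature), counts how often each feature fires across the dataset, and scores each sample by how many high-frequency features it covers.

```python
import numpy as np

def representativeness_scores(sae_codes: np.ndarray) -> np.ndarray:
    """Score each sample by its coverage of high-frequency SAE features.

    sae_codes: (n_samples, n_features) activation matrix from a sparse
    autoencoder. Shape and scoring rule are illustrative assumptions;
    the paper's exact method may differ.
    """
    active = sae_codes > 0                # which features fire per sample
    feature_freq = active.mean(axis=0)    # dataset-level frequency of each feature
    # A sample is "representative" if the features it activates are the
    # common ones: weight each active feature by its dataset frequency.
    return (active * feature_freq).sum(axis=1)

# Toy example: 4 samples, 3 latent "ingredients".
codes = np.array([
    [1.0, 0.0, 0.0],   # uses only the most common feature
    [1.0, 2.0, 0.0],   # covers two features, including the common one
    [0.0, 0.0, 3.0],   # uses only a rare feature
    [1.0, 0.0, 0.0],
])
scores = representativeness_scores(codes)
```

On this toy data, the sample that covers the common feature plus another scores highest, while the one activating only a rare feature scores lowest, matching the intuition that "flour, sugar, and eggs" come first.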

2. Rethinking "Diversity": The "Rotating Menu" Rule

Old Way: "Pick the most different recipes from the ones you just picked." (This often leads to picking the same few weird recipes repeatedly).
New Way: "Over the course of the training, make sure the apprentice sees every type of ingredient, even the rare ones, by rotating the menu."

  • The Analogy: If you only teach the apprentice how to make "Spicy Tacos" because they are hard to get right, they will never learn to make "Sushi" or "Pasta."
  • The Solution: The authors introduce a "Usage-Frequency Penalty."
    • Imagine a scoreboard. Every time a recipe is picked, its score goes down slightly.
    • If a recipe has been picked too many times, it becomes "tired" and gets pushed to the back of the line.
    • This forces the system to rotate through the library, ensuring that even the rare, weird, or difficult recipes get a turn. This prevents the "Favorite Student" problem where the same few examples dominate the training.
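The scoreboard mechanic above can be sketched as a penalized top-k selection. The penalty weight `lam` and the linear penalty form are illustrative assumptions, not the paper's exact formula; the point is that repeatedly picked samples get "tired" and the selection rotates through the whole pool.

```python
import numpy as np

def select_with_penalty(base_scores, usage_counts, k, lam=0.5):
    """Pick k samples by score minus a usage-frequency penalty.

    lam is a hypothetical penalty strength; each pick lowers a sample's
    future effective score, pushing it toward the back of the line.
    """
    penalized = base_scores - lam * usage_counts
    chosen = np.argsort(penalized)[-k:]   # top-k penalized scores
    usage_counts = usage_counts.copy()
    usage_counts[chosen] += 1             # picked samples get "tired"
    return chosen, usage_counts

# Four samples, two "favorites" with high base scores.
scores = np.array([0.9, 0.8, 0.3, 0.2])
counts = np.zeros(4)
history = []
for _ in range(4):                        # four selection rounds of k=2
    chosen, counts = select_with_penalty(scores, counts, k=2)
    history.append(set(chosen.tolist()))
```

Without the penalty, rounds would pick samples 0 and 1 forever; with it, every sample gets a turn within a few rounds.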

3. The "Curriculum": The Smart Schedule

You don't teach a beginner chef the same way you teach an expert.

  • Early Training (The Foundation): The system focuses heavily on Representativeness. It picks the "common ingredient" recipes to build a strong base.
  • Late Training (The Polish): As the apprentice gets better, the system shifts focus to Diversity. It starts rotating in the rare, tricky recipes to refine the skills and fill in the gaps.

This shift is handled by a smooth Scheduler, like a dimmer switch that slowly turns down the "common" light and turns up the "rare" light over time.
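One plausible way to realize that dimmer switch is a cosine schedule that blends the two scores; the curve and the linear combination below are assumptions for illustration, not the paper's stated scheduler.

```python
import math

def schedule(t: float, total: float) -> tuple:
    """Smoothly shift weight from representativeness to diversity.

    Returns (w_repr, w_div): w_repr decays 1 -> 0 over training while
    w_div rises 0 -> 1, using a cosine 'dimmer switch'.
    """
    progress = min(max(t / total, 0.0), 1.0)
    w_repr = 0.5 * (1.0 + math.cos(math.pi * progress))
    w_div = 1.0 - w_repr
    return w_repr, w_div

def combined_score(repr_score, div_score, t, total):
    """Hypothetical per-sample selection score at training step t."""
    w_r, w_d = schedule(t, total)
    return w_r * repr_score + w_d * div_score

start = schedule(0, 100)     # all weight on common "ingredients"
middle = schedule(50, 100)   # even split
end = schedule(100, 100)     # all weight on rare "recipes"
```

Early in training the combined score is dominated by representativeness; by the end it is dominated by diversity, exactly the beginner-to-expert curriculum described above.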

The Big Picture: Why This Matters

By combining these ideas, the authors created a system that:

  1. Speeds up training: It skips the boring or redundant parts of the library.
  2. Keeps accuracy high: It doesn't miss the important stuff because it focuses on "common factors" first.
  3. Avoids bias: It forces the system to look at the whole library, not just the same few pages.

The Result: In their experiments, this method trained models more than twice as fast as standard methods while achieving the same (or even better) accuracy. It's like getting a master chef's education in half the time by using a smarter, rotating curriculum that covers all the bases without getting stuck on the same few lessons.
