Changing the Training Data Distribution to Reduce Simplicity Bias Improves In-distribution Generalization

This paper proposes USEFUL, a method that mitigates simplicity bias by strategically upsampling underrepresented data clusters identified early in training, thereby significantly improving in-distribution generalization across various architectures and datasets when combined with gradient descent or sharpness-aware minimization.

Dang Nguyen, Paymon Haddad, Eric Gan, Baharan Mirzasoleiman

Published 2026-03-03

The Big Idea: Teaching a Student to Study Smarter, Not Just Harder

Imagine you are training a student (an AI) to pass a history exam. The student has a natural tendency to be lazy: they want to memorize the easiest, most obvious facts first (like "The Civil War happened in the 1800s") and ignore the complex, nuanced details (like the specific economic causes of the war).

In the world of AI, this is called Simplicity Bias. The AI learns the "fast" features of the data first because they are easy to spot. It ignores the "slow" features because they are harder to learn. The problem? If the exam asks a tricky question that relies on those slow features, the AI fails, even though it memorized the easy stuff perfectly.

This paper introduces a new method called USEFUL (UpSample Early For Uniform Learning). It's a way to tweak the training data so the AI is forced to learn the hard stuff at the same time as the easy stuff, leading to a much smarter, more generalizable student.


The Problem: The "Easy Way Out" Trap

The Analogy: The Tourist in a New City
Imagine you are a tourist in a new city. You want to learn the layout.

  • The "Fast" Features: You immediately notice the big, bright neon signs and the main highways. These are obvious.
  • The "Slow" Features: You ignore the narrow alleyways, the local markets, and the subtle street signs because they are harder to see.

If you only learn the main highways (the easy features), you can get around the city okay during the day. But if you need to find a specific hidden café in an alleyway at night, you are lost. You haven't built a complete map.

In AI training, standard methods (like Gradient Descent) act like this tourist. They grab the easy features first. They get stuck in a "local minimum"—a solution that works okay but isn't the best possible solution.

The Discovery: The "Smart" Optimizer (SAM)

The researchers noticed a different training method called SAM (Sharpness-Aware Minimization).

  • The Analogy: SAM is like a tourist who refuses to just look at the neon signs. They deliberately wander off the main path to check the alleyways while they are still looking at the signs. They want to make sure they understand the whole city, not just the tourist traps.

Mathematically, SAM forces the AI to learn the "slow" features much earlier in the training process. This results in a "flatter" solution—a more robust understanding of the data that works better on new, unseen questions.

The Catch: SAM is computationally expensive. It takes twice as long to train because it has to "wobble" the model to check if the solution is stable.
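That "wobble" has a precise shape: SAM first nudges the weights a small distance ρ in the direction of the current gradient, then computes a second gradient at that perturbed point and uses it for the actual update. The sketch below shows one such update on a toy quadratic loss with an analytic gradient; the loss, matrix, and step sizes are illustrative assumptions, and a real SAM implementation wraps an autograd optimizer instead.

```python
import numpy as np

# Toy quadratic loss 0.5 * w^T A w with one "sharp" and one "flat"
# direction. Illustrative only; real SAM wraps an autograd optimizer.
A = np.diag([10.0, 0.1])


def loss(w):
    return 0.5 * w @ A @ w


def grad(w):
    return A @ w


w = np.array([1.0, 1.0])
lr, rho = 0.05, 0.05  # step size and perturbation ("wobble") radius

for _ in range(100):
    g = grad(w)                                  # gradient 1: same as plain SGD
    eps = rho * g / (np.linalg.norm(g) + 1e-12)  # climb to the worst nearby point
    g_adv = grad(w + eps)                        # gradient 2: at the wobbled point
    w = w - lr * g_adv                           # descend using that gradient
```

The two gradient evaluations per step are exactly why SAM costs roughly twice as much as plain gradient descent.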

The Solution: USEFUL (The Cheat Code)

The authors asked: Can we make the standard, fast method (SGD) act like the smart, slow method (SAM) without actually making it slow?

The Answer: Yes, by changing the training data distribution.

The Analogy: The "Heavy Lifting" Gym
Imagine the AI is a weightlifter.

  • Standard Training: The gym gives them a mix of light weights (easy features) and heavy weights (slow features). The lifter naturally picks up the light weights first because they are easy. By the time they get to the heavy weights, they are exhausted and might not lift them correctly.
  • The USEFUL Strategy:
    1. Identify the Lightweights: The trainer watches the lifter for a few days. They see which weights the lifter picks up immediately (the "fast-learnable" examples).
    2. Identify the Heavyweights: They also see which weights the lifter struggles with or ignores (the "slow-learnable" examples).
    3. The Tweak: The trainer takes the "heavyweight" examples and duplicates them. Now, for every one light weight, there are two heavy weights.
    4. The Result: The lifter is forced to pick up the heavy weights early in the workout because they appear so often. They build strength on the hard stuff while they are still fresh.

How USEFUL Works (Step-by-Step)

  1. The Warm-up: The AI trains for a short time (a few epochs).
  2. The Sorting: The AI looks at its own answers. It groups the examples into two piles:
    • Pile A (Easy): Examples the AI got right immediately (Fast-learnable).
    • Pile B (Hard): Examples the AI is still confused about or getting wrong (Slow-learnable).
  3. The Upsampling: The researchers take Pile B and copy the examples. If there were 100 hard examples, they might turn it into 200.
  4. The Restart: They reset the model's weights and train from scratch on the modified dataset (the original examples plus the duplicated hard ones).

Because the "hard" examples are now more frequent, the AI learns them faster and more evenly. It doesn't skip the hard stuff.
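The four steps above can be sketched on a toy dataset. Everything here is an illustrative assumption, not the paper's exact recipe: the data is synthetic, the model is plain logistic regression, and the "sorting" step uses a simple loss threshold where the paper identifies the piles by examining the model's early predictions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic binary data: feature 0 is a strong, fast-learnable signal
# most examples carry; feature 1 is a weak, slow-learnable signal the
# rest rely on. Sizes and scales are illustrative.
n = 200
y = rng.integers(0, 2, n) * 2 - 1     # labels in {-1, +1}
easy = rng.random(n) < 0.7            # ~70% carry the easy feature
X = np.zeros((n, 2))
X[easy, 0] = 3.0 * y[easy]
X[~easy, 1] = 0.5 * y[~easy]


def train(X, y, epochs, lr=0.1):
    """Full-batch logistic regression from a fresh initialization."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(y * (X @ w)))  # per-example sigmoid(-margin)
        w += lr * (p * y) @ X / len(y)         # gradient step on logistic loss
    return w


def example_losses(X, y, w):
    return np.log1p(np.exp(-y * (X @ w)))


# 1. The Warm-up: a short initial run.
w_warm = train(X, y, epochs=5)

# 2. The Sorting: split examples by how the warm model handles them.
losses = example_losses(X, y, w_warm)
hard = losses > np.median(losses)     # Pile B: slow-learnable examples

# 3. The Upsampling: duplicate the hard pile.
X_up = np.vstack([X, X[hard]])
y_up = np.concatenate([y, y[hard]])

# 4. The Restart: train from scratch on the reweighted dataset.
w_useful = train(X_up, y_up, epochs=50)
```

Note that the restarted run still sees every original example; upsampling only changes how often the hard ones appear, which is what evens out the learning speed of fast and slow features.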

The Results: Why It Matters

The paper tested this on famous image datasets (like CIFAR-10, which has pictures of cats, dogs, cars, etc.).

  • Better Grades: When they used USEFUL, the AI got significantly higher scores on the final test, even on data it had never seen before.
  • Beating the Experts: In many cases, using USEFUL with the standard, fast method (SGD) performed just as well as, or even better than, the expensive, slow method (SAM).
  • The "State-of-the-Art": When they combined USEFUL with other tricks (like data augmentation), they achieved state-of-the-art results for certain models on these datasets.

Summary

The Paper in One Sentence:
We found a way to trick AI into learning the hard, complex parts of a problem early on by simply showing it those hard examples more often, which makes the AI smarter and more accurate without needing expensive computing power.

The Takeaway:
Just like a student who studies the difficult chapters first while they are fresh, an AI that learns "slow" features early creates a more robust, reliable, and intelligent model. USEFUL is the tool that makes that happen.