Changing the Training Data Distribution to Reduce Simplicity Bias Improves In-distribution Generalization

This paper proposes USEFUL, a method that mitigates simplicity bias by strategically upsampling underrepresented data clusters identified early in training, thereby significantly improving in-distribution generalization across various architectures and datasets when combined with gradient descent or sharpness-aware minimization.

Dang Nguyen, Paymon Haddad, Eric Gan, Baharan Mirzasoleiman

Published 2026-03-03

The Big Idea: Teaching a Student to Study Smarter, Not Just Harder

Imagine you are training a student (an AI) to pass a history exam. The student has a natural tendency to be lazy: they want to memorize the easiest, most obvious facts first (like "The Civil War happened in the 1800s") and ignore the complex, nuanced details (like the specific economic causes of the war).

In the world of AI, this is called Simplicity Bias. The AI learns the "fast" features of the data first because they are easy to spot. It ignores the "slow" features because they are harder to learn. The problem? If the exam asks a tricky question that relies on those slow features, the AI fails, even though it memorized the easy stuff perfectly.

This paper introduces a new method called USEFUL (UpSample Early For Uniform Learning). It's a way to tweak the training data so the AI is forced to learn the hard stuff at the same time as the easy stuff, leading to a much smarter, more generalizable student.


The Problem: The "Easy Way Out" Trap

The Analogy: The Tourist in a New City
Imagine you are a tourist in a new city. You want to learn the layout.

  • The "Fast" Features: You immediately notice the big, bright neon signs and the main highways. These are obvious.
  • The "Slow" Features: You ignore the narrow alleyways, the local markets, and the subtle street signs because they are harder to see.

If you only learn the main highways (the easy features), you can get around the city okay during the day. But if you need to find a specific hidden café in an alleyway at night, you are lost. You haven't built a complete map.

In AI training, standard methods (like Gradient Descent) act like this tourist. They grab the easy features first. They get stuck in a "local minimum"—a solution that works okay but isn't the best possible solution.

The Discovery: The "Smart" Optimizer (SAM)

The researchers noticed a different training method called SAM (Sharpness-Aware Minimization).

  • The Analogy: SAM is like a tourist who refuses to just look at the neon signs. They deliberately wander off the main path to check the alleyways while they are still looking at the signs. They want to make sure they understand the whole city, not just the tourist traps.

Mathematically, SAM forces the AI to learn the "slow" features much earlier in the training process. This results in a "flatter" solution—a more robust understanding of the data that works better on new, unseen questions.

The Catch: SAM is computationally expensive. It takes twice as long to train because it has to "wobble" the model to check if the solution is stable.
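That "wobble" has a precise shape: SAM first nudges the weights a small distance ρ in the direction of the current gradient, then computes a second gradient at that perturbed point and uses it for the actual update. The sketch below shows one such update on a toy quadratic loss with an analytic gradient; the loss, matrix, and step sizes are illustrative assumptions, and a real SAM implementation wraps an autograd optimizer instead.

```python
import numpy as np

# Toy quadratic loss 0.5 * w^T A w with one "sharp" and one "flat"
# direction. Illustrative only; real SAM wraps an autograd optimizer.
A = np.diag([10.0, 0.1])


def loss(w):
    return 0.5 * w @ A @ w


def grad(w):
    return A @ w


w = np.array([1.0, 1.0])
lr, rho = 0.05, 0.05  # step size and perturbation ("wobble") radius

for _ in range(100):
    g = grad(w)                                  # gradient 1: same as plain SGD
    eps = rho * g / (np.linalg.norm(g) + 1e-12)  # climb to the worst nearby point
    g_adv = grad(w + eps)                        # gradient 2: at the wobbled point
    w = w - lr * g_adv                           # descend using that gradient
```

The two gradient evaluations per step are exactly why SAM costs roughly twice as much as plain gradient descent.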

The Solution: USEFUL (The Cheat Code)

The authors asked: Can we make the standard, fast method (SGD) act like the smart, slow method (SAM) without actually making it slow?

The Answer: Yes, by changing the training data distribution.

The Analogy: The "Heavy Lifting" Gym
Imagine the AI is a weightlifter.

  • Standard Training: The gym gives them a mix of light weights (easy features) and heavy weights (slow features). The lifter naturally picks up the light weights first because they are easy. By the time they get to the heavy weights, they are exhausted and might not lift them correctly.
  • The USEFUL Strategy:
    1. Identify the Lightweights: The trainer watches the lifter for a few days. They see which weights the lifter picks up immediately (the "fast-learnable" examples).
    2. Identify the Heavyweights: They also see which weights the lifter struggles with or ignores (the "slow-learnable" examples).
    3. The Tweak: The trainer takes the "heavyweight" examples and duplicates them. Now, for every one light weight, there are two heavy weights.
    4. The Result: The lifter is forced to pick up the heavy weights early in the workout because they appear so often. They build strength on the hard stuff while they are still fresh.

How USEFUL Works (Step-by-Step)

  1. The Warm-up: The AI trains for a short time (a few epochs).
  2. The Sorting: The AI looks at its own answers. It groups the examples into two piles:
    • Pile A (Easy): Examples the AI got right immediately (Fast-learnable).
    • Pile B (Hard): Examples the AI is still confused about or getting wrong (Slow-learnable).
  3. The Upsampling: The researchers take Pile B and copy the examples. If there were 100 hard examples, they might turn it into 200.
  4. The Restart: They reset the model's weights and train from scratch on the modified dataset (the original examples plus the duplicated hard ones).

Because the "hard" examples are now more frequent, the AI learns them faster and more evenly. It doesn't skip the hard stuff.
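The four steps above can be sketched on a toy dataset. Everything here is an illustrative assumption, not the paper's exact recipe: the data is synthetic, the model is plain logistic regression, and the "sorting" step uses a simple loss threshold where the paper identifies the piles by examining the model's early predictions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic binary data: feature 0 is a strong, fast-learnable signal
# most examples carry; feature 1 is a weak, slow-learnable signal the
# rest rely on. Sizes and scales are illustrative.
n = 200
y = rng.integers(0, 2, n) * 2 - 1     # labels in {-1, +1}
easy = rng.random(n) < 0.7            # ~70% carry the easy feature
X = np.zeros((n, 2))
X[easy, 0] = 3.0 * y[easy]
X[~easy, 1] = 0.5 * y[~easy]


def train(X, y, epochs, lr=0.1):
    """Full-batch logistic regression from a fresh initialization."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(y * (X @ w)))  # per-example sigmoid(-margin)
        w += lr * (p * y) @ X / len(y)         # gradient step on logistic loss
    return w


def example_losses(X, y, w):
    return np.log1p(np.exp(-y * (X @ w)))


# 1. The Warm-up: a short initial run.
w_warm = train(X, y, epochs=5)

# 2. The Sorting: split examples by how the warm model handles them.
losses = example_losses(X, y, w_warm)
hard = losses > np.median(losses)     # Pile B: slow-learnable examples

# 3. The Upsampling: duplicate the hard pile.
X_up = np.vstack([X, X[hard]])
y_up = np.concatenate([y, y[hard]])

# 4. The Restart: train from scratch on the reweighted dataset.
w_useful = train(X_up, y_up, epochs=50)
```

Note that the restarted run still sees every original example; upsampling only changes how often the hard ones appear, which is what evens out the learning speed of fast and slow features.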

The Results: Why It Matters

The paper tested this on famous image datasets (like CIFAR-10, which has pictures of cats, dogs, cars, etc.).

  • Better Grades: When they used USEFUL, the AI got significantly higher scores on the final test, even on data it had never seen before.
  • Beating the Experts: In many cases, using USEFUL with the standard, fast method (SGD) performed just as well as, or even better than, the expensive, slow method (SAM).
  • The "State-of-the-Art": When they combined USEFUL with other tricks (like data augmentation), they achieved state-of-the-art results for certain models on these datasets.

Summary

The Paper in One Sentence:
We found a way to trick AI into learning the hard, complex parts of a problem early on by simply showing it those hard examples more often, which makes the AI smarter and more accurate without needing expensive computing power.

The Takeaway:
Just like a student who studies the difficult chapters first while they are fresh, an AI that learns "slow" features early creates a more robust, reliable, and intelligent model. USEFUL is the tool that makes that happen.