Scale Dependent Data Duplication

This paper demonstrates that data duplication is scale-dependent: as model capability and corpus size increase, semantically equivalent documents begin to behave like exact duplicates, producing aligned gradients and accelerating semantic collisions. The result is higher training loss for larger models and the need for new scaling laws that predict performance from effective, rather than raw, data size.

Joshua Kazdan, Noam Levi, Rylan Schaeffer, Jessica Chudnovsky, Abhay Puri, Bo He, Mehmet Donmez, Sanmi Koyejo, David Donoho

Published 2026-03-10

Here is an explanation of the paper "Scale Dependent Data Duplication," translated into simple language with everyday analogies.

The Big Idea: The "Echo Chamber" Problem

Imagine you are teaching a child to speak. You have a huge library of books.

  • Small Child (Small Model): If you show them two books that say the same thing but use different words (e.g., "The cat is big" vs. "The feline is large"), the child sees them as two different stories. They learn two different things.
  • Smart Teenager (Large Model): As the child gets smarter, they realize those two sentences mean the exact same thing. If you show them both, they don't learn anything new from the second one. It's just an echo.

The Problem: As AI models get smarter, they start treating "semantic duplicates" (different words, same meaning) as if they were exact copies. This means that even if you have a massive dataset, a super-smart AI might feel like it's reading the same few pages over and over again.

This paper argues that bigger isn't always better if the data isn't diverse enough. In fact, for very smart models, having too much data that sounds different but means the same thing can actually hurt their performance.


Key Concepts Explained with Analogies

1. The "Gradient" (The Teacher's Nudge)

In AI training, the model makes a guess, gets it wrong, and receives a "nudge" (a mathematical signal called a gradient) to correct itself.

  • The Analogy: Imagine a student taking a test.
    • If they get a question wrong, the teacher points to the specific rule they missed.
    • Small Model: If you give the student two different questions that test the same rule, the teacher gives two different nudges, because the student treats each wording as a brand-new problem.
    • Large Model: The smart student realizes, "Hey, these two questions are testing the exact same rule!" The teacher gives the exact same nudge for both.
  • The Finding: The paper proves that as models get bigger, their "nudges" for different-but-similar sentences become identical. They stop learning new things and just repeat the same lesson.
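
The "same nudge" idea can be sketched numerically. Below is a toy illustration, not the paper's actual experiment: two documents are reduced to feature vectors, and we compare the gradients they induce on a tiny logistic model. When the two representations nearly coincide, as paraphrases would inside a capable model, the gradients align almost perfectly.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def grad(w, x):
    # Gradient of the loss -log(sigmoid(w . x)) with respect to w.
    return -(1.0 - sigmoid(w @ x)) * x

def cosine(a, b):
    # Cosine similarity: 1.0 means the two "nudges" point the same way.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
w = rng.normal(size=8)

# "Small model": paraphrases land on unrelated feature vectors.
x1_small, x2_small = rng.normal(size=8), rng.normal(size=8)

# "Large model": paraphrases map to nearly the same representation.
x1_large = rng.normal(size=8)
x2_large = x1_large + 0.01 * rng.normal(size=8)

print("small-model alignment:", cosine(grad(w, x1_small), grad(w, x2_small)))
print("large-model alignment:", cosine(grad(w, x1_large), grad(w, x2_large)))
```

The large-model alignment comes out close to 1.0: once the representations merge, the second document delivers the same lesson as the first.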

2. The "Library of Babel" (Semantic Collisions)

The researchers looked at a massive library of 192 million documents. They asked: "How many of these are actually unique ideas?"

  • The Analogy: Imagine a library where you have 1,000 books.
    • At first, every book seems unique.
    • But as you add more and more books (to 1 million, then 100 million), you start finding that many books are just translations of the same story, or summaries of the same news, or rewrites of the same joke.
  • The Finding: In small libraries, duplicates are rare. But in massive, web-scale libraries, "semantic collisions" (different words, same meaning) pile up far faster than simple counting intuition predicts, and the "unique" content runs out much sooner than the old math suggested.
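
The collision blow-up follows birthday-problem arithmetic. As a hedged sketch (the idea-popularity distribution below is invented for illustration, not measured by the paper): if each document expresses one underlying "idea" drawn from a heavy-tailed popularity distribution, the expected number of colliding pairs grows with the square of the corpus size.

```python
import numpy as np

# Toy idea space: K ideas with Zipf-like (heavy-tailed) popularity.
K = 100_000
probs = 1.0 / np.arange(1, K + 1)
probs /= probs.sum()

# Probability that two randomly drawn documents express the same idea.
match_prob = float(np.sum(probs ** 2))

def expected_collisions(n_docs):
    # Birthday-problem estimate: number of pairs times per-pair match chance.
    return 0.5 * n_docs * (n_docs - 1) * match_prob

for n in (1_000, 100_000, 10_000_000):
    print(f"{n:>10,} docs -> ~{expected_collisions(n):,.0f} colliding pairs")
```

Multiplying the corpus by 100 multiplies the expected collisions by roughly 10,000, which is why "unique" content runs out far faster than linear intuition suggests.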

3. The Synthetic Data Trap

Many companies are trying to solve the "running out of human text" problem by generating new text using AI (Synthetic Data).

  • The Analogy: Imagine you are trying to teach a student by having them read books written by other students.
    • If the first student writes a book, the second copies the style but changes the words, and the third copies that, then eventually you have a million books that all sound like the same person.
  • The Finding: The paper tested this. Synthetic data runs out of "unique ideas" 10 times faster than real human data. If you train a giant AI on AI-generated text, it will hit a wall of repetition very quickly and stop getting smarter.
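
One way to see why synthetic text exhausts its ideas sooner (a qualitative sketch only; the distributions and the factor of 10 here are illustrative assumptions, not the paper's data): sample from two idea-popularity distributions, one diverse and one narrow, and count distinct ideas at the same corpus size.

```python
import numpy as np

def unique_ideas(n_docs, zipf_exponent, K=50_000, seed=0):
    # Sample n_docs documents whose underlying ideas follow a Zipf law;
    # a steeper exponent means the generator leans harder on its
    # favorite ideas (lower diversity).
    rng = np.random.default_rng(seed)
    probs = 1.0 / np.arange(1, K + 1) ** zipf_exponent
    probs /= probs.sum()
    return len(np.unique(rng.choice(K, size=n_docs, p=probs)))

n = 200_000
human = unique_ideas(n, zipf_exponent=1.0)      # diverse "human-like" corpus
synthetic = unique_ideas(n, zipf_exponent=2.0)  # narrow "synthetic-like" corpus
print(f"human-like: {human:,} unique ideas; synthetic-like: {synthetic:,}")
```

At the same document count, the narrow generator has already circled back to its favorite ideas many times over, while the diverse corpus is still producing new ones.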

4. The "Effective Size" (The Real Lesson)

The paper introduces a new way to measure data. Instead of counting how many documents you have, you should count how many unique ideas you have.

  • The Analogy: Imagine judging how much water is in a bucket.
    • Old Way: Counting how many cups you poured in.
    • New Way: Measuring how high the water level actually is.
    • If many of the cups just recycle water already in the bucket (duplicates), the level doesn't rise.
  • The Finding: For a small model, a bucket with 10% duplicates is fine. But for a giant model, that same 10% of duplicates acts like a 50% loss in learning power. The model gets "bored" and stops improving.
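
A minimal sketch of "counting gallons, not drops," using inverse-Simpson diversity as the effective count (an illustrative choice; the paper's exact definition may differ):

```python
from collections import Counter

def effective_size(documents):
    # Inverse-Simpson diversity: 1 / sum(f_i^2), where f_i is the share of
    # the corpus taken up by idea i. Equals N when every idea appears once,
    # and collapses toward 1 when a single idea dominates.
    counts = Counter(documents)
    n = len(documents)
    return 1.0 / sum((c / n) ** 2 for c in counts.values())

unique_corpus = list(range(10))         # 10 documents, 10 distinct ideas
echoey_corpus = [0] * 6 + [1, 2, 3, 4]  # 10 documents, one idea repeated 6x

print(effective_size(unique_corpus))  # close to 10: all ideas distinct
print(effective_size(echoey_corpus))  # close to 2.5: mostly echoes
```

Both corpora hold 10 documents, but the echo-heavy one carries the learning power of only about two and a half.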

Why This Matters for the Future

The "Bitter Lesson" is hitting a wall.
For years, the tech industry believed: "If we just make the model bigger and feed it more data, it will become super-intelligent."

This paper says: Not so fast.
If you feed a super-intelligent model a dataset that is full of "echoes" (semantic duplicates), it won't get smarter. It will just memorize the echoes.

The Solution:

  1. Quality over Quantity: We need to be much more careful about removing "semantic duplicates," not just exact copies.
  2. Better Synthetic Data: If we use AI to generate training data, we must ensure it has high "idea diversity," or we are just feeding the model its own voice.
  3. New Math: We need new formulas to predict how well a model will learn, taking into account that "smart" models get bored of repetitive data much faster than "dumb" models do.
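
Point 3 can be made concrete with a toy duplication-aware scaling law. The functional form and constants below are purely illustrative assumptions, not the paper's fitted law: loss follows a power law in the effective (deduplicated) data count rather than the raw count.

```python
def predicted_loss(n_docs, dup_fraction, alpha=0.3, A=10.0, loss_floor=1.5):
    # Hypothetical scaling law: loss = A * n_eff^(-alpha) + loss_floor,
    # where n_eff discounts the corpus by its duplicate fraction.
    # All constants are made up for illustration.
    n_eff = n_docs * (1.0 - dup_fraction)
    return A * n_eff ** (-alpha) + loss_floor

clean = predicted_loss(1e9, dup_fraction=0.0)
echoey = predicted_loss(1e9, dup_fraction=0.5)
print(f"clean corpus: {clean:.4f}; half-duplicated corpus: {echoey:.4f}")
```

Under this toy law, duplicates shrink the input to the power law, so the same raw document count predicts a strictly higher loss, and the gap matters more the closer the model already is to the floor.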

Summary in One Sentence

As AI models get smarter, they stop seeing the difference between "The cat is big" and "The feline is large," turning massive datasets into small, repetitive loops that stop the AI from learning anything new.