Data Darwinism Part II: DataEvolve -- AI Can Autonomously Evolve Pretraining Data Curation

This paper introduces DataEvolve, an automated framework that iteratively evolves data-curation strategies for heterogeneous pretraining corpora. The resulting Darwin-CC dataset significantly outperforms existing curated corpora on downstream benchmarks, showing that evolutionary optimization can automate large-scale data curation.

Tiantian Mi, Dongming Shan, Zhen Huang, Yiwei Qin, Muhang Xie, Yuxuan Qiao, Yixiu Liu, Chenyang Zhou, Pengfei Liu

Published 2026-03-17

Imagine you are trying to teach a brilliant but very young child (the AI) how to speak and understand the world. You decide to give them a library of every book, blog post, and forum thread ever written on the internet.

The Problem:
This library is a mess. It's full of:

  • Trash: Ads, "Click Here" buttons, and broken links.
  • Noise: Typos, gibberish, and duplicate sentences.
  • Confusion: Medical advice mixed with cooking recipes, or math equations buried inside HTML code.

If you just hand this messy library to the child, they will learn to speak in gibberish, repeat ads, and get confused. This is what happens when AI models are trained on "raw" internet data.

The Old Way (Manual Design):
Previously, experts tried to fix this by acting like librarians. They would manually write rules for different sections of the library:

  • "For the Math section, remove the ads."
  • "For the Medical section, keep the drug names but remove the patient names."
  • "For the Code section, fix the broken brackets."

But the internet is too huge. There are thousands of different "sections" (categories). Writing a unique rulebook for every single one is impossible. It takes too much time and human effort.
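Concretely, such a hand-maintained rulebook might look like the sketch below. The category names and regex rules are purely illustrative (they are not from the paper); the point is that every new category needs yet another hand-written function, which is what does not scale.

```python
import re

# A hypothetical hand-written "rulebook": one cleaning function per
# data category, as a human curator might maintain it.

def clean_math(text: str) -> str:
    # Strip leftover HTML tags that bury the equations.
    return re.sub(r"<[^>]+>", "", text)

def clean_medical(text: str) -> str:
    # Redact patient names (toy pattern), keep drug names intact.
    return re.sub(r"Patient: \w+", "Patient: [REDACTED]", text)

RULEBOOK = {
    "math": clean_math,
    "medical": clean_medical,
    # ...one entry per category; the open web would need thousands.
}

def curate(category: str, text: str) -> str:
    # Fall back to a no-op for categories nobody has written rules for.
    return RULEBOOK.get(category, lambda t: t)(text)
```

The fallback branch is the real problem: any category without its own hand-written entry simply passes through uncleaned.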

The New Way: DataEvolve (The "Darwinian" Approach)
This paper introduces DataEvolve, a system that doesn't just follow rules; it learns how to write the rules itself.

Think of DataEvolve as a super-intelligent, evolutionary gardener trying to grow the perfect garden (the training data) from a pile of wild, overgrown weeds.

Here is how the "Gardener" works, step-by-step:

1. The Observation (The Scout)

The gardener sends out a scout to look at a small patch of weeds (a sample of data). The scout doesn't just look; they report back: "Hey, this patch has too many plastic wrappers (HTML tags), the flowers are mixed with trash, and the math books have torn pages."

2. The Strategy (The Plan)

Based on the scout's report, the gardener writes a cleaning plan (a strategy).

  • Plan A: "Remove all plastic wrappers."
  • Plan B: "Tear out the ads but keep the flowers."

3. The Trial (The Experiment)

The gardener tries the plan on a small patch of weeds first. They don't overhaul the whole garden yet; they test on a few plants to see if it works.

4. The Judgment (The Score)

A judge looks at the cleaned plants.

  • Did it work? Yes, the plastic is gone.
  • Did we lose anything good? No, the flowers are still there.
  • Score: 9/10.
  • New Discovery: The judge notices, "Oh, we also need to fix the torn pages on the math books." This new fact is added to the Experience Pool (the gardener's memory).

5. Evolution (The Improvement)

This is the magic part. The gardener doesn't stop. They take the best plan from the last round, tweak it based on the judge's feedback, and try again.

  • Generation 1: "Remove ads." (Score: 6/10)
  • Generation 10: "Remove ads and fix torn pages." (Score: 8/10)
  • Generation 30: "Remove ads, fix torn pages, and specifically protect medical drug names." (Score: 9.5/10)

Over 30 rounds, the strategy evolves. It gets smarter, more specific, and better at its job without a human ever telling it exactly what to do. The "bad" strategies die out, and the "good" ones survive and reproduce.
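The five-step loop above can be written down in a few lines. Everything in this sketch is a toy stand-in: in the real system an LLM inspects data samples, writes strategy code, and judges the results, whereas here we evolve a simple keyword blacklist with a deterministic mutation so the loop's shape is visible.

```python
def evolve(initial, mutate, score, generations=30):
    """Keep the best-scoring strategy each round; weaker mutants die out."""
    best, best_score = initial, score(initial)
    experience = []  # the "Experience Pool": lessons kept across rounds
    for _ in range(generations):
        candidate = mutate(best, experience)  # Step 2: write a new plan
        s = score(candidate)                  # Steps 3-4: trial + judgment
        if s > best_score:                    # Step 5: only fitter plans survive
            best, best_score = candidate, s
            experience.append(candidate)      # remember what worked
    return best, best_score

# Toy corpus and junk phrases (illustrative, not from the paper).
JUNK = {"buy now", "click here", "subscribe"}
docs = ["click here to win", "the derivative of x^2",
        "buy now!", "mitosis explained"]

def score(blacklist):
    # Reward removing junk documents, penalize losing clean ones.
    removed = [d for d in docs if any(p in d for p in blacklist)]
    is_junk = lambda d: any(p in d for p in JUNK)
    return sum(is_junk(d) for d in removed) - sum(not is_junk(d) for d in removed)

def mutate(blacklist, experience):
    # Deterministic stand-in for an LLM proposing a tweaked strategy.
    for p in sorted(JUNK):
        if p not in blacklist:
            return blacklist | {p}
    return blacklist

best, best_score = evolve(set(), mutate, score)
# best converges to {"buy now", "click here"}: adding "subscribe"
# removes nothing, so that mutation never beats the current score.
```

Note the greedy acceptance rule (`s > best_score`): a strategy that removes junk but also throws away good documents scores lower and dies out, which is exactly the "did we lose anything good?" check from Step 4.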

The Result: Darwin-CC

After running this evolutionary process for 8 different types of data (Math, Medicine, Coding, etc.), the team created Darwin-CC.

  • It's not a rewrite: Unlike other methods that try to rewrite everything into a perfect "textbook style" (which can lose the original flavor), DataEvolve's strategies focus on cleaning. It's like taking a dirty, muddy shirt and washing it until it's clean, rather than melting the shirt down to make a new one.
  • The Proof: When they trained a small AI model on this clean, evolved data, the model became significantly smarter.
  • It got much better at answering broad knowledge and reasoning questions (MMLU).
    • It got much better at medical questions (MedQA).
    • It outperformed other famous datasets that were manually curated by humans.

The Big Takeaway

The paper proves that we don't need a team of thousands of human experts to write rulebooks for every type of data on the internet. Instead, we can build a system that evolves its own cleaning rules.

Just like nature evolves species to fit their environment, DataEvolve evolves data-cleaning strategies to fit the specific "environment" of each data category. It turns the impossible task of manually cleaning the entire internet into an automated, self-improving process.

In short: They taught the AI how to clean its own house, and the house turned out to be much better than if humans had tried to clean it manually.
