Data Darwinism Part II: DataEvolve -- AI Can Autonomously Evolve Pretraining Data Curation

This paper introduces DataEvolve, an automated framework that iteratively evolves data-curation strategies for heterogeneous pretraining corpora. The resulting Darwin-CC dataset significantly outperforms existing curated corpora on downstream benchmarks, showing that evolutionary optimization can automate large-scale data curation.

Tiantian Mi, Dongming Shan, Zhen Huang, Yiwei Qin, Muhang Xie, Yuxuan Qiao, Yixiu Liu, Chenyang Zhou, Pengfei Liu

Published 2026-03-17

Imagine you are trying to teach a brilliant but very young child (the AI) how to speak and understand the world. You decide to give them a library of every book, blog post, and forum thread ever written on the internet.

The Problem:
This library is a mess. It's full of:

  • Trash: Ads, "Click Here" buttons, and broken links.
  • Noise: Typos, gibberish, and duplicate sentences.
  • Confusion: Medical advice mixed with cooking recipes, or math equations buried inside HTML code.

If you just hand this messy library to the child, they will learn to speak in gibberish, repeat ads, and get confused. This is what happens when AI models are trained on "raw" internet data.

The Old Way (Manual Design):
Previously, experts tried to fix this by acting like librarians. They would manually write rules for different sections of the library:

  • "For the Math section, remove the ads."
  • "For the Medical section, keep the drug names but remove the patient names."
  • "For the Code section, fix the broken brackets."

But the internet is too huge. There are thousands of different "sections" (categories). Writing a unique rulebook for every single one is impossible. It takes too much time and human effort.
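Concretely, such a hand-maintained rulebook might look like the sketch below. The category names and regex rules are purely illustrative (they are not from the paper); the point is that every new category needs yet another hand-written function, which is what does not scale.

```python
import re

# A hypothetical hand-written "rulebook": one cleaning function per
# data category, as a human curator might maintain it.

def clean_math(text: str) -> str:
    # Strip leftover HTML tags that bury the equations.
    return re.sub(r"<[^>]+>", "", text)

def clean_medical(text: str) -> str:
    # Redact patient names (toy pattern), keep drug names intact.
    return re.sub(r"Patient: \w+", "Patient: [REDACTED]", text)

RULEBOOK = {
    "math": clean_math,
    "medical": clean_medical,
    # ...one entry per category; the open web would need thousands.
}

def curate(category: str, text: str) -> str:
    # Fall back to a no-op for categories nobody has written rules for.
    return RULEBOOK.get(category, lambda t: t)(text)
```

The fallback branch is the real problem: any category without its own hand-written entry simply passes through uncleaned.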

The New Way: DataEvolve (The "Darwinian" Approach)
This paper introduces DataEvolve, a system that doesn't just follow rules; it learns how to write the rules itself.

Think of DataEvolve as a super-intelligent, evolutionary gardener trying to grow the perfect garden (the training data) from a pile of wild, overgrown weeds.

Here is how the "Gardener" works, step-by-step:

1. The Observation (The Scout)

The gardener sends out a scout to look at a small patch of weeds (a sample of data). The scout doesn't just look; they report back: "Hey, this patch has too many plastic wrappers (HTML tags), the flowers are mixed with trash, and the math books have torn pages."

2. The Strategy (The Plan)

Based on the scout's report, the gardener writes a cleaning plan (a strategy).

  • Plan A: "Remove all plastic wrappers."
  • Plan B: "Tear out the ads but keep the flowers."

3. The Trial (The Experiment)

The gardener tries the plan on a small patch of weeds first. They don't overhaul the whole garden yet; they test on a few plants to see if it works.

4. The Judgment (The Score)

A judge looks at the cleaned plants.

  • Did it work? Yes, the plastic is gone.
  • Did we lose anything good? No, the flowers are still there.
  • Score: 9/10.
  • New Discovery: The judge notices, "Oh, we also need to fix the torn pages on the math books." This new fact is added to the Experience Pool (the gardener's memory).

5. Evolution (The Improvement)

This is the magic part. The gardener doesn't stop. They take the best plan from the last round, tweak it based on the judge's feedback, and try again.

  • Generation 1: "Remove ads." (Score: 6/10)
  • Generation 10: "Remove ads and fix torn pages." (Score: 8/10)
  • Generation 30: "Remove ads, fix torn pages, and specifically protect medical drug names." (Score: 9.5/10)

Over 30 rounds, the strategy evolves. It gets smarter, more specific, and better at its job without a human ever telling it exactly what to do. The "bad" strategies die out, and the "good" ones survive and reproduce.
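The five-step loop above can be written down in a few lines. Everything in this sketch is a toy stand-in: in the real system an LLM inspects data samples, writes strategy code, and judges the results, whereas here we evolve a simple keyword blacklist with a deterministic mutation so the loop's shape is visible.

```python
def evolve(initial, mutate, score, generations=30):
    """Keep the best-scoring strategy each round; weaker mutants die out."""
    best, best_score = initial, score(initial)
    experience = []  # the "Experience Pool": lessons kept across rounds
    for _ in range(generations):
        candidate = mutate(best, experience)  # Step 2: write a new plan
        s = score(candidate)                  # Steps 3-4: trial + judgment
        if s > best_score:                    # Step 5: only fitter plans survive
            best, best_score = candidate, s
            experience.append(candidate)      # remember what worked
    return best, best_score

# Toy corpus and junk phrases (illustrative, not from the paper).
JUNK = {"buy now", "click here", "subscribe"}
docs = ["click here to win", "the derivative of x^2",
        "buy now!", "mitosis explained"]

def score(blacklist):
    # Reward removing junk documents, penalize losing clean ones.
    removed = [d for d in docs if any(p in d for p in blacklist)]
    is_junk = lambda d: any(p in d for p in JUNK)
    return sum(is_junk(d) for d in removed) - sum(not is_junk(d) for d in removed)

def mutate(blacklist, experience):
    # Deterministic stand-in for an LLM proposing a tweaked strategy.
    for p in sorted(JUNK):
        if p not in blacklist:
            return blacklist | {p}
    return blacklist

best, best_score = evolve(set(), mutate, score)
# best converges to {"buy now", "click here"}: adding "subscribe"
# removes nothing, so that mutation never beats the current score.
```

Note the greedy acceptance rule (`s > best_score`): a strategy that removes junk but also throws away good documents scores lower and dies out, which is exactly the "did we lose anything good?" check from Step 4.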

The Result: Darwin-CC

After running this evolutionary process for 8 different types of data (Math, Medicine, Coding, etc.), the team created Darwin-CC.

  • It's not a rewrite: Unlike other methods that try to rewrite everything into a perfect "textbook style" (which can lose the original flavor), DataEvolve's strategies focus on cleaning. It's like taking a dirty, muddy shirt and washing it until it's clean, rather than melting the shirt down to make a new one.
  • The Proof: When they trained a small AI model on this clean, evolved data, the model became significantly smarter.
  • It got much better at answering broad knowledge and reasoning questions (MMLU).
    • It got much better at medical questions (MedQA).
    • It outperformed other famous datasets that were manually curated by humans.

The Big Takeaway

The paper proves that we don't need a team of thousands of human experts to write rulebooks for every type of data on the internet. Instead, we can build a system that evolves its own cleaning rules.

Just like nature evolves species to fit their environment, DataEvolve evolves data-cleaning strategies to fit the specific "environment" of each data category. It turns the impossible task of manually cleaning the entire internet into an automated, self-improving process.

In short: They taught the AI how to clean its own house, and the house turned out to be much better than if humans had tried to clean it manually.
