Imagine you are trying to teach a student (an AI model) a new language (Portuguese) using a limited number of textbooks. You have two main questions:
- Does it matter if the textbooks are good quality?
- Can a smart tutor rewrite those textbooks to make them better, even if the originals were messy?
This paper, titled "Synthetic Rewriting as a Quality Multiplier," answers these questions by running a massive experiment with AI. Here is the story of what the researchers found, explained simply.
The Setup: The Library and the Tutor
The researchers started with a giant library of Portuguese text called ClassiCC-PT. They sorted this library into two piles:
- The "Gold" Pile: High-quality documents (like textbooks, scientific papers, and educational materials).
- The "Rough" Pile: Lower-quality documents (like messy forum posts, spam, or poorly written blogs).
They took a small, equal amount of text from each pile (10 billion words) to use as their "base curriculum."
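The sorting and sampling step above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the summary does not say how documents were scored, so the `score` field, the `threshold` value, and both function names are hypothetical, and words are counted by naive whitespace splitting.

```python
import random

def split_by_quality(docs, threshold):
    """Partition documents into (gold, rough) piles by a quality score.

    The 'score' field is a hypothetical stand-in for whatever quality
    signal was actually used to sort the library.
    """
    gold = [d for d in docs if d["score"] >= threshold]
    rough = [d for d in docs if d["score"] < threshold]
    return gold, rough

def sample_to_budget(pile, budget_words, seed=0):
    """Shuffle a pile and keep documents while they fit in the word budget."""
    rng = random.Random(seed)
    shuffled = pile[:]          # copy so the caller's list is not reordered
    rng.shuffle(shuffled)
    kept, used = [], 0
    for doc in shuffled:
        n_words = len(doc["text"].split())
        if used + n_words > budget_words:
            continue            # skip documents that would exceed the budget
        kept.append(doc)
        used += n_words
    return kept, used

# Toy usage: the same budget is applied to both piles.
docs = [
    {"text": "a b c", "score": 0.9},
    {"text": "d e", "score": 0.2},
    {"text": "f g h i", "score": 0.8},
    {"text": "j", "score": 0.1},
]
gold, rough = split_by_quality(docs, threshold=0.5)
gold_sample, gold_words = sample_to_budget(gold, budget_words=4)
rough_sample, rough_words = sample_to_budget(rough, budget_words=4)
```

Using one budget for both piles is what keeps the later comparison fair: any difference between the students then comes from the quality of the text, not the amount of it.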
Then, they brought in a Super Tutor (a 7-billion-parameter AI model). The Tutor's job was to take every single document from both piles and rewrite it in four different styles:
- Easy: Simplified for a child.
- Medium: Clean and encyclopedic (like Wikipedia).
- Hard: Technical and formal.
- Q&A: Turned into a quiz format.
This process created a huge amount of new synthetic data (about 40 billion words total) for the students to learn from.
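As a rough sketch of the rewriting step, the four styles can be thought of as four instructions applied to every document. The instruction wording, the `STYLES` table, and both function names below are hypothetical and are not the prompts used in the paper; the size estimate also assumes each rewrite is about as long as its source.

```python
# Hypothetical instructions, one per rewriting style described above.
STYLES = {
    "easy": "Rewrite the text in simple language suitable for a child.",
    "medium": "Rewrite the text in a clean, encyclopedic style.",
    "hard": "Rewrite the text in a technical, formal register.",
    "qa": "Turn the text into a set of questions and answers.",
}

def build_prompts(document):
    """Return one rewriting prompt per style for a single document."""
    return {
        style: f"{instruction}\n\nText:\n{document}"
        for style, instruction in STYLES.items()
    }

def synthetic_words(base_words, n_styles=len(STYLES)):
    """Estimate synthetic volume, assuming rewrites match source length."""
    return base_words * n_styles

prompts = build_prompts("O Brasil é o maior país da América do Sul.")
estimate = synthetic_words(10_000_000_000)  # 40_000_000_000
```

Under that length assumption, four styles applied to 10 billion words of source text give roughly 40 billion words of synthetic text, which matches the scale mentioned above.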
The Students: Two Different AI Models
They tested this curriculum on two different "students" (AI models):
- The Small Student (1.1 Billion parameters): A smaller, less powerful brain.
- The Big Student (7 Billion parameters): A larger, smarter brain.
They taught these students using the rewritten data and then tested them on a massive Portuguese exam called PoETa V2, which covers 44 different tasks such as math, reasoning, and understanding text.
The Big Discovery: Rewriting is a "Quality Multiplier"
Here is the most important lesson from the paper, using a simple analogy:
Imagine you have a recipe.
- High-Quality Data is like a recipe written by a Michelin-star chef.
- Low-Quality Data is like a recipe scribbled on a napkin with missing ingredients.
- Rewriting is like a food critic tasting the recipe and rewriting the instructions to be clearer and more precise.
What happened with the Big Student (7B)?
- Chef's Recipe + Critic's Rewrite: The student learned incredibly well. The rewrite acted like a supercharger, boosting performance significantly. The critic made the good recipe even better.
- Napkin Recipe + Critic's Rewrite: The student barely improved. Even though the critic tried to fix the napkin recipe, the missing ingredients (bad source data) meant the final dish was still mediocre. The rewrite couldn't fix the fundamental flaws.
The Conclusion: Rewriting doesn't turn trash into treasure. Instead, it acts as a multiplier. If you start with gold, rewriting turns it into platinum. If you start with lead, rewriting just gives you slightly shinier lead.
The Twist: Size Matters (The Small Student)
When they tested the Small Student (1.1B), the rules changed.
- The Small Student didn't care much about the rewriting.
- In fact, the Small Student did just as well (or sometimes better) with the original, messy napkin recipes as it did with the rewritten, polished ones.
Why?
Think of the Small Student as a child who learns best by seeing many different examples, even if they are messy. The rewritten data was too "clean" and "structured," which actually made it less diverse. The Small Student needed the raw chaos of the internet to learn patterns, while the Big Student needed the clean, structured data to learn deep concepts.
The Verdict
- Don't try to fix bad data with AI rewriting. If your source data is low quality, rewriting it won't save you. You need to curate (select) good data first.
- Rewriting is a force multiplier for good data. If you already have high-quality Portuguese text, rewriting it in different styles makes your AI model significantly smarter.
- Bigger models need cleaner data. The larger, smarter AI models benefit the most from this "quality multiplier" effect. Smaller models might actually prefer raw, diverse data over polished rewrites.
In short: Synthetic rewriting is like a high-end photo filter. It makes a beautiful photo look stunning, but it can't turn a blurry, dark photo into a masterpiece. You still need to start with a good picture.