Rewriting Pre-Training Data Boosts LLM Performance in Math and Code

This paper introduces SwallowCode and SwallowMath, two openly licensed datasets created through a systematic four-stage rewriting pipeline that transforms public code and math data into high-quality, self-contained examples, significantly boosting LLM performance in program synthesis and mathematical reasoning within a fixed training budget.

Kazuki Fujii, Yukito Tajima, Sakae Mizuki, Masaki Kawamura, Hinari Shimada, Taihei Shiotani, Koshiro Saito, Masanari Oi, Taishi Nakamura, Takumi Okamoto, Shigeki Ishida, Kakeru Hattori, Youmi Ma, Hiroya Takamura, Rio Yokota, Jun Sakuma, Naoaki Okazaki

Published 2026-03-03

Imagine you are trying to teach a brilliant but inexperienced apprentice (a Large Language Model, or LLM) how to be a master chef. The apprentice is smart, but their current skills are limited because the recipe books they've been reading are messy, full of typos, missing ingredients, and sometimes even dangerous instructions.

This paper, "Rewriting Pre-Training Data Boosts LLM Performance in Math and Code," is about a team of researchers who decided: "Instead of just throwing away the bad recipe books, let's hire a team of expert editors to fix them, clean them up, and rewrite them so they are perfect."

Here is the story of how they did it, broken down into simple concepts.

1. The Problem: The "Noisy" Library

Currently, AI models learn by reading massive amounts of data from the internet. Think of this data as a giant, chaotic library.

  • The Code Library: It has millions of Python scripts, but many are broken, have bad formatting, or are missing crucial context (like a recipe that says "add salt" but doesn't say how much or when).
  • The Math Library: It has math problems, but they are buried under website ads, timestamps, and confusing formatting.

Previous methods tried to fix this by filtering: they would just throw away the bad books and keep only the "good" ones. But aggressive filtering is like throwing away most of the library because many of the books have a few torn pages. You lose a lot of potential knowledge.

2. The Solution: The "Transform-and-Retain" Method

The researchers, from Tokyo Tech and AIST, introduced a new philosophy: Don't just filter; fix.

They created two new datasets: SwallowCode (for programming) and SwallowMath (for math). Instead of deleting the messy data, they used a strong open model (Llama-3.3-70B-Instruct) to act as a "Chief Editor" and rewrite the data into something far better.

🐍 SwallowCode: The Code Refinery

Imagine a pile of messy, handwritten code snippets. The researchers built a four-stage assembly line to clean them up:

  1. The Syntax Check (The Safety Inspector): First, they check whether the code is even valid Python, i.e., whether it parses without syntax errors. If it can't be parsed, it gets tossed. (Note: nothing is executed at this stage; it's a compile-time check.)
  2. The Style Check (The Fashion Police): They score the code with a tool called pylint to check if it looks professional. Is the indentation right? Are the variable names clear? Snippets that score below a quality threshold are filtered out at this stage; the actual rewriting happens next.
  3. The Style Rewrite (The Editor): Here, the AI steps in. It rewrites the code to follow strict style guides (like Google's). It renames confusing variables, adds clear comments, and organizes the code so it's easy to read.
    • Analogy: It's like taking a rough draft of a novel and turning it into a polished bestseller with perfect grammar and flow.
  4. The Optimization Rewrite (The Engineer): This is the magic step. The AI looks at the logic.
    • Does the code rely on a library that doesn't exist? Fix it.
    • Is the algorithm slow and inefficient? Make it faster.
    • Is the example too simple to be useful? Make it a real-world problem.
    • Result: The code is now "self-contained" (it works on its own) and "optimized" (it's the best way to solve the problem).
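The two deterministic gates above (stages 1 and 2) can be sketched in a few lines. This is an illustrative approximation, not the paper's released pipeline code: the `passes_syntax_check` helper name is made up for this sketch, and the pylint scoring step is shown only as a comment since it requires the linter to be installed.

```python
import ast

def passes_syntax_check(source: str) -> bool:
    """Stage 1 (sketch): keep only snippets that parse as valid Python.

    Parsing with `ast` approximates a compile-time validity check
    without ever executing the snippet.
    """
    try:
        ast.parse(source)
        return True
    except SyntaxError:
        return False

# Stage 2 would then score the survivors with pylint and drop any
# snippet whose score falls below a quality threshold, e.g.:
#   pylint --score=y snippet.py   ->  keep if score >= threshold

good = "def add(a, b):\n    return a + b\n"
bad = "def add(a, b)\n    return a + b\n"   # missing colon -> SyntaxError
print(passes_syntax_check(good))  # True
print(passes_syntax_check(bad))   # False
```

Stages 3 and 4 are where the LLM takes over: each surviving snippet is sent to the rewriting model with a style-focused prompt, then a second optimization-focused prompt.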

The Result: The output is SwallowCode, a pristine dataset of roughly 16 billion high-quality code tokens. A model trained on it got dramatically better at program synthesis, scoring about 17 points higher on the HumanEval coding benchmark than models trained on the best previously filtered data.

➗ SwallowMath: The Math Tutor

For math, the process was similar but focused on clarity.

  • The Problem: Math problems found online often have website headers, "Posted on: [Date]" footers, and missing steps in the solution.
  • The Fix: The AI acts like a strict but helpful tutor. It strips away the website junk, fills in the missing context, and rewrites the solution to be a clear, step-by-step explanation.
  • The Result: The AI learned to solve math problems markedly better, scoring roughly 12 points higher on the GSM8K benchmark than models trained on the unrewritten data.
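As a toy illustration of the cleanup described above, the sketch below strips the kind of web boilerplate that surrounds scraped math problems. The patterns and the `strip_web_junk` helper are hypothetical examples for demonstration; in the actual pipeline, the rewriting model itself handles both the cleanup and the step-by-step reformulation of solutions.

```python
import re

# Hypothetical boilerplate patterns one might strip from scraped
# math pages before (or instead of) an LLM rewrite.
BOILERPLATE = [
    re.compile(r"^Posted on:.*$", re.MULTILINE),
    re.compile(r"^Share this:.*$", re.MULTILINE),
]

def strip_web_junk(text: str) -> str:
    """Remove known boilerplate lines and collapse leftover blank lines."""
    for pattern in BOILERPLATE:
        text = pattern.sub("", text)
    return re.sub(r"\n{3,}", "\n\n", text).strip()

raw = "Posted on: 2021-04-01\nQ: What is 7*8?\nA: 56\nShare this: Twitter"
print(strip_web_junk(raw))  # Q: What is 7*8?
                            # A: 56
```

A real pipeline would pair a cleanup pass like this with the LLM rewrite that fills in missing solution steps, which regexes alone cannot do.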

3. Why This Matters: The "Data Quality Gap"

The paper points out a big issue in the AI world: The biggest, most powerful AI models (like those from big tech companies) are trained on secret, high-quality data that we can't see. The open-source community is stuck with "noisy" public data.

This creates a "Data Quality Gap." It's like one student has a private tutor and a library of perfect textbooks, while another student has to study from a messy, torn-up library.

By releasing SwallowCode and SwallowMath for free, the researchers are handing the open-source community a set of "perfect textbooks." They proved that you don't need a bigger computer or a smarter model architecture; you just need better data.

4. The Takeaway: Quality Over Quantity

The most important lesson from this paper is that more data isn't always better; better data is.

  • Old Way: "Let's read 100 books, even if 50 of them are garbage."
  • New Way: "Let's take those 100 books, hire an editor to fix the bad ones, and teach the student with the 100 perfect books."

The researchers showed that by spending time and computing power to rewrite the data, they got much better results than just filtering out the bad stuff. They turned a "good enough" dataset into a "world-class" dataset, proving that if you feed an AI high-quality, clean, and logical information, it will learn to think and create much better.

In short: They didn't just clean the house; they renovated the whole building, and now the AI lives in a much better home.
