Imagine you have a brilliant, super-smart robot tutor (a Large Language Model) that has been trained to solve incredibly difficult math problems. But there's a catch: this robot only speaks English.
If you want to teach this robot to help students in Tokyo, Berlin, or São Paulo, you can't just ask it to "speak their language." You have to teach it how to think in those languages. Currently, the best training materials for these robots are all in English. It's like trying to teach a chef French cuisine using only a cookbook written in English, hoping they'll figure out the French terms on their own. It doesn't work well.
This paper introduces mAceReason-Math, a massive new library of math problems designed to fix this. Here is the story of how they built it, explained simply:
1. The Problem: The "English-Only" Barrier
For a few years, researchers have been using a special training method called RLVR (Reinforcement Learning with Verifiable Rewards). Think of this as a video game where the robot tries to solve a math problem, gets a "gold star" if it's right, and learns from its mistakes.
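The "gold star" idea can be sketched in a few lines: because math answers are checkable, the reward is just a string match against the known answer, with no human judge in the loop. This is an illustrative sketch, not the paper's pipeline; real RLVR setups also normalize equivalent answers (fractions, LaTeX variants, etc.), and the `\boxed{}` convention is an assumption here.

```python
# Minimal sketch of a "verifiable reward": check the model's final answer
# against the known correct one. Illustrative only.

def extract_final_answer(completion: str) -> str:
    """Pull the text inside \\boxed{...}, a common final-answer convention."""
    marker = r"\boxed{"
    start = completion.rfind(marker)
    if start == -1:
        return ""
    start += len(marker)
    end = completion.find("}", start)
    return completion[start:end].strip()

def verifiable_reward(completion: str, gold_answer: str) -> float:
    """1.0 ('gold star') if the final answer matches, else 0.0."""
    return 1.0 if extract_final_answer(completion) == gold_answer else 0.0

print(verifiable_reward(r"... so the total is \boxed{42}", "42"))  # 1.0
print(verifiable_reward(r"... I think it's \boxed{41}", "42"))     # 0.0
```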
To make the robot really good, the problems need to be just the right difficulty—like a video game level that is challenging but not impossible. The best "levels" (datasets) exist, but they are all in English. Previous attempts to translate these problems were either too easy (like kindergarten math) or too messy to be useful for training a super-advanced robot.
2. The Solution: A High-Quality Translation Factory
The team created a dataset with over 140,000 math problems translated into 14 different languages (including German, Chinese, Japanese, Russian, and even Swahili).
But they didn't just use a basic translator app. They built a hybrid translation factory with three strict steps:
Step 1: The Cleanup Crew (Filtering)
Before translating, they had to clean the original English problems. Imagine a pile of old homework papers. Some had the answers written on the back, some had torn pages, and some referenced diagrams that were missing.
- They threw out the "broken" papers (about 4%).
- They fixed the "messy" papers (about 11%) by removing weird formatting or instructions like "Task 5.4" that didn't belong in the actual question.
Step 2: The AI Translators (The First Draft)
They used a powerful AI (Claude Sonnet 4) to translate the cleaned problems. But AI can be tricky with math: it might accidentally change a number, or turn a formula into words, which breaks the math.
- They gave the AI strict rules: "Do not touch the numbers or the formulas. Only translate the words around them."
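One common way to enforce "don't touch the math" is to mask formulas with placeholders before translation and restore them afterwards. This is a sketch of that general technique, not the paper's actual pipeline; the string replacement below merely stands in for the LLM call.

```python
# Mask-and-restore sketch: formulas become [MATH0], [MATH1], ... so the
# translator only ever sees the surrounding words.
import re

MATH = re.compile(r"\$[^$]+\$")  # inline LaTeX like $x^2 - 4 = 0$

def mask_math(text):
    """Replace each formula with a placeholder the translator must not touch."""
    formulas = []
    def stash(match):
        formulas.append(match.group(0))
        return f"[MATH{len(formulas) - 1}]"
    return MATH.sub(stash, text), formulas

def unmask_math(text, formulas):
    """Put the original formulas back after translation."""
    for i, formula in enumerate(formulas):
        text = text.replace(f"[MATH{i}]", formula)
    return text

masked, formulas = mask_math("Solve $x^2 - 4 = 0$ for $x$.")
# A real pipeline would call the LLM here; this stand-in just "translates"
# the surrounding words into German and leaves the placeholders alone.
translated = masked.replace("Solve", "Löse").replace("for", "für")
print(unmask_math(translated, formulas))  # Löse $x^2 - 4 = 0$ für $x$.
```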
Step 3: The Human Editors (The Final Polish)
This is the most important part. They hired native speakers (real humans who grew up speaking these languages) to review the AI's work.
- They checked: "Does this sound like a real math problem a student in Japan would see?"
- They caught subtle errors, like using the wrong word for "sequence" in German or formatting numbers incorrectly (e.g., using a comma instead of a dot for decimals).
- If a translation was bad, the AI tried again and the humans re-checked it, for up to five rounds, until it passed review.
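The translate-then-review loop above is a simple retry pattern. Here is a minimal sketch of it; both `translate` and `review` are stand-ins (assumptions) for the LLM call and the human reviewer, not real APIs.

```python
MAX_ROUNDS = 5  # matches the "up to five times" described above

def translate_with_review(problem, translate, review):
    """Retry translation until a reviewer accepts it or rounds run out.

    translate(problem, feedback) -> draft; review(draft) -> (ok, feedback).
    """
    feedback = None
    for _ in range(MAX_ROUNDS):
        draft = translate(problem, feedback)
        ok, feedback = review(draft)
        if ok:
            return draft
    return None  # still bad after five rounds: drop it or escalate

# Toy stand-ins: the "translator" produces a new draft each round, and the
# "reviewer" only accepts the third one.
drafts = []
def toy_translate(problem, feedback):
    drafts.append(1)
    return f"draft {len(drafts)}"

def toy_review(draft):
    return draft == "draft 3", "wrong word for 'sequence'"

print(translate_with_review("2 + 2 = ?", toy_translate, toy_review))  # draft 3
```

Returning `None` after five failed rounds mirrors the paper's description of a bounded review loop: a translation that never passes has to be dropped or escalated rather than retried forever.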
3. The Result: A Global Math Gym
The final product is a massive, high-quality gym for math robots.
- The Parallel Set: They have a "gold standard" set of 7,620 problems where every single problem exists in all 14 languages. This allows researchers to compare exactly how well a robot performs in French vs. Chinese using the exact same test.
- The Full Set: They also have over 10,000 problems per language, giving researchers plenty of data to train their models.
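The point of the parallel set is that the same problems exist in every language, so per-language scores are directly comparable. A toy sketch of that comparison (the scores below are made up, not results from the paper):

```python
# Each problem in the parallel set is answered in every language, so we
# can compute apples-to-apples accuracy per language.
results = {  # problem_id -> {language: did_the_model_answer_correctly}
    1: {"en": True,  "de": True,  "ja": False},
    2: {"en": True,  "de": False, "ja": False},
    3: {"en": False, "de": True,  "ja": True},
}

def accuracy_by_language(results):
    """Per-language accuracy over the same shared problems."""
    langs = next(iter(results.values())).keys()
    return {
        lang: sum(scores[lang] for scores in results.values()) / len(results)
        for lang in langs
    }

print(accuracy_by_language(results))
```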
4. Why This Matters
The paper tested several robots on this new dataset. They found that:
- Bigger, smarter robots generally did better.
- However, performance varied wildly by language. Some robots were great at English math but stumbled when the problem was in Thai or Portuguese.
- This dataset is now a benchmark. It's the new "standardized test" that researchers will use to see if their multilingual math robots are actually getting smarter.
The Big Picture Analogy
Think of the previous English-only datasets as a monolingual library. You could only learn math if you spoke English.
mAceReason-Math is like building a global university where the same advanced math curriculum is available in 14 different languages, with the same high-quality textbooks, the same difficult exams, and the same rigorous grading. Now, researchers can finally teach their AI students to think logically, no matter what language they speak.