MuRating: A High Quality Data Selecting Approach to Multilingual Large Language Model Pretraining

Imagine you are trying to teach a brilliant but very young student (an Artificial Intelligence) how to speak and understand the world. You have a massive library of books, websites, and articles in every language imaginable. However, this library is messy. It's full of typos, nonsense, spam, and low-quality content mixed in with the gold nuggets of knowledge.

If you just throw random books at the student, they might learn to speak gibberish or pick up bad habits. You need a librarian to sort through the pile and pick only the best books.

This paper introduces MuRating, a new, super-smart librarian designed specifically for a multilingual world (one that speaks many languages, not just English).

Here is how it works, broken down into simple steps:

1. The Problem: The "English-Only" Librarian

For a long time, the best librarians (AI models that judge text quality) only spoke English. They were great at picking out good English books but were useless for French, Japanese, or Swahili.

The Issue: If you want your AI student to speak 17 different languages, you can't just use an English-only librarian. You'd end up with a student who speaks perfect English but terrible Spanish.
The Old Way: People tried to build a separate librarian for every single language, but that's expensive, slow, and often leads to mistakes because there aren't enough "good examples" to teach them.

2. The Solution: The "Translator-Librarian" (MuRating)

The authors created a clever two-step trick to solve this. Think of it like this:

Step A: The Master Jury (English)
First, they gathered four of the best English-speaking librarians. Instead of asking them to give a score (like "8 out of 10"), they asked them to play a game of "This or That."

Librarian: "Is Text A better than Text B?"
Result: They voted. By comparing thousands of pairs, they created a single, super-reliable "Master Jury" that knows exactly what high-quality English text looks like.

Step B: The Translation Bridge
Here is the magic part. Instead of trying to teach a new librarian from scratch in 17 different languages, they took the Master Jury's decisions and translated them.

They took a pair of English texts (Text A vs. Text B) that the Master Jury agreed was "A is better."
They translated both texts into, say, Spanish.
They told the new system: "If Text A was better in English, then the Spanish version of A is better than the Spanish version of B."

They did this for three types of pairs:

Same Language: Spanish A vs. Spanish B.
Mixed Languages: Spanish A vs. French B (to teach the AI that quality is universal).
Parallel: English A vs. Spanish A (to teach the AI that the same idea in two languages should get the same score).

3. The Result: A Universal Quality Filter

The result is MuRater, a single AI model that can judge the quality of text in 17 languages without needing to be retrained from scratch for each one. It learned the "soul" of quality from English and applied it everywhere else.

4. Did it Work? (The Test Drive)

The researchers used MuRater to pick the top-scoring 10% of the data from the internet and used it to train two new AI students (one small, one big).

The Competition: They compared their students against others trained with random data or older filtering methods.
The Outcome: The students trained with MuRater's selection were smarter. They scored higher on tests for reading comprehension, logic, and general knowledge in both English and the other 17 languages.

The Big Takeaway

Think of MuRating as a universal translator of quality.

Before: You needed a different expert for every language to find good data.
Now: You have one expert who learned the rules of "goodness" in English and used a translator to apply those rules to the whole world.

This means we can build smarter, more inclusive AI that speaks many languages fluently, without needing to manually curate millions of examples for every single language. It's a more efficient, stable, and scalable way to choose what an AI learns from.

Here is a detailed technical summary of the paper "MuRating: A High Quality Data Selecting Approach to Multilingual Large Language Model Pretraining."

1. Problem Statement

While high-quality data selection is critical for Large Language Model (LLM) performance, existing model-based selection methods (e.g., QuRater, DCLM, AskLLM) are almost exclusively designed for English. They fail to address the unique challenges of multilingual pretraining, such as:

Lack of Multilingual Raters: Most existing raters cannot evaluate non-English text effectively.
Test Set Contamination: Some recent attempts (e.g., FineWeb2-HQ) rely on benchmark datasets for training raters, risking data leakage into downstream evaluations.
Instability in Translation: Pointwise scoring (assigning absolute scores to individual texts) is highly sensitive to translation artifacts, tone shifts, and phrasing variations, leading to inconsistent quality signals across languages.
Need for Unified Framework: There is a lack of a principled framework that can aggregate English quality signals and transfer them robustly to a multilingual setting without manual heuristics.

2. Methodology: MuRating Framework

MuRating is a two-stage, translation-and-pairwise framework designed to create a scalable, language-agnostic data quality evaluator for 17 languages.

Stage 1: Unified English Rater Aggregation

Instead of relying on a single English rater, MuRating aggregates judgments from four distinct state-of-the-art English raters:

QuRating (Educational value focus)
AskLLM (Prompt-based evaluation)
FineWeb-Edu (Classifier-based)
DCLM (FastText classifier)

Process: The system samples pairs of English documents $(t_A, t_B)$. Each rater assigns a score. A binary preference is derived ($t_A \succ t_B$ if $Score_A > Score_B$).
Aggregation: A Bradley-Terry model is trained on these pairwise preferences to learn a single, unified quality scorer ($s_\theta$). This model minimizes a binary cross-entropy loss, effectively creating a robust "meta-rater" that captures the consensus of the four base raters.
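
The preference and loss described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function names, the example scores, the tie handling, and the choice to average the loss over the four raters' votes are all assumptions made for the example.

```python
import math

def rater_votes(scores_a, scores_b):
    """One binary vote per rater: 1 if it scores t_A above t_B, else 0.
    Ties count as 0 in this sketch (an assumption)."""
    return [1 if a > b else 0 for a, b in zip(scores_a, scores_b)]

def bt_probability(s_a, s_b):
    """Bradley-Terry probability that t_A is preferred, given scalar scores."""
    return 1.0 / (1.0 + math.exp(-(s_a - s_b)))

def bt_loss(s_a, s_b, label):
    """Binary cross-entropy between a 0/1 label and the Bradley-Terry probability."""
    p = bt_probability(s_a, s_b)
    eps = 1e-12  # guards against log(0)
    return -(label * math.log(p + eps) + (1 - label) * math.log(1 - p + eps))

# Illustrative (made-up) scores from four raters for one document pair.
votes = rater_votes([0.9, 0.7, 0.4, 0.8], [0.2, 0.5, 0.6, 0.3])

# Mean loss over the four votes for two candidate scorer outputs:
# one that ranks t_A higher (matching the majority) and one that ranks it lower.
agrees = sum(bt_loss(2.0, 0.0, v) for v in votes) / len(votes)
disagrees = sum(bt_loss(0.0, 2.0, v) for v in votes) / len(votes)
print(votes, agrees < disagrees)
```

A scorer that sides with the majority of raters incurs the lower loss, which is the sense in which the trained model captures their consensus.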

Stage 2: Translation-Based Multilingual Transfer

The unified English scorer is projected into a multilingual setting using a translation strategy rather than training separate raters for each language.

Data Construction:
- Monolingual Pairs: English pairs $(t_A, t_B)$ are translated into a target language $m$ to form $(t_A^m, t_B^m)$. The original preference label is preserved.
- Cross-Lingual Pairs: $t_A$ is translated to language $m$, and $t_B$ to language $m'$ ($m \neq m'$). The preference is transferred assuming $t_A^m \succ t_B^{m'}$.
- Parallel Pairs: Two versions of the same text $t_A$ in different languages, $t_A^m$ and $t_A^{m'}$ ($m \neq m'$), form a pair $(t_A^m, t_A^{m'})$. These are assigned a neutral label (preference probability = 0.5) to enforce that semantically identical content receives equal scores regardless of language.
Training Objective: The final loss function combines the pairwise loss (for monolingual and cross-lingual pairs) with a regularization term for parallel pairs:
$L_{total} = L_{pairwise} + \lambda \cdot L_{parallel}$
This ensures the model learns language-invariant quality standards while maintaining sensitivity to relative quality differences.
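
A rough Python sketch of the pair construction and the combined objective follows. The `translate` function is a hypothetical stand-in that only tags text with a language code, and treating $L_{parallel}$ as cross-entropy against a 0.5 target is an assumption based on the neutral label described above; the paper's exact form of the regularizer may differ.

```python
import math
import random

def translate(text, lang):
    """Hypothetical stand-in for a machine-translation call; it only tags the text."""
    return f"[{lang}] {text}"

def build_pairs(t_a, t_b, label, langs, rng):
    """Expand one labeled English pair into the three pair types."""
    m, m_prime = rng.sample(langs, 2)  # two distinct target languages
    return {
        "monolingual": (translate(t_a, m), translate(t_b, m), label),
        "cross_lingual": (translate(t_a, m), translate(t_b, m_prime), label),
        "parallel": (translate(t_a, m), translate(t_a, m_prime), 0.5),
    }

def bce(s_a, s_b, target):
    """Cross-entropy between a target preference and sigmoid(s_a - s_b)."""
    p = 1.0 / (1.0 + math.exp(-(s_a - s_b)))
    return -(target * math.log(p) + (1.0 - target) * math.log(1.0 - p))

def total_loss(pairwise, parallel, lam):
    """L_total = L_pairwise + lam * L_parallel.
    Both inputs are non-empty lists of (s_a, s_b, target) triples."""
    l_pairwise = sum(bce(*x) for x in pairwise) / len(pairwise)
    l_parallel = sum(bce(*x) for x in parallel) / len(parallel)
    return l_pairwise + lam * l_parallel

pairs = build_pairs("a clear tutorial", "keyword spam", 1,
                    ["es", "fr", "de"], random.Random(0))

# Equal scores on a parallel pair are penalized less than unequal ones.
consistent = total_loss([(2.0, 0.0, 1)], [(1.0, 1.0, 0.5)], lam=1.0)
inconsistent = total_loss([(2.0, 0.0, 1)], [(3.0, 1.0, 0.5)], lam=1.0)
print(consistent < inconsistent)
```

With a 0.5 target, the parallel term is smallest when both translations receive the same score, which is what pushes the scorer toward language-invariant judgments.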

Model Architecture

Base Model: BGE-M3 (an encoder-based model with strong multilingual representation capabilities).
Head: A linear regression head is added to predict the scalar quality score.
Training Data: 300k English pairs annotated by the four raters, expanded into 150k monolingual, 150k cross-lingual, and 75k parallel pairs across 17 languages, with GPT-4o used for translation.

3. Key Contributions

Unified English Aggregation: Consolidated four distinct English raters into a single, robust Bradley-Terry based scorer, outperforming individual raters.
Translation-Based Multilingual Transfer: Demonstrated that projecting English pairwise judgments into monolingual, cross-lingual, and parallel pairs allows for effective language-agnostic quality evaluation without needing language-specific training data.
Pairwise vs. Pointwise Superiority: Showed empirically that pairwise training is more stable and more robust to translation noise than pointwise scoring, which suffers from high variance in absolute scores across languages.
Scalable Pretraining Gains: Validated the approach on LLaMA-architecture models (1.2B and 7B parameters), showing consistent improvements over strong baselines.

4. Experimental Results

The authors pre-trained LLaMA models (1.2B and 7B) on a corpus of 1.5T English tokens and 3T multilingual tokens (17 languages), selecting the top 10% of data based on MuRating scores.
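
The top-10% selection step can be sketched as a simple rank-and-cut over scored documents. The function name and the toy data are illustrative, and the sketch assumes one shared pool; whether the cut is applied per language or globally is not stated here.

```python
def select_top_fraction(docs, scores, fraction=0.10):
    """Keep the highest-scoring `fraction` of documents (at least one)."""
    if len(docs) != len(scores):
        raise ValueError("docs and scores must have the same length")
    k = max(1, round(len(docs) * fraction))
    # Rank document indices by score, best first.
    ranked = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
    return [docs[i] for i in ranked[:k]]

# Toy pool of 20 documents with made-up quality scores.
docs = [f"doc{i}" for i in range(20)]
scores = [i / 20 for i in range(20)]
selected = select_top_fraction(docs, scores)
print(selected)
```

At corpus scale the same idea would typically be implemented as a score threshold computed from a sample rather than a full sort.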

Multilingual Benchmarks (18 Languages):
- MuRater(E) (English-anchored training) achieved the highest average score (50.96) across 11 tasks, outperforming the next best baseline (QuRater-M at 49.19) and FineWeb2-HQ (48.97).
- Gains were particularly notable in reasoning tasks (ARC-Challenge, MMLU), suggesting the selected data has deeper conceptual structure.
- MuRater(E) outperformed MuRater(M) (which translates multilingual data to English for scoring), confirming that starting with high-quality English pairs and translating them out is more effective than the reverse.
English-Only Benchmarks:
- MuRater achieved an average score of 51.23 across 12 tasks, outperforming baselines such as DCLM (50.23, a 1.00-point gap) and QuRater (48.33, a 2.90-point gap); reported margins over baselines range from 1 to 3.4 points.
- It demonstrated robustness across diverse task categories (Reading Comprehension, Commonsense Reasoning, World Knowledge), whereas other baselines showed high variance (e.g., DCLM excelled at HellaSwag but underperformed on ARC-Challenge).
Ablation Studies:
- Cross-Lingual/Parallel Pairs: Removing these pairs increased the Mean Squared Error (MSE) between scores of parallel texts, indicating that they matter for language-agnostic consistency.
- Translation Quality: Even when a weaker translation model was used (Qwen3-8B instead of GPT-4o), the MuRater trained on the resulting pairs produced scores that correlated highly with those of the GPT-4o-based rater (Pearson > 0.98), indicating robustness to moderate variation in translation quality.
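
Both ablation metrics are standard and easy to reproduce on any pair of score lists. The sketch below uses made-up numbers purely to show the computation; the variable names are hypothetical and nothing here comes from the paper's data.

```python
import math

def mse(xs, ys):
    """Mean squared error between two equal-length score lists."""
    return sum((x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)

# Made-up scores for the same texts in two languages (parallel consistency).
scores_lang_a = [0.8, 0.5]
scores_lang_b = [0.8, 0.7]
parallel_gap = mse(scores_lang_a, scores_lang_b)

# Made-up scores from two raters trained on differently translated pairs.
scores_rater_1 = [1.0, 2.0, 3.0, 4.0]
scores_rater_2 = [1.1, 1.9, 3.2, 3.9]
agreement = pearson(scores_rater_1, scores_rater_2)
print(parallel_gap, agreement)
```

A lower MSE on parallel texts means more consistent scoring across languages, and a Pearson value near 1 means the two raters rank and space documents almost identically.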

5. Significance

Bridging the Multilingual Gap: MuRating provides a scalable solution for high-quality data selection in non-English contexts, addressing a critical bottleneck in multilingual LLM development.
Cost-Effective: By leveraging existing English raters and translation, it avoids the prohibitive cost of training separate raters for every language or relying on expensive human annotation for multilingual data.
Robustness to Noise: The pairwise approach mitigates the instability caused by translation artifacts, a common failure point in previous multilingual data selection attempts.
Generalizability: The framework is model-agnostic regarding the base LLM (tested on 1.2B and 7B) and the underlying rater architecture, making it a versatile tool for future multilingual pretraining pipelines.

In conclusion, MuRating establishes a new standard for multilingual data curation, demonstrating that aggregating English quality signals through a pairwise, translation-based framework yields superior model performance compared to both uniform sampling and existing state-of-the-art selection methods.