Imagine you are trying to teach a super-smart robot to speak every language on Earth. For a long time, researchers believed there was a fundamental problem: the robot's brain was too small. They thought that if you taught the robot Spanish, it would forget some French; if you taught it Hindi, it might get worse at English. They called this the "Curse of Multilinguality"—the idea that teaching a machine many languages is a zero-sum game where adding one language hurts the others.
This paper, by a team called DatologyAI, says: "Actually, the robot's brain is fine. The problem is the food we're feeding it."
Here is the simple breakdown of their discovery, using some everyday analogies:
1. The Problem: Garbage In, Garbage Out
Imagine you are trying to learn to cook.
- The Old Way: You try to learn 13 different cuisines (Italian, Thai, Mexican, etc.) at the same time. But the only recipe books you have are full of typos, torn pages, and confusing instructions. You end up burning everything.
- The Result: You blame your brain, thinking, "I just can't handle learning 13 cuisines at once."
- The Paper's Insight: The paper argues that the issue isn't your brain (the AI model); it's the bad recipe books (the data). The internet is full of messy, low-quality text in many languages. If you feed a robot messy data, it gets confused and performs poorly, regardless of how big its brain is.
2. The Solution: The "Curated Kitchen"
The team didn't just dump more data into the robot. Instead, they acted like a strict food critic or a curator.
- They went through the internet and picked out only the best, cleanest, most accurate sentences for each of the 13 languages they studied (like Spanish, Hindi, Arabic, Japanese, etc.).
- They threw away the spam, the errors, and the nonsense.
- The Analogy: Instead of giving the robot a giant, dirty pile of random ingredients, they gave it a small, perfectly organized basket of the freshest, highest-quality ingredients for each specific dish.
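To make the "strict food critic" idea concrete, here is a toy sketch of what a sentence-level quality filter might look like. The heuristics (length bounds, symbol ratio, repetition) are made-up stand-ins for illustration; the paper's actual curation pipeline is far more sophisticated.

```python
# Toy quality filter: keep only "clean" sentences, discard noisy ones.
# The specific thresholds here are illustrative, not from the paper.

def looks_clean(sentence: str) -> bool:
    words = sentence.split()
    if not (3 <= len(words) <= 80):          # too short or too long
        return False
    alpha = sum(ch.isalpha() for ch in sentence)
    if alpha / max(len(sentence), 1) < 0.6:  # mostly symbols/digits = spam
        return False
    if len(set(words)) / len(words) < 0.5:   # heavy repetition = boilerplate
        return False
    return True

corpus = [
    "La ciencia avanza cuando compartimos buenas ideas.",
    "$$$ CLICK HERE !!! 1234 %%%",
    "spam spam spam spam spam spam",
]
curated = [s for s in corpus if looks_clean(s)]
print(curated)  # only the first, clean sentence survives
```

Real pipelines layer many such signals (language identification, deduplication, model-based quality scores), but the principle is the same: throw away most of the pile and keep the fresh ingredients.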
3. The Magic Trick: "One Good Meal Helps All"
Here is the most surprising part of their discovery. They found that quality is contagious.
- The Experiment: They took a robot and fed it high-quality English data (the "main course") but kept the other languages messy.
- The Result: Even though the other languages were messy, the robot got better at them too!
- Why? Think of it like learning a language family. If you really understand the grammar and logic of English (the "root"), it helps you understand the structure of Spanish or German, even if your Spanish textbooks are a bit messy.
- The Reverse: When they cleaned up the non-English data, the robot actually got better at English too. It turns out that high-quality data in any language acts like a "super-signal" that helps the whole brain work better.
4. Translation: Don't Just Translate "Anything"
Many people try to fix the lack of data in rare languages by just translating English text into those languages.
- The Bad Way: Taking a random, low-quality English blog post and translating it into Hindi. This is like translating a recipe written on a napkin with coffee stains. It doesn't help much.
- The Good Way: Taking a perfectly curated, high-quality English article and translating it. This is like translating a Michelin-star recipe.
- The Verdict: Translation works, but only if the original source is excellent. And even then, the best results come from curating the target language directly, not just translating.
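The "curate first, then translate" order can be sketched in a few lines. Both `translate` and the quality scorer below are hypothetical placeholders (a real system would call an MT model and a learned quality classifier); the point is simply that filtering happens before translation, not after.

```python
# Sketch of the "good way": curate the English source *before* translating.

def translate(text: str, target_lang: str) -> str:
    # Placeholder: a real system would call a machine-translation model here.
    return f"[{target_lang}] {text}"

def build_translated_corpus(english_docs, target_lang, quality_score, threshold=0.8):
    # Bad way: translate everything, napkin stains included.
    # Good way: only translate documents that pass a strict quality bar.
    curated = [d for d in english_docs if quality_score(d) >= threshold]
    return [translate(d, target_lang) for d in curated]

docs = ["A well-edited, informative article.", "lol random napkin scribbles 111"]
# Toy scorer for the demo: pretend curated-looking text scores higher.
score = lambda d: 0.9 if "article" in d else 0.2
print(build_translated_corpus(docs, "hi", score))
```

The low-quality blog post never reaches the translator, so no compute is wasted producing a messy Hindi copy of it.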
5. The Grand Result: Doing More with Less
The team built a massive dataset (20 trillion words) using these strict curation rules.
- The Comparison: Other companies are training their robots on huge amounts of data, using massive amounts of electricity (compute), and still getting mediocre results in non-English languages.
- The DatologyAI Robot: Their robot was trained on less data (only about 8% of the total was non-English, and the total training run was much smaller than those of competitors) but used high-quality curated data.
- The Outcome: Their smaller, cheaper robots beat competitors' giant, expensive ones on multilingual tasks, achieving the same (or better) intelligence with 4 to 10 times less computing power.
The Big Picture
The paper concludes that multilingual AI isn't a curse of capacity; it's a challenge of curation.
If you want a robot that speaks the world's languages well, you don't necessarily need a bigger brain or more electricity. You just need to be a better editor. By carefully selecting the best data for every single language, you can build a robot that is smarter, cheaper, and more inclusive, finally making the "future" (where AI speaks everyone's language) available to everyone, not just English speakers.
In short: It's not about how much you feed the AI; it's about how good the food is.