TildeOpen LLM: Leveraging Curriculum Learning to Achieve Equitable Language Representation

This paper introduces TildeOpen LLM, a 30-billion-parameter open-weight model that achieves superior performance across 34 European languages, particularly for low-resource groups, by employing curriculum learning and dataset upsampling to address data imbalances without requiring increased computational resources.

Toms Bergmanis, Martins Kronis, Ingus Jānis Pretkalninš, Dāvis Nicmanis, Jelizaveta Jelinska, Roberts Rozis, Rinalds Vīksna, Mārcis Pinnis

Published Tue, 10 Ma

Imagine you are building a giant, super-smart library that can speak 34 different European languages. The goal of this paper is to fix a huge problem: most of these "smart libraries" (called Large Language Models or LLMs) are like snobs. They speak English perfectly, and they speak big languages like French or German well, but they stumble over smaller languages like Latvian, Estonian, or Lithuanian.

Why? Because the internet is mostly written in English. When you train a robot to learn by reading the web, it ends up reading 90% English and only a tiny bit of everything else. It's like trying to learn to cook by only reading Italian recipes; you'll become an Italian chef, but you'll have no idea how to make a traditional Latvian rye bread.

Here is how the team at Tilde (a Latvian tech company) built TildeOpen, a new library that treats every language with equal respect, using a few clever tricks.

1. The "Fairness" Dictionary (The Tokenizer)

First, they had to build a dictionary. In AI, words are chopped up into tiny pieces called "tokens."

  • The Problem: In standard dictionaries, a simple sentence in English might take 5 tokens, but the same sentence in Estonian might take 15 tokens. This makes the AI "stutter" and spend too much brainpower just to say a few words in smaller languages.
  • The Fix: They redesigned the dictionary so that a sentence in Latvian takes roughly the same number of tokens as a sentence in English. It's like giving everyone the same size shoe, regardless of their foot shape, so everyone can run at the same speed.
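The "same shoe size" idea is usually measured as tokenizer *fertility*: the average number of tokens needed per word. Below is a minimal sketch of that metric, assuming a toy chunk-based tokenizer as a stand-in — it is not the paper's actual tokenizer, just an illustration of why a vocabulary tuned only for English inflates token counts for morphologically rich languages.

```python
def fertility(tokenize, sentences):
    """Average number of tokens produced per whitespace-separated word.

    Lower fertility means the tokenizer represents the language more
    compactly; high fertility wastes context window and slows generation.
    """
    total_tokens = sum(len(tokenize(s)) for s in sentences)
    total_words = sum(len(s.split()) for s in sentences)
    return total_tokens / total_words

def toy_tokenize(text, chunk=4):
    # Stand-in tokenizer: splits every word into chunks of up to 4 characters,
    # mimicking a subword vocabulary with no knowledge of the language.
    return [w[i:i + chunk] for w in text.split() for i in range(0, len(w), chunk)]

en = ["the cat sat on the mat"]
lv = ["kaķis sēdēja uz paklāja"]  # Latvian words are longer and inflected

print(fertility(toy_tokenize, en))  # → 1.0
print(fertility(toy_tokenize, lv))  # → 1.75
```

The fix described in the paper amounts to training the vocabulary so that fertility is roughly equal across all 34 languages, rather than minimal for English alone.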

2. The "Curriculum" School Schedule (Training Strategy)

This is the paper's biggest innovation. Usually, AI training is like a chaotic buffet where the AI eats whatever is on the table. Since there is way more English food, the AI eats mostly English.

Tilde decided to run a school curriculum instead, with three distinct phases:

  • Phase 1 (The Morning Class): They force the AI to eat a perfectly balanced meal. It gets equal portions of Latvian, Finnish, Polish, and English. This ensures it learns the basics of every language equally.
  • Phase 2 (The Lunch Break): Here, they let the AI eat naturally. It gets a huge feast of English and German (because there's so much data available), but it still gets a decent side dish of the smaller languages. This helps the AI learn complex concepts and vast knowledge.
  • Phase 3 (The Final Exam): They go back to the balanced meal. They force the AI to practice the smaller languages again right before it graduates. This "polishes" its skills so it doesn't forget them.
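The three phases boil down to changing the per-language sampling weights over the course of training: uniform, then natural (data-proportional), then uniform again. The sketch below illustrates that schedule; the phase names and exact weight formulas are assumptions for illustration, since the paper describes the strategy rather than publishing this code.

```python
def phase_weights(natural_counts, phase):
    """Per-language sampling probabilities for one training phase.

    natural_counts: dict mapping language code -> amount of available data.
    Phases 1 and 3 sample every language equally; phase 2 samples in
    proportion to how much data actually exists.
    """
    langs = list(natural_counts)
    if phase in ("balanced_start", "balanced_end"):
        return {lang: 1.0 / len(langs) for lang in langs}
    total = sum(natural_counts.values())
    return {lang: count / total for lang, count in natural_counts.items()}

# Hypothetical data volumes (arbitrary units), heavily skewed toward English.
counts = {"en": 900, "de": 300, "lv": 10, "et": 8}

print(phase_weights(counts, "balanced_start"))   # every language gets 0.25
print(phase_weights(counts, "natural_middle"))   # English dominates
```

In phase 1 the model sees as much Latvian as English; in phase 2 English dominates, matching the web's natural distribution; phase 3 restores the balanced diet before training ends.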

3. The "Upsampling" Trick

For languages that are very rare online (like some Baltic languages), there just isn't enough food.

  • The Fix: They took the existing text for these rare languages and "photocopied" it a few times (up to 2.5 times). It's like if you only had one recipe for a rare dish, you'd read it, then read it again, and again, to make sure you memorized every spice. This helps the AI learn the rare languages without needing to find new data that doesn't exist yet.
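Upsampling by a factor of 2.5 means every document is seen at least twice, and half of them a third time. Here is a minimal sketch of that idea, assuming a simple "repeat the corpus, then take a prefix" scheme; which documents land in the fractional extra pass (and any shuffling) is an implementation detail the paper does not specify.

```python
import math

def upsample(docs, factor, cap=2.5):
    """Repeat a small corpus up to `cap` times (2.5x, per the paper).

    A factor of 2.0 yields two full passes; a factor of 2.5 yields two
    full passes plus the first half of the corpus a third time.
    """
    factor = min(factor, cap)           # never exceed the 2.5x ceiling
    whole = int(factor)                 # number of complete passes
    frac = factor - whole               # fraction of docs for the extra pass
    return docs * whole + docs[: math.ceil(len(docs) * frac)]

rare = [f"doc{i}" for i in range(4)]
print(len(upsample(rare, 2.5)))  # → 10 (4*2 full passes + 2 extra)
```

The cap matters: repeating tiny corpora too many times makes the model memorize rather than generalize, so 2.5x is a deliberate compromise between coverage and overfitting.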

4. Cleaning the "Garbage" (Data Filtering)

The internet is full of junk. For this project, they had to be very careful with Russian-language data.

  • The Problem: A lot of Russian content on the web is state-sponsored propaganda (fake news, hate speech, or political manipulation). If the AI reads this, it might start believing lies or repeating hate speech.
  • The Fix: They acted like a strict librarian. They removed entire websites known for spreading propaganda and filtered out specific topics (like war or political manipulation) from the remaining text. They wanted the AI to learn the Russian language, not the state's propaganda.

The Results: A Fairer AI

When they tested TildeOpen:

  • It's Smarter: Even though they trained it on less data than other giant models (using fewer computer hours), it performed better at writing and understanding text in Baltic, Slavic, and Finno-Ugric languages.
  • Fewer Mistakes: When humans checked the writing, TildeOpen made up to 10 times fewer mistakes than other leading models for languages like Latvian and Estonian.
  • Open Source: They didn't hide their work. They put the model on Hugging Face (a public library for AI) so anyone can use it for free.

The Bottom Line

Think of this paper as a manifesto for linguistic democracy. The authors proved that you don't need a billion dollars or a supercomputer the size of a city to build a great AI. You just need to be smart about how you feed it data. By treating small languages with the same respect as big ones, they built a model that is fairer, more accurate, and better for everyone in Europe.