TildeOpen LLM: Leveraging Curriculum Learning to Achieve Equitable Language Representation

This paper introduces TildeOpen LLM, a 30-billion-parameter open-weight model that achieves superior performance across 34 European languages, particularly for low-resource groups, by employing curriculum learning and dataset upsampling to address data imbalances without requiring increased computational resources.

Toms Bergmanis, Martins Kronis, Ingus Jānis Pretkalninš, Dāvis Nicmanis, Jelizaveta Jelinska, Roberts Rozis, Rinalds Vīksna, Mārcis Pinnis

Published Tue, 10 Ma

Imagine you are building a giant, super-smart library that can speak 34 different European languages. The goal of this paper is to fix a huge problem: most of these "smart libraries" (called Large Language Models or LLMs) are like snobs. They speak English perfectly, and they speak big languages like French or German well, but they stumble over smaller languages like Latvian, Estonian, or Lithuanian.

Why? Because the internet is mostly written in English. When you train a robot to learn by reading the web, it ends up reading 90% English and only a tiny bit of everything else. It's like trying to learn to cook by only reading Italian recipes; you'll become an Italian chef, but you'll have no idea how to make a traditional Latvian rye bread.

Here is how the team at Tilde (a Latvian tech company) built TildeOpen, a new library that treats every language with equal respect, using a few clever tricks.

1. The "Fairness" Dictionary (The Tokenizer)

First, they had to build a dictionary. In AI, words are chopped up into tiny pieces called "tokens."

  • The Problem: In standard dictionaries, a simple sentence in English might take 5 tokens, but the same sentence in Estonian might take 15 tokens. This makes the AI "stutter" and spend too much brainpower just to say a few words in smaller languages.
  • The Fix: They redesigned the dictionary so that a sentence in Latvian takes roughly the same number of tokens as a sentence in English. It's like giving everyone the same size shoe, regardless of their foot shape, so everyone can run at the same speed.
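The "same shoe size" idea is usually measured as tokenizer *fertility*: the average number of tokens needed per word. Below is a minimal sketch of that metric, assuming a toy chunk-based tokenizer as a stand-in — it is not the paper's actual tokenizer, just an illustration of why a vocabulary tuned only for English inflates token counts for morphologically rich languages.

```python
def fertility(tokenize, sentences):
    """Average number of tokens produced per whitespace-separated word.

    Lower fertility means the tokenizer represents the language more
    compactly; high fertility wastes context window and slows generation.
    """
    total_tokens = sum(len(tokenize(s)) for s in sentences)
    total_words = sum(len(s.split()) for s in sentences)
    return total_tokens / total_words

def toy_tokenize(text, chunk=4):
    # Stand-in tokenizer: splits every word into chunks of up to 4 characters,
    # mimicking a subword vocabulary with no knowledge of the language.
    return [w[i:i + chunk] for w in text.split() for i in range(0, len(w), chunk)]

en = ["the cat sat on the mat"]
lv = ["kaķis sēdēja uz paklāja"]  # Latvian words are longer and inflected

print(fertility(toy_tokenize, en))  # → 1.0
print(fertility(toy_tokenize, lv))  # → 1.75
```

The fix described in the paper amounts to training the vocabulary so that fertility is roughly equal across all 34 languages, rather than minimal for English alone.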

2. The "Curriculum" School Schedule (Training Strategy)

This is the paper's biggest innovation. Usually, AI training is like a chaotic buffet where the AI eats whatever is on the table. Since there is way more English food, the AI eats mostly English.

Tilde decided to run a school curriculum instead, with three distinct phases:

  • Phase 1 (The Morning Class): They force the AI to eat a perfectly balanced meal. It gets equal portions of Latvian, Finnish, Polish, and English. This ensures it learns the basics of every language equally.
  • Phase 2 (The Lunch Break): Here, they let the AI eat naturally. It gets a huge feast of English and German (because there's so much data available), but it still gets a decent side dish of the smaller languages. This helps the AI learn complex concepts and vast knowledge.
  • Phase 3 (The Final Exam): They go back to the balanced meal. They force the AI to practice the smaller languages again right before it graduates. This "polishes" its skills so it doesn't forget them.
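The three phases boil down to changing the per-language sampling weights over the course of training: uniform, then natural (data-proportional), then uniform again. The sketch below illustrates that schedule; the phase names and exact weight formulas are assumptions for illustration, since the paper describes the strategy rather than publishing this code.

```python
def phase_weights(natural_counts, phase):
    """Per-language sampling probabilities for one training phase.

    natural_counts: dict mapping language code -> amount of available data.
    Phases 1 and 3 sample every language equally; phase 2 samples in
    proportion to how much data actually exists.
    """
    langs = list(natural_counts)
    if phase in ("balanced_start", "balanced_end"):
        return {lang: 1.0 / len(langs) for lang in langs}
    total = sum(natural_counts.values())
    return {lang: count / total for lang, count in natural_counts.items()}

# Hypothetical data volumes (arbitrary units), heavily skewed toward English.
counts = {"en": 900, "de": 300, "lv": 10, "et": 8}

print(phase_weights(counts, "balanced_start"))   # every language gets 0.25
print(phase_weights(counts, "natural_middle"))   # English dominates
```

In phase 1 the model sees as much Latvian as English; in phase 2 English dominates, matching the web's natural distribution; phase 3 restores the balanced diet before training ends.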

3. The "Upsampling" Trick

For languages that are very rare online (like some Baltic languages), there just isn't enough food.

  • The Fix: They took the existing text for these rare languages and "photocopied" it a few times (up to 2.5 times). It's like if you only had one recipe for a rare dish, you'd read it, then read it again, and again, to make sure you memorized every spice. This helps the AI learn the rare languages without needing to find new data that doesn't exist yet.
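Upsampling by a factor of 2.5 means every document is seen at least twice, and half of them a third time. Here is a minimal sketch of that idea, assuming a simple "repeat the corpus, then take a prefix" scheme; which documents land in the fractional extra pass (and any shuffling) is an implementation detail the paper does not specify.

```python
import math

def upsample(docs, factor, cap=2.5):
    """Repeat a small corpus up to `cap` times (2.5x, per the paper).

    A factor of 2.0 yields two full passes; a factor of 2.5 yields two
    full passes plus the first half of the corpus a third time.
    """
    factor = min(factor, cap)           # never exceed the 2.5x ceiling
    whole = int(factor)                 # number of complete passes
    frac = factor - whole               # fraction of docs for the extra pass
    return docs * whole + docs[: math.ceil(len(docs) * frac)]

rare = [f"doc{i}" for i in range(4)]
print(len(upsample(rare, 2.5)))  # → 10 (4*2 full passes + 2 extra)
```

The cap matters: repeating tiny corpora too many times makes the model memorize rather than generalize, so 2.5x is a deliberate compromise between coverage and overfitting.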

4. Cleaning the "Garbage" (Data Filtering)

The internet is full of junk. For this project, they had to be very careful with Russian-language data.

  • The Problem: A lot of Russian content on the web is state-sponsored propaganda (fake news, hate speech, or political manipulation). If the AI reads this, it might start believing lies or repeating hate speech.
  • The Fix: They acted like a strict librarian. They removed entire websites known for spreading propaganda and filtered out specific topics (like war or political manipulation) from the remaining text. They wanted the AI to learn the Russian language, not the state's propaganda.

The Results: A Fairer AI

When they tested TildeOpen:

  • It's Smarter: Even though they trained it on less data than other giant models (using fewer computer hours), it performed better at writing and understanding text in Baltic, Slavic, and Finno-Ugric languages.
  • Fewer Mistakes: When humans checked the writing, TildeOpen made up to 10 times fewer mistakes than other leading models for languages like Latvian and Estonian.
  • Open Source: They didn't hide their work. They put the model on Hugging Face (a public library for AI) so anyone can use it for free.

The Bottom Line

Think of this paper as a manifesto for linguistic democracy. The authors proved that you don't need a billion dollars or a supercomputer the size of a city to build a great AI. You just need to be smart about how you feed it data. By treating small languages with the same respect as big ones, they built a model that is fairer, more accurate, and better for everyone in Europe.