Advancing Polish Language Modeling through Tokenizer… — Plain-Language Explanation

This is an AI-generated explanation of the paper below. It is not written or endorsed by the authors. For technical accuracy, refer to the original paper.

Imagine you have a brilliant, multilingual librarian named Bielik. This librarian is incredibly smart, knows a lot about the world, and can chat in many languages. However, there's a problem: Bielik was originally trained using a "universal dictionary" designed for the whole world.

When Bielik tries to speak Polish, this universal dictionary is inefficient. It's like trying to write a complex Polish sentence using a dictionary where every single letter is a separate word. To say "I am going to the store," Bielik might have to use 10 tiny fragments instead of a few whole words. This makes Bielik slow, expensive to run, and limits how much he can remember at once (his "context window").

This paper is the story of how the creators gave Bielik a custom-made Polish dictionary to make him faster, smarter, and more efficient, without making him forget everything else he knows.

Here is the breakdown of their journey, using simple analogies:

1. The Problem: The "Universal Dictionary" Glitch

Think of the original tokenizer (the dictionary Bielik uses) as a Swiss Army Knife. It's great because it has a screwdriver, a blade, and a corkscrew for everyone. But if you are a Polish chef trying to chop onions, you don't need a corkscrew; you need a sharp, specialized knife.

The Issue: The Swiss Army Knife (universal tokenizer) breaks Polish words into too many tiny pieces. The average number of pieces per word is called the tokenizer's "fertility," and on Polish text the universal tokenizer's fertility is high. It's like trying to fill a swimming pool with a teaspoon instead of a hose: you waste time and energy just moving the water.
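The "fertility" idea can be measured directly: count the pieces and divide by the number of words. Below is a minimal sketch of that calculation. Both tokenizers and the three-word vocabulary are made-up stand-ins for illustration, not the real universal tokenizer or APT4.

```python
def fertility(words, tokenize):
    """Average number of tokens produced per word."""
    total_tokens = sum(len(tokenize(w)) for w in words)
    return total_tokens / len(words)


def universal_tokenize(word):
    # Hypothetical "universal" tokenizer: it only knows 2-character chunks.
    return [word[i:i + 2] for i in range(0, len(word), 2)]


# Hypothetical specialized vocabulary (three Polish words, for illustration).
POLISH_VOCAB = {"idę", "do", "sklepu"}


def polish_tokenize(word):
    # Keep a word whole if the vocabulary has it, else fall back to chunks.
    return [word] if word in POLISH_VOCAB else universal_tokenize(word)


sentence = "idę do sklepu".split()  # "I am going to the store"
print(fertility(sentence, universal_tokenize))  # 6 tokens / 3 words = 2.0
print(fertility(sentence, polish_tokenize))     # 3 tokens / 3 words = 1.0
```

A lower number means fewer pieces per word, which is what a language-specific tokenizer is aiming for.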

2. The Solution: The "Custom Polish Knife" (APT4 Tokenizer)

The team created a new dictionary specifically for Polish, called APT4.

The Magic: Instead of breaking a Polish word into 3 or 4 tiny pieces, this new dictionary keeps many common Polish words whole, or splits them into far fewer pieces.
The Result: Bielik can now "think" in Polish twice as fast. He uses half the memory to say the same thing, and he can remember twice as much of the conversation at once. It's like upgrading from a bicycle to a sports car for Polish tasks, while still being able to drive on English roads.
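Why fewer pieces per word means a longer memory can be shown with simple arithmetic: the context window is a fixed budget of tokens, so halving the tokens per word roughly doubles the words that fit. The numbers below (window size and tokens per word) are hypothetical values chosen for illustration, not figures from the paper.

```python
def words_that_fit(context_tokens, tokens_per_word):
    """How many words fit in a fixed token budget at a given fertility."""
    return int(context_tokens // tokens_per_word)


CONTEXT_TOKENS = 8192  # hypothetical context window size, in tokens
OLD_FERTILITY = 3.0    # hypothetical tokens per Polish word, old tokenizer
NEW_FERTILITY = 1.5    # hypothetical tokens per Polish word, new tokenizer

old_capacity = words_that_fit(CONTEXT_TOKENS, OLD_FERTILITY)
new_capacity = words_that_fit(CONTEXT_TOKENS, NEW_FERTILITY)

print(old_capacity)                 # words that fit with the old tokenizer
print(new_capacity)                 # words that fit with the new tokenizer
print(new_capacity / old_capacity)  # close to 2
```

The same reasoning applies to speed and cost, since the model does work per token rather than per word.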

3. The Danger: "Amnesia" (Catastrophic Forgetting)

Here's the tricky part. If you suddenly swap a librarian's entire dictionary while they are working, they might get confused and forget how to read any language, including the ones they already knew. This is called Catastrophic Forgetting.

The Fix (FOCUS): To prevent this, the team used a clever technique called FOCUS. Imagine you are teaching a student a new alphabet. Instead of making them start from scratch, you show them how the new letters overlap with the old ones. You say, "This new letter 'A' is just a combination of the old 'A' and 'B'."
The Process: They used math to map the old words to the new words carefully, ensuring Bielik didn't lose his memory of English or his general reasoning skills.
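The "mapping" step can be sketched in a few lines. The idea shown here is: tokens that exist in both dictionaries keep their trained embedding, and each brand-new token starts as a weighted blend of the shared tokens it most resembles. This is a simplified illustration, assuming a plain softmax over cosine similarities and tiny hand-made vectors; the FOCUS method itself and the paper's exact setup may differ, and all names below are hypothetical.

```python
import math


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def init_new_embeddings(old_emb, new_vocab, aux_vec):
    """old_emb: token -> trained embedding from the original model.
    new_vocab: tokens of the new tokenizer.
    aux_vec: token -> helper vector used only to measure similarity.
    """
    shared = [t for t in new_vocab if t in old_emb]
    dim = len(old_emb[shared[0]])
    new_emb = {}
    for token in new_vocab:
        if token in old_emb:
            # Overlapping token: keep the trained embedding unchanged.
            new_emb[token] = list(old_emb[token])
            continue
        # New token: blend shared tokens, weighted by helper-space similarity.
        weights = softmax([cosine(aux_vec[token], aux_vec[s]) for s in shared])
        new_emb[token] = [
            sum(w * old_emb[s][i] for w, s in zip(weights, shared))
            for i in range(dim)
        ]
    return new_emb


# Toy data (made up): "sklepu" is new, "sklep" and "dom" are shared.
old_emb = {"sklep": [1.0, 0.0], "dom": [0.0, 1.0]}
aux_vec = {"sklep": [1.0, 0.1], "dom": [0.0, 1.0], "sklepu": [0.9, 0.2]}
result = init_new_embeddings(old_emb, ["sklep", "dom", "sklepu"], aux_vec)
print(result["sklepu"])  # leans toward the embedding of "sklep"
```

Starting new tokens near related old ones gives the model a sensible first guess, so later training has less to repair.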

4. The Training: "Re-education Camp" (Two-Stage Pipeline)

You can't just swap the dictionary and expect Bielik to be perfect immediately. They had to retrain him in two phases:

Phase 1 (The Warm-up): They froze most of Bielik's brain and only let him practice with the new dictionary on a small amount of text. It was like teaching him how to hold the new knife without letting him cook a full meal yet.
Phase 2 (The Full Meal): Once he got comfortable, they unfroze his whole brain and let him practice on a massive amount of Polish text (16 billion words!). This allowed him to fully adapt his "muscle memory" to the new, efficient dictionary.
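The two phases boil down to a switch that decides which parts of the model are allowed to change. Here is a minimal sketch of that switch. It assumes that Phase 1 trains only the layers that touch the dictionary (the input embeddings and the output layer); the explanation above only says "most" of the model was frozen, so that choice, along with every parameter name below, is an assumption for illustration.

```python
from dataclasses import dataclass


@dataclass
class Param:
    name: str
    trainable: bool = True


# Assumed (hypothetical) name prefixes of the vocabulary-facing layers.
VOCAB_LAYERS = ("embed_tokens", "lm_head")


def set_stage(params, stage):
    """Stage 1: train only vocabulary-facing layers. Stage 2: train everything."""
    for p in params:
        if stage == 1:
            p.trainable = p.name.startswith(VOCAB_LAYERS)
        else:
            p.trainable = True
    return [p.name for p in params if p.trainable]


model = [
    Param("embed_tokens.weight"),
    Param("layers.0.attention.weight"),
    Param("layers.0.mlp.weight"),
    Param("lm_head.weight"),
]

print(set_stage(model, 1))  # only the embedding and output layers
print(set_stage(model, 2))  # every parameter
```

In a real training framework the same switch is usually a per-parameter flag that tells the optimizer whether to update that weight.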

5. The Polish: "Fine-Tuning" (SFT, DPO, GRPO)

After learning the new dictionary, Bielik needed to learn how to be a good conversationalist again.

SFT (Supervised Fine-Tuning): They taught him how to follow instructions and chat politely.
DPO (Direct Preference Optimization): They showed him examples of "good" answers vs. "bad" answers so he learned to prefer being helpful and honest.
GRPO (Group Relative Policy Optimization, a reinforcement learning method): They gave him math and logic puzzles to solve, rewarding him when he got the steps right, sharpening his reasoning skills.
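Two of these steps have a compact mathematical core that can be sketched in code. The first function is the standard DPO loss for a single "good vs. bad" pair: it shrinks when the model favors the good answer more strongly than a frozen reference copy does. The second is the group-relative scoring at the heart of GRPO: several answers to the same puzzle are compared with each other, and each one is scored by how far its reward sits above or below the group average. The input numbers and the `beta` value are illustrative assumptions, not settings from the paper.

```python
import math
from statistics import mean, pstdev


def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one preference pair, given sequence log-probabilities."""
    margin = (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected)
    # -log(sigmoid(beta * margin)), written as log(1 + exp(-beta * margin)).
    return math.log1p(math.exp(-beta * margin))


def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantages: each reward standardized within its group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# The model prefers the good answer more than the reference does: lower loss.
loss_good = dpo_loss(-10.0, -12.0, -11.0, -11.0)
# The model prefers the bad answer more than the reference does: higher loss.
loss_bad = dpo_loss(-12.0, -10.0, -11.0, -11.0)
print(loss_good < loss_bad)  # True

# Four attempts at one puzzle; reward 1.0 means the answer was correct.
advantages = grpo_advantages([1.0, 0.0, 0.0, 1.0])
print(advantages)  # correct attempts get positive scores, wrong ones negative
```

Scoring answers relative to their own group means no separate "judge" model is needed to estimate how good an average answer would be.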

6. The Results: Did it Work?

The team put Bielik through a series of tests (like a report card):

Polish Tests: He aced them! In fact, on some tests involving complex Polish logic and emotions, the new "Polish-optimized" Bielik actually scored higher than the original version.
English Tests: He didn't lose his English skills. He still speaks English very well, proving the "amnesia" fix worked.
Speed & Cost: Because he uses fewer "tokens" (pieces of words) to say the same thing, he is faster and cheaper to run.

The Bottom Line

The creators of Bielik v3 took a powerful, general-purpose AI and gave him a dictionary built specifically for Polish. They did this without making him forget his other languages or his smarts.

Think of it this way: Before, Bielik was a brilliant polyglot who spoke Polish with a stutter because he was using the wrong dictionary. Now, he speaks Polish fluently and quickly, while still being able to chat in English just as well as before. And the best part? They gave away the blueprints (the model weights) for free so everyone can use this super-efficient Polish AI.

Advancing Polish Language Modeling through Tokenizer Optimization in the Bielik v3 7B and 11B Series
