Bielik-Minitron-7B: Compressing Large Language Models via Structured Pruning and Knowledge Distillation for the Polish Language

This paper introduces Bielik-Minitron-7B, a compressed 7.35B-parameter Polish language model created by applying structured pruning and knowledge distillation to the Bielik-11B-v3.0 model. The compressed model achieves a 33.4% parameter reduction and up to a 50% inference speedup while retaining approximately 90% of the original model's performance.

Remigiusz Kinas, Paweł Kiszczak, Sergio P. Perez, Krzysztof Ociepa, Łukasz Flis, Krzysztof Wróbel, Adrian Gwozdziej

Published 2026-03-13

The Big Idea: Shrinking a Giant Without Losing Its Brain

Imagine you have a brilliant, 11-year-old genius student named Bielik. This student is incredibly smart, knows a lot about Polish culture, history, and language, and can solve complex problems. However, there's a catch: to keep this student in your house, you need a massive, expensive mansion (a huge computer server) with a giant library and a team of 50 assistants (50 neural-network layers) to help them think. Most people can't afford a mansion like that.

The goal of this paper is to create a 7-year-old version of this genius student. This new student, Bielik-Minitron, should be small enough to live in a normal apartment (a consumer graphics card like an RTX 4090) but still be almost as smart as the original genius.

The team didn't just try to teach a new, smaller kid from scratch (which takes years and millions of dollars). Instead, they used a clever two-step process to "shrink" the existing genius while keeping their knowledge intact.


Step 1: The "Surgical" Trim (Structured Pruning)

The Analogy: Imagine the original genius student has a backpack full of 11,000 tools. Some are heavy hammers, some are tiny screwdrivers, and some are just useless rocks they picked up along the way.

The team used a technique called Structured Pruning. Instead of randomly throwing things out, they acted like a surgeon. They looked at the student's brain and asked: "Which tools do we actually use every day? Which ones are just taking up space?"

  • The Cut: They removed entire sections of the backpack that weren't being used much. They took out 10 layers of the "thinking process" (reducing the depth) and made the "workbench" narrower (reducing the width).
  • The Result: They cut the backpack's size by 33%. The student is now much lighter and faster to carry, but they still have the most important tools.
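The "surgical" part of the trim can be sketched in a few lines. The toy code below illustrates depth pruning under one simple assumption: a layer's importance is the mean absolute activation it produces. The real importance criteria in the paper may differ, and the `prune_layers` helper is purely illustrative, not the authors' implementation:

```python
import math

def layer_importance(activations):
    """Score each layer by the mean absolute activation it produces.
    (A simple proxy for 'how much work this layer does'; an assumption,
    not necessarily the paper's exact criterion.)"""
    return [sum(abs(a) for a in layer) / len(layer) for layer in activations]

def prune_layers(layers, activations, n_keep):
    """Depth pruning: keep the n_keep highest-scoring layers,
    preserving their original order in the network."""
    scores = layer_importance(activations)
    ranked = sorted(range(len(layers)), key=lambda i: -scores[i])
    keep = sorted(ranked[:n_keep])
    return [layers[i] for i in keep]
```

Width pruning works analogously, ranking individual neurons or attention heads within a layer instead of whole layers.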

Step 2: The "Shadow Training" (Knowledge Distillation)

The Analogy: Now that the student has a smaller backpack, they are a bit confused. They lost some of their muscle memory. If you just let them go, they might forget how to speak Polish perfectly.

So, the team set up a Shadow Training program.

  • The Teacher: The original 11B genius sits next to the new 7B student.
  • The Lesson: The teacher doesn't just say, "The answer is 'Yes'." Instead, the teacher whispers the whole thought process: "I'm 90% sure it's 'Yes', but there's a 5% chance it's 'Maybe', and a 5% chance it's 'No'."
  • The Magic: The small student learns not just the right answers, but the nuance and confidence of the big teacher. This is called Logit-Based Distillation. It's like the small student learning to think exactly like the big one, just with fewer neurons.
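The teacher's whisper of "I'm 90% sure it's 'Yes'…" is, concretely, a temperature-softened probability distribution over answers, and the student is trained to match it with a KL-divergence loss. Here is a minimal pure-Python sketch of standard logit-based distillation (function names and the temperature value are illustrative; real training loops typically also scale the loss by the squared temperature and mix in a hard-label term):

```python
import math

def softmax(logits, temperature=1.0):
    """Turn raw logits into probabilities; a higher temperature
    'softens' the distribution, exposing the teacher's nuance."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL divergence between the teacher's soft targets and the
    student's predictions -- the core of logit-based distillation."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
```

When the student's logits exactly match the teacher's, the loss is zero; the further its distribution drifts from the teacher's, the larger the penalty.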

Step 3: The "Polish School" (Alignment)

The Analogy: Even with the training, the student might be a bit rude or bad at following instructions. So, they went to a specialized "Polish School" for three final rounds of training:

  1. SFT (Supervised Fine-Tuning): Learning how to have a polite conversation and follow rules.
  2. DPO (Direct Preference Optimization): Learning what humans actually like to hear (e.g., "Don't be mean," "Be helpful").
  3. GRPO (Group Relative Policy Optimization, a reinforcement learning method): Practicing logic puzzles and math problems to sharpen their reasoning skills.
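To make the preference-optimization step concrete, the standard DPO loss rewards the model for widening its log-probability margin between the preferred and dispreferred answer, relative to a frozen reference model. This is the generic textbook DPO formulation as a sketch, not the paper's exact training code:

```python
import math

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Direct Preference Optimization loss for one preference pair.
    Inputs are log-probabilities of whole responses under the policy
    (pi_*) and the frozen reference model (ref_*). beta controls how
    strongly the policy may deviate from the reference."""
    margin = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    # -log(sigmoid(margin)): shrinks as the policy prefers the chosen answer
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

The loss is always positive and drops as the model assigns relatively more probability to the answer humans preferred, which is exactly the "learning what humans like to hear" step above.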

The Results: A Super-Compact Powerhouse

After all this work, the team compared the new Bielik-Minitron-7B to the original Bielik-11B.

  • Size: It's 33% smaller. It fits on a standard gaming computer (like an RTX 4090) instead of needing a data center.
  • Speed: It's up to 50% faster. It can type out answers much quicker because it has less "heavy lifting" to do.
  • Smarts: It retained about 90% of the original genius's intelligence.
    • In Polish language tests, it scored almost as high as the big model.
    • It beat other famous 7B models (like Mistral or Qwen) by a huge margin.
    • It even performed better than some 12B and 14B models!

Why This Matters for Everyone

Think of this like downsizing a luxury car. Usually, when you make a car smaller, you lose the V8 engine and the leather seats. But this team managed to shrink the car down to a compact size while keeping the V8 engine and the leather seats.

Why is this a big deal?

  1. For Poland: It proves you can have top-tier AI for the Polish language without needing millions of dollars or supercomputers.
  2. For the World: It shows a blueprint for how to make AI for any language (like Czech, Hungarian, or Swahili) without having to train a giant model from scratch. You can just "shrink" a big model and keep the local flavor.
  3. For You: It means you might soon be able to run a super-smart Polish AI assistant on your own laptop or phone, saving money on cloud servers and keeping your data private.

In short: They took a giant, expensive brain, surgically trimmed the fat, taught the smaller version to think exactly like the big one, and ended up with a tiny, fast, and incredibly smart Polish AI that anyone can run.