Bielik-Minitron-7B: Compressing Large Language Models via Structured Pruning and Knowledge Distillation for the Polish Language

This paper introduces Bielik-Minitron-7B, a compressed 7.35B-parameter Polish language model created by applying structured pruning and knowledge distillation to the Bielik-11B-v3.0 model. The compressed model achieves a 33.4% parameter reduction and up to a 50% inference speedup while retaining approximately 90% of the original model's performance.

Remigiusz Kinas, Paweł Kiszczak, Sergio P. Perez, Krzysztof Ociepa, Łukasz Flis, Krzysztof Wróbel, Adrian Gwozdziej

Published 2026-03-13

The Big Idea: Shrinking a Giant Without Losing Its Brain

Imagine you have a brilliant, 11-year-old genius student named Bielik. This student is incredibly smart, knows a lot about Polish culture, history, and language, and can solve complex problems. However, there's a catch: to keep this student in your house, you need a massive, expensive mansion (a huge computer server) with a giant library and a team of 50 assistants (50 neural-network layers) to help them think. Most people can't afford a mansion like that.

The goal of this paper is to create a 7-year-old version of this genius student. This new student, Bielik-Minitron, should be small enough to live in a normal apartment (a consumer graphics card like an RTX 4090) but still be almost as smart as the original genius.

The team didn't just try to teach a new, smaller kid from scratch (which takes years and millions of dollars). Instead, they used a clever two-step process to "shrink" the existing genius while keeping their knowledge intact.


Step 1: The "Surgical" Trim (Structured Pruning)

The Analogy: Imagine the original genius student has a backpack full of 11,000 tools. Some are heavy hammers, some are tiny screwdrivers, and some are just useless rocks they picked up along the way.

The team used a technique called Structured Pruning. Instead of randomly throwing things out, they acted like a surgeon. They looked at the student's brain and asked: "Which tools do we actually use every day? Which ones are just taking up space?"

  • The Cut: They removed entire sections of the backpack that weren't being used much. They took out 10 layers of the "thinking process" (reducing the depth) and made the "workbench" narrower (reducing the width).
  • The Result: They cut the backpack's size by 33%. The student is now much lighter and faster to carry, but they still have the most important tools.
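The "surgical" part of the trim can be sketched in a few lines. The toy code below illustrates depth pruning under one simple assumption: a layer's importance is the mean absolute activation it produces. The real importance criteria in the paper may differ, and the `prune_layers` helper is purely illustrative, not the authors' implementation:

```python
import math

def layer_importance(activations):
    """Score each layer by the mean absolute activation it produces.
    (A simple proxy for 'how much work this layer does'; an assumption,
    not necessarily the paper's exact criterion.)"""
    return [sum(abs(a) for a in layer) / len(layer) for layer in activations]

def prune_layers(layers, activations, n_keep):
    """Depth pruning: keep the n_keep highest-scoring layers,
    preserving their original order in the network."""
    scores = layer_importance(activations)
    ranked = sorted(range(len(layers)), key=lambda i: -scores[i])
    keep = sorted(ranked[:n_keep])
    return [layers[i] for i in keep]
```

Width pruning works analogously, ranking individual neurons or attention heads within a layer instead of whole layers.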

Step 2: The "Shadow Training" (Knowledge Distillation)

The Analogy: Now that the student has a smaller backpack, they are a bit confused. They lost some of their muscle memory. If you just let them go, they might forget how to speak Polish perfectly.

So, the team set up a Shadow Training program.

  • The Teacher: The original 11B genius sits next to the new 7B student.
  • The Lesson: The teacher doesn't just say, "The answer is 'Yes'." Instead, the teacher whispers the whole thought process: "I'm 90% sure it's 'Yes', but there's a 5% chance it's 'Maybe', and a 5% chance it's 'No'."
  • The Magic: The small student learns not just the right answers, but the nuance and confidence of the big teacher. This is called Logit-Based Distillation. It's like the small student learning to think exactly like the big one, just with fewer neurons.
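The teacher's whisper of "I'm 90% sure it's 'Yes'…" is, concretely, a temperature-softened probability distribution over answers, and the student is trained to match it with a KL-divergence loss. Here is a minimal pure-Python sketch of standard logit-based distillation (function names and the temperature value are illustrative; real training loops typically also scale the loss by the squared temperature and mix in a hard-label term):

```python
import math

def softmax(logits, temperature=1.0):
    """Turn raw logits into probabilities; a higher temperature
    'softens' the distribution, exposing the teacher's nuance."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL divergence between the teacher's soft targets and the
    student's predictions -- the core of logit-based distillation."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
```

When the student's logits exactly match the teacher's, the loss is zero; the further its distribution drifts from the teacher's, the larger the penalty.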

Step 3: The "Polish School" (Alignment)

The Analogy: Even with the training, the student might be a bit rude or bad at following instructions. So, they went to a specialized "Polish School" for three final rounds of training:

  1. SFT (Supervised Fine-Tuning): Learning how to have a polite conversation and follow rules.
  2. DPO (Direct Preference Optimization): Learning what humans actually like to hear (e.g., "Don't be mean," "Be helpful").
  3. GRPO (Group Relative Policy Optimization, a reinforcement learning method): Practicing logic puzzles and math problems to sharpen their reasoning skills.
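To make the preference-optimization step concrete, the standard DPO loss rewards the model for widening its log-probability margin between the preferred and dispreferred answer, relative to a frozen reference model. This is the generic textbook DPO formulation as a sketch, not the paper's exact training code:

```python
import math

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Direct Preference Optimization loss for one preference pair.
    Inputs are log-probabilities of whole responses under the policy
    (pi_*) and the frozen reference model (ref_*). beta controls how
    strongly the policy may deviate from the reference."""
    margin = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    # -log(sigmoid(margin)): shrinks as the policy prefers the chosen answer
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

The loss is always positive and drops as the model assigns relatively more probability to the answer humans preferred, which is exactly the "learning what humans like to hear" step above.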

The Results: A Super-Compact Powerhouse

After all this work, the team compared the new Bielik-Minitron-7B to the original Bielik-11B.

  • Size: It's 33% smaller. It fits on a standard gaming computer (like an RTX 4090) instead of needing a data center.
  • Speed: It's up to 50% faster. It can type out answers much quicker because it has less "heavy lifting" to do.
  • Smarts: It retained about 90% of the original genius's intelligence.
    • In Polish language tests, it scored almost as high as the big model.
    • It beat other famous 7B models (like Mistral or Qwen) by a huge margin.
    • It even performed better than some 12B and 14B models!

Why This Matters for Everyone

Think of this like downsizing a luxury car. Usually, when you make a car smaller, you lose the V8 engine and the leather seats. But this team managed to shrink the car down to a compact size while keeping the V8 engine and the leather seats.

Why is this a big deal?

  1. For Poland: It proves you can have top-tier AI for the Polish language without needing millions of dollars or supercomputers.
  2. For the World: It shows a blueprint for how to make AI for any language (like Czech, Hungarian, or Swahili) without having to train a giant model from scratch. You can just "shrink" a big model and keep the local flavor.
  3. For You: It means you might soon be able to run a super-smart Polish AI assistant on your own laptop or phone, saving money on cloud servers and keeping your data private.

In short: They took a giant, expensive brain, surgically trimmed the fat, taught the smaller version to think exactly like the big one, and ended up with a tiny, fast, and incredibly smart Polish AI that anyone can run.