Imagine you have a brilliant chef, Chef A, who is a master at making French pastries, and another chef, Chef B, who is a world-class sushi expert.
Right now, if you want a restaurant that serves both perfect pastries and perfect sushi, you usually have to hire two separate kitchens (two separate AI models). This is expensive, takes up a lot of space, and is hard to manage.
Alternatively, you could try to teach Chef A how to make sushi by making them read every sushi book in the world and practice for months. But there's a catch: in the process of learning sushi, Chef A might start forgetting how to make their famous croissants. This is called "catastrophic forgetting."
This paper introduces a new solution called GraftLLM. Instead of trying to retrain the whole chef or hire two kitchens, they come up with a clever trick: The "SkillPack."
The Core Idea: The "SkillPack" Backpack
Think of an AI model (like a Large Language Model) as a base chef who is already very good at general conversation and basic cooking.
When you want to give this base chef a new superpower (like coding, math, or legal advice), GraftLLM doesn't rewrite the chef's entire brain. Instead, it creates a tiny, lightweight backpack called a SkillPack.
- The Grafting Process: The system looks at a "Master Chef" (a huge, powerful AI) who is already great at a specific task. It figures out exactly what makes that master chef so good at that one thing.
- The Compression: It takes those specific "good habits" and compresses them into a tiny, efficient backpack (the SkillPack). This is like taking a whole library of sushi recipes and condensing them into a single, perfectly organized cheat sheet.
- The Attachment: You then "graft" (attach) this backpack onto your base chef. Now, your base chef can instantly make sushi without ever having to forget how to make pastries.
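The three steps above boil down to parameter arithmetic. Here is a minimal sketch of one common way to build such a "backpack" (the paper's exact recipe may differ): subtract the base model's weights from the expert's weights, compress that difference with a low-rank approximation, and add it back on demand. The function names `extract_skillpack` and `graft` are illustrative, not from the paper.

```python
import numpy as np

def extract_skillpack(base_w, expert_w, rank):
    """Compress the expert-minus-base weight delta into a low-rank 'SkillPack'.

    Illustrative sketch only: keep the top-`rank` singular directions of the
    delta, like condensing a library into a cheat sheet."""
    delta = expert_w - base_w
    u, s, vt = np.linalg.svd(delta, full_matrices=False)
    return u[:, :rank] * s[:rank], vt[:rank, :]

def graft(base_w, skillpack):
    """Attach the SkillPack: base weights plus the reconstructed delta."""
    a, b = skillpack
    return base_w + a @ b

rng = np.random.default_rng(0)
base = rng.standard_normal((64, 64))
# Pretend the expert differs from the base by a rank-4 update.
expert = base + rng.standard_normal((64, 4)) @ rng.standard_normal((4, 64))

pack = extract_skillpack(base, expert, rank=4)
grafted = graft(base, pack)
print(np.allclose(grafted, expert, atol=1e-6))  # True: the delta is recovered
```

Note the size win: the full 64x64 delta has 4,096 numbers, while the rank-4 pack stores only 2 x 64 x 4 = 512, and removing the pack restores the original base model exactly.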
Why is this better than the old ways?
The paper compares GraftLLM to two other common methods:
- The "Full Retraining" Method (Knowledge Distillation): This is like forcing the base chef to go to culinary school for a year to learn sushi. It works, but it's slow, expensive, and the chef might forget their old recipes.
- The "Simple Add-on" Method (PEFT/LoRA): This is like giving the chef a simple apron with a few notes. It's cheap, but the notes aren't detailed enough, so the sushi isn't as good as the master chef's.
GraftLLM is the sweet spot: It's as light as the apron (cheap and fast) but as effective as the full culinary school (high quality).
The "Magic Backpack" Features
The paper highlights three superpowers of this SkillPack approach:
- No Forgetting (Forget-Free Learning): Because the backpack is separate from the chef's brain, you can take it off and put on a different one (e.g., a "Lawyer Backpack") without the chef forgetting how to be a "Math Wizard." It's like swapping backpacks; your brain stays the same, but your tools change.
- Mixing and Matching (Model Fusion): Imagine you have a backpack for "Finance," one for "Medicine," and one for "Law." With GraftLLM, you can have a router (a smart switch) that looks at your question. If you ask about stocks, it automatically puts on the Finance backpack. If you ask about a heart condition, it switches to the Medicine backpack. You get the best of all worlds in one model.
- Tiny Size: The paper shows that these backpacks are incredibly small. You can take the knowledge of a massive 72-billion-parameter AI and compress it into a backpack that is only a fraction of the size, yet it still works almost as well as the giant model.
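The "smart switch" in the fusion point above can be illustrated with a toy router. The keyword matching below is a deliberately crude stand-in (a real system would likely use a learned router), and the skill names are hypothetical:

```python
import re

# Hypothetical mapping from SkillPack name to trigger words.
SKILLPACKS = {
    "finance": {"stocks", "bond", "market"},
    "medicine": {"heart", "symptom", "drug"},
    "law": {"contract", "lawsuit", "statute"},
}

def route(query):
    """Pick which SkillPack to graft for this query (toy keyword router)."""
    words = set(re.findall(r"[a-z]+", query.lower()))
    for skill, keywords in SKILLPACKS.items():
        if words & keywords:
            return skill
    return "base"  # no specialist needed; answer with the base model alone

print(route("Should I buy these stocks?"))   # finance
print(route("What causes a heart murmur?"))  # medicine
```

The key property is that each query grafts exactly one backpack onto the same shared base, so adding a fourth skill is just adding a fourth entry, not retraining anything.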
The "Module-Aware" Secret Sauce
How do they make the backpack so small without losing quality? They use a smart compression strategy.
Think of the AI model as a house with different rooms:
- The Kitchen (Attention Modules): This is where the heavy lifting happens. The intuition: squeeze this room too hard, and the food will taste bad. So they compress it carefully.

- The Hallway (Embedding/Head): This is just for passing things through. They can squeeze this room a lot because it doesn't hold much "flavor."
By treating each room differently, they can shrink the backpack significantly without breaking anything.
The Bottom Line
GraftLLM is like a universal adapter for AI. It allows us to take the best skills from giant, expensive AI models and pack them into tiny, portable "SkillPacks" that can be attached to smaller, cheaper models.
This means:
- Cheaper AI: You don't need a supercomputer to run a model that knows everything.
- Safer AI: If an AI learns something bad (like how to write hate speech), you can just "unplug" that specific backpack without deleting the whole model.
- Smarter AI: You can mix and match skills (coding + writing + math) instantly without the model getting confused.
In short, GraftLLM turns the messy, expensive process of teaching AI new tricks into a simple game of "plug-and-play" backpacks.