Imagine you have a team of expert chefs. One is a master of Italian pasta, another is a genius at spicy Thai curries, and a third is a wizard at French desserts. You want to create one single "Super Chef" who can cook all three cuisines perfectly, but you don't have time to train a new person from scratch.
In the world of AI, these chefs are Large Language Models (LLMs) that have been fine-tuned for specific tasks. The process of combining them is called Model Merging.
The Problem: The "Smoothie" Mistake
For a long time, scientists tried to merge these models by simply averaging their brains.
Think of it like taking a cup of pasta sauce, a cup of curry, and a cup of dessert, pouring them into a blender, and hitting "mix."
- The Result: You get a muddy, flavorless sludge.
- Why? The paper explains that AI models don't live in a flat, straight-line world (Euclidean space). They live on a curved, complex landscape (a manifold). When you simply average them, you draw a straight line that cuts through the curved surface instead of following it, so the merged model lands off the manifold entirely.
- The Consequence: The "Super Chef" loses their spark. The spread of the model's internal features shrinks (variance collapse), and its representations flatten into fewer effective dimensions (rank collapse). The result is a generic, boring robot that can't do any of the original tasks well.
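The shrinking effect of "smoothie" averaging shows up even in a toy setting. The sketch below is illustrative only (the dimension and random vectors are made up, not from the paper): it averages two random high-dimensional "expert" weight vectors, which in high dimensions are nearly orthogonal, and measures how much the result shrinks.

```python
import math
import random

random.seed(0)
dim = 1000  # illustrative size; real LLMs have billions of parameters

# Two hypothetical "expert" weight vectors. In high dimensions, two
# independent random directions are almost orthogonal, which mimics
# experts fine-tuned for very different tasks.
a = [random.gauss(0.0, 1.0) for _ in range(dim)]
b = [random.gauss(0.0, 1.0) for _ in range(dim)]

# The "smoothie" merge: plain coordinate-wise averaging.
avg = [(x + y) / 2.0 for x, y in zip(a, b)]

def norm(v):
    return math.sqrt(sum(x * x for x in v))

# For nearly orthogonal vectors, the straight-line average is noticeably
# shorter than either expert: ||avg|| is close to ||a|| / sqrt(2), not ||a||.
ratio = norm(avg) / norm(a)
print(round(ratio, 3))
```

The merged vector loses roughly 30% of its length in one merge, and the loss compounds as more experts are blended in, which is the geometric face of the collapse described above.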
The Solution: The "Geodesic" Path
The authors of this paper propose a smarter way to merge these chefs. Instead of a blender, they use a map of the terrain.
They treat the merging process as finding a Karcher Mean (the math term for a "center point": the point that minimizes the total squared geodesic distance to all the models) on a curved surface called the Fisher–Rao Manifold.
Here is the analogy:
- Old Way (Linear Averaging): Imagine two cities on a globe. If you draw a straight line through the Earth's core to get from one to the other, you end up in the middle of the planet (where there is no air). This is what current methods do; they go "underground" and lose the model's quality.
- New Way (Karcher Mean): Imagine walking along the surface of the Earth instead. This curved route (a geodesic) is the shortest path that never leaves the surface, and it keeps you on the "high-performing manifold": the region where the models actually work well.
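The globe analogy can be checked numerically. In this minimal sketch, the two "cities" are hypothetical unit vectors 90 degrees apart; the straight-line midpoint drops below the surface, while the great-circle midpoint stays on it.

```python
import math

def norm(v):
    return math.sqrt(sum(x * x for x in v))

def geodesic_midpoint(u, v):
    # Great-circle midpoint of two unit vectors: take the chord midpoint
    # and project it back onto the sphere. For the midpoint specifically,
    # this equals spherical interpolation (slerp) at t = 0.5.
    m = [(a + b) / 2.0 for a, b in zip(u, v)]
    n = norm(m)
    return [x / n for x in m]

# Two "cities" 90 degrees apart on the unit sphere.
u = [1.0, 0.0, 0.0]
v = [0.0, 1.0, 0.0]

chord_mid = [(a + b) / 2.0 for a, b in zip(u, v)]  # straight line "through the Earth"
geo_mid = geodesic_midpoint(u, v)                  # path along the surface

print(round(norm(chord_mid), 3))  # below the surface: shorter than 1
print(round(norm(geo_mid), 3))    # still on the sphere: length 1
```

The chord midpoint has length 1/sqrt(2), about 0.707, which is exactly the "underground" shrinkage that linear averaging inflicts on merged models.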
How It Works (The "Spherical Proxy")
Calculating the exact curved path on the Fisher–Rao manifold is computationally infeasible for a model with billions of parameters. So, the authors created a clever shortcut, which they call a "Spherical Proxy."
- Normalize: They treat each model's weights as one long vector and keep only its direction on a sphere, setting aside how "big" the numbers are (the norm) for a moment.
- Walk the Curve: They calculate the average direction by walking along the surface of that sphere (like finding the center of a group of people standing on a globe).
- Rescale: They give the new model back its original "strength" (norm).
This ensures the new model doesn't shrink or lose its features. It stays on the "high ground" where the intelligence lives.
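The three steps above can be sketched in a few lines. This is a simplified illustration, not the paper's implementation: the function name `karcher_merge`, the fixed iteration count, and the choice to rescale by the average of the original norms are all assumptions made here for the sketch, and each "model" is just a flat list of weights.

```python
import math

def normalize(v):
    # Split a weight vector into its direction and its "strength" (norm).
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v], n

def log_map(p, q):
    # Tangent vector at unit vector p pointing toward q along the sphere,
    # with length equal to the geodesic (arc) distance between them.
    dot = max(-1.0, min(1.0, sum(a * b for a, b in zip(p, q))))
    theta = math.acos(dot)
    if theta < 1e-12:
        return [0.0] * len(p)
    u = [b - dot * a for a, b in zip(p, q)]  # component of q orthogonal to p
    un = math.sqrt(sum(x * x for x in u))
    return [theta * x / un for x in u]

def exp_map(p, t):
    # Walk from p along tangent vector t while staying on the sphere.
    tn = math.sqrt(sum(x * x for x in t))
    if tn < 1e-12:
        return list(p)
    return [math.cos(tn) * a + math.sin(tn) * x / tn for a, x in zip(p, t)]

def karcher_merge(models, steps=50):
    # 1. Normalize: keep each model's direction, remember its norm.
    dirs, norms = zip(*(normalize(m) for m in models))
    # 2. Walk the curve: iteratively move the estimate toward the point
    #    where the average tangent direction to all models is zero
    #    (a standard fixed-point iteration for the spherical mean).
    mean = list(dirs[0])
    for _ in range(steps):
        tangents = [log_map(mean, d) for d in dirs]
        avg_t = [sum(t) / len(dirs) for t in zip(*tangents)]
        mean = exp_map(mean, avg_t)
    # 3. Rescale: restore the original "strength" (here, the average norm).
    scale = sum(norms) / len(norms)
    return [scale * x for x in mean]
```

Merging [2, 0, 0] with [0, 2, 0] this way gives roughly [sqrt(2), sqrt(2), 0]: the direction is the great-circle midpoint, and the length matches the originals instead of shrinking to the chord midpoint's length.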
The Results: A Super Chef Who Actually Works
The paper tested this new method (called KARCHER) against all the old "blender" methods.
- When merging 2 models: It was slightly better than the others.
- When merging 5, 10, or even 11 models: The old methods completely crashed. The "Super Chef" became useless. But the KARCHER method stayed stable and strong, even when combining very different experts.
- The "Collapse" Fix: They checked the "brain activity" of the new models. The old methods made the brain go quiet and flat (collapse). The KARCHER method kept the brain active, diverse, and ready to think.
In a Nutshell
The paper says: "Don't just average AI models like smoothies. Instead, walk the curved path between them to find a center point that preserves their unique skills."
This allows us to combine many different AI experts into one powerful, stable super-model without needing to retrain them or lose their intelligence. It's like creating a true polymath who can speak every language and cook every cuisine, without losing their soul.