Do We Really Need Permutations? Impact of Model Width on Linear Mode Connectivity

This paper demonstrates that simply widening neural network models, combined with suitable softmax temperature calibration, is sufficient to achieve linear mode connectivity without parameter permutations. The authors explain this through the layerwise exponentially weighted connectivity (LEWC) property: the merged model's layer outputs behave like exponentially weighted sums of the original models' layer outputs.

Akira Ito, Masanori Yamada, Daiki Chijiwa, Atsutoshi Kumagai

Published 2026-03-06

Imagine you have two master chefs, Chef A and Chef B. They both learned to cook the exact same dish (say, a perfect lasagna) completely independently. They used different recipes, different ingredients, and different techniques, but they both ended up with delicious lasagnas.

Now, imagine you want to create a "Super Chef" by mixing their recipes together. You take Chef A's recipe and Chef B's recipe and average them out, layer by layer.

The Old Problem:
In the past, scientists found that if you just mixed the recipes randomly, the result was a disaster. The lasagna would taste like burnt toast. The reason? Chef A and Chef B might have labeled their ingredients differently. Chef A calls the tomato sauce "Red Sauce," while Chef B calls it "Tomato Base." If you just mix the bowls without realizing they are the same thing, you get a mess.

To fix this, previous research said you had to do two things:

  1. Re-label everything: You had to carefully match Chef A's "Red Sauce" to Chef B's "Tomato Base" (this is called Permutation).
  2. Make the kitchen huge: You needed a massive kitchen with thousands of extra shelves and tools (this is called Model Width) to make sure there was enough room to find the right matches.
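The "re-labeling" step exploits a symmetry of neural networks: shuffling the hidden units of a layer (and the matching weights of the next layer) leaves the network's function unchanged. A minimal numpy sketch of this symmetry, using a toy two-layer network (not the paper's code):

```python
import numpy as np

rng = np.random.default_rng(0)

# A tiny 2-layer network: y = W2 @ relu(W1 @ x)
W1 = rng.normal(size=(8, 4))   # hidden layer (width 8)
W2 = rng.normal(size=(3, 8))   # output layer
x = rng.normal(size=4)

relu = lambda z: np.maximum(z, 0.0)
y = W2 @ relu(W1 @ x)

# Permute the hidden units: shuffle the rows of W1 together with the
# matching columns of W2.
perm = rng.permutation(8)
y_perm = W2[:, perm] @ relu(W1[perm] @ x)

# The output is identical -- this is the symmetry that permutation-based
# merging methods search over before averaging two models.
assert np.allclose(y, y_perm)
```

Permutation-matching methods spend their compute finding the shuffle that best aligns two trained models under this symmetry; the paper asks whether that search is needed at all.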

The New Discovery:
This paper asks a simple question: "Do we really need to do all that re-labeling if we just make the kitchen big enough?"

The answer is no, you don't: you can skip the re-labeling entirely.

Here is the breakdown of their findings using simple analogies:

1. The "Big Kitchen" Effect (Model Width)

The researchers found that if you make the neural network (the kitchen) wide enough, the two chefs naturally start using the same "language" without you having to force it.

  • The Analogy: Imagine two people trying to describe a picture. If they have a tiny vocabulary (narrow model), they might struggle to agree on what a "dog" is. But if they have a massive vocabulary (wide model), they have so many words to choose from that they naturally find a way to describe the dog that aligns perfectly, even if they started with different dictionaries.
  • The Result: When the model is wide enough, simply averaging the two recipes (weights) creates a Super Chef that is just as good as the original two. You don't need to shuffle the ingredients around first.
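"Averaging the two recipes" is literally a layerwise linear interpolation of the weights, with no permutation applied first. An illustrative sketch with random stand-in weights (the function names and shapes here are invented for the example):

```python
import numpy as np

# Two independently "trained" models with the same architecture.
# Random weights stand in for Chef A and Chef B.
def init_model(seed):
    r = np.random.default_rng(seed)
    return {"W1": r.normal(size=(16, 4)), "W2": r.normal(size=(3, 16))}

model_a = init_model(10)
model_b = init_model(20)

# Merge by interpolating each layer's weights directly.
# alpha = 0.5 is the plain average; no neuron shuffling beforehand.
def merge(m_a, m_b, alpha=0.5):
    return {k: (1 - alpha) * m_a[k] + alpha * m_b[k] for k in m_a}

merged = merge(model_a, model_b)
assert np.allclose(merged["W1"], (model_a["W1"] + model_b["W1"]) / 2)
```

The paper's claim is that when the hidden width is large enough, this naive average sits in a low-loss region connecting the two original models.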

2. The "Silent Neurons" Secret (Why it works)

Why does a bigger kitchen fix the problem? The paper introduces a concept called LEWC (Layerwise Exponentially Weighted Connectivity).

  • The Analogy: Think of the neurons in the network as light switches in a giant hallway.
    • In a small hallway, if Chef A turns on Switch #1 and Chef B turns on Switch #2, and you mix them, you might get a short circuit.
    • In a huge hallway (a wide model), the switches are so spread out that Chef A's "on" switches and Chef B's "on" switches rarely overlap. They are like two groups of people standing in a massive stadium; they are so far apart that they don't bump into each other.
  • The Magic: Because they don't overlap, when you mix them, the "on" switches from Chef A stay "on," and the "on" switches from Chef B stay "on." They don't cancel each other out. The final result is a perfect blend of both chefs' work.
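The non-overlapping switches can be made concrete with ReLU units. In this toy illustration (which assumes, for simplicity, that the merged layer's pre-activation is the average of the two originals), each model activates a disjoint set of units, so averaging loses nothing except overall magnitude:

```python
import numpy as np

width = 10
# Pre-activations of one layer in two models. Each model only "turns on"
# its own disjoint set of units -- the non-overlapping switches.
h_a = np.zeros(width); h_a[:5] = [2.0, 1.5, 3.0, 0.5, 1.0]
h_b = np.zeros(width); h_b[5:] = [1.0, 2.5, 0.5, 2.0, 1.5]

relu = lambda z: np.maximum(z, 0.0)

# Output of the merged layer vs. the average of the two original outputs.
merged_out = relu((h_a + h_b) / 2)
avg_of_outs = (relu(h_a) + relu(h_b)) / 2

# Because the active units never overlap, nothing cancels: both chefs'
# signals survive in the merged layer, just scaled down by 1/2.
assert np.allclose(merged_out, avg_of_outs)
```

That factor-of-1/2 attenuation at every layer is exactly the "quiet signal" that the next section's volume knob fixes.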

3. The "Volume Knob" (Softmax Temperature)

There is one small catch. When you mix two wide models, the final signal can get a little quiet (the numbers get smaller).

  • The Fix: It's like turning up the volume knob on a stereo. The paper shows that if you adjust the "temperature" (a mathematical volume control applied to the output logits before the softmax), the signal becomes loud and clear again, and the merged model's performance matches the originals.
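Temperature scaling is a one-line fix: divide the logits by a temperature below 1 before the softmax to sharpen the output distribution. A self-contained sketch, where the merged model's attenuation is mimicked by simply halving the logits:

```python
import numpy as np

def softmax(z, temperature=1.0):
    # Dividing logits by a temperature < 1 sharpens the distribution --
    # the "volume knob" that restores a merged model's quiet logits.
    z = z / temperature
    z = z - z.max()          # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

# Logits of an original model vs. the "quieter" logits of a merged model
# (here: simply halved, to mimic the attenuation described above).
logits = np.array([4.0, 1.0, 0.5])
merged_logits = logits / 2

p_orig = softmax(logits)
p_flat = softmax(merged_logits)               # flatter, less confident
p_fixed = softmax(merged_logits, temperature=0.5)

# With temperature 0.5, the halved logits are rescaled back exactly.
assert np.allclose(p_orig, p_fixed)
```

In this idealized case the right temperature exactly undoes the attenuation; in the paper the temperature is calibrated rather than known in advance.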

4. The "Low-Rank" Requirement

The paper also discovered that this only works if the chefs aren't trying to use every single tool in the kitchen.

  • The Analogy: If the chefs are using every single tool in a 10,000-piece toolbox, they will clash. But if they are "lazy" and only use a small, specific set of tools (a low-rank structure), they leave plenty of empty space for the other chef to work.
  • The Lesson: The training process (specifically something called "weight decay") naturally forces the chefs to be "lazy" and use fewer tools, which makes this mixing magic possible.
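"Using few tools" corresponds to the weight matrices having low effective rank, which can be checked with a singular value decomposition. A sketch with a synthetic rank-4 matrix standing in for a trained layer:

```python
import numpy as np

rng = np.random.default_rng(2)

# A width-256 layer whose weights are secretly rank 4: the "lazy chef"
# who only uses a handful of tools despite a huge toolbox.
U = rng.normal(size=(256, 4))
V = rng.normal(size=(4, 256))
W = U @ V

# Singular values reveal how many directions the layer actually uses.
singular = np.linalg.svd(W, compute_uv=False)
effective_rank = int((singular > 1e-8 * singular[0]).sum())

# Only 4 of the 256 available dimensions carry any energy.
assert effective_rank == 4
```

In real trained networks the spectrum decays gradually rather than cutting off sharply, but regularization such as weight decay pushes it toward this low-rank regime, leaving the "empty space" that makes merging work.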

Summary

The Big Takeaway:
For a long time, we thought merging two AI models was like trying to merge two different languages—you needed a translator (permutation) and a huge dictionary (width).

This paper proves that if you just make the dictionary big enough, the two languages naturally become compatible. You don't need a translator anymore. You just need to turn up the volume slightly at the end.

Why does this matter?

  • Simpler AI: We don't need complex algorithms to shuffle model parts around.
  • Better Merging: We can combine different AI models (like merging a model trained on cats with one trained on dogs) much more easily to create a smarter, more robust AI.
  • Efficiency: It saves us time and computing power by skipping the difficult "search for the right permutation" step.

In short: Bigger models are not just smarter; they are more cooperative. They naturally find a way to work together without needing a middleman.
