Do We Really Need Permutations? Impact of Model Width on Linear Mode Connectivity

This paper demonstrates that simply widening neural network models, combined with suitable softmax temperature calibration, is sufficient to achieve linear mode connectivity without parameter permutations. The authors explain this through the layerwise exponentially weighted connectivity (LEWC) property: the merged model's layer outputs behave like exponentially weighted sums of the original models' layer outputs.

Akira Ito, Masanori Yamada, Daiki Chijiwa, Atsutoshi Kumagai

Published 2026-03-06

Imagine you have two master chefs, Chef A and Chef B. They both learned to cook the exact same dish (say, a perfect lasagna) completely independently. They used different recipes, different ingredients, and different techniques, but they both ended up with delicious lasagnas.

Now, imagine you want to create a "Super Chef" by mixing their recipes together. You take Chef A's recipe and Chef B's recipe and average them out, layer by layer.

The Old Problem:
In the past, scientists found that if you just mixed the recipes randomly, the result was a disaster. The lasagna would taste like burnt toast. The reason? Chef A and Chef B might have labeled their ingredients differently. Chef A calls the tomato sauce "Red Sauce," while Chef B calls it "Tomato Base." If you just mix the bowls without realizing they are the same thing, you get a mess.

To fix this, previous research said you had to do two things:

  1. Re-label everything: You had to carefully match Chef A's "Red Sauce" to Chef B's "Tomato Base" (this is called Permutation).
  2. Make the kitchen huge: You needed a massive kitchen with thousands of extra shelves and tools (this is called Model Width) to make sure there was enough room to find the right matches.
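The "re-labeling" step exploits a symmetry of neural networks: shuffling the hidden units of a layer (and the matching weights of the next layer) leaves the network's function unchanged. A minimal numpy sketch of this symmetry, using a toy two-layer network (not the paper's code):

```python
import numpy as np

rng = np.random.default_rng(0)

# A tiny 2-layer network: y = W2 @ relu(W1 @ x)
W1 = rng.normal(size=(8, 4))   # hidden layer (width 8)
W2 = rng.normal(size=(3, 8))   # output layer
x = rng.normal(size=4)

relu = lambda z: np.maximum(z, 0.0)
y = W2 @ relu(W1 @ x)

# Permute the hidden units: shuffle the rows of W1 together with the
# matching columns of W2.
perm = rng.permutation(8)
y_perm = W2[:, perm] @ relu(W1[perm] @ x)

# The output is identical -- this is the symmetry that permutation-based
# merging methods search over before averaging two models.
assert np.allclose(y, y_perm)
```

Permutation-matching methods spend their compute finding the shuffle that best aligns two trained models under this symmetry; the paper asks whether that search is needed at all.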

The New Discovery:
This paper asks a simple question: "Do we really need to do all that re-labeling if we just make the kitchen big enough?"

The answer is no, you don't: you can skip the re-labeling entirely.

Here is the breakdown of their findings using simple analogies:

1. The "Big Kitchen" Effect (Model Width)

The researchers found that if you make the neural network (the kitchen) wide enough, the two chefs naturally start using the same "language" without you having to force it.

  • The Analogy: Imagine two people trying to describe a picture. If they have a tiny vocabulary (narrow model), they might struggle to agree on what a "dog" is. But if they have a massive vocabulary (wide model), they have so many words to choose from that they naturally find a way to describe the dog that aligns perfectly, even if they started with different dictionaries.
  • The Result: When the model is wide enough, simply averaging the two recipes (weights) creates a Super Chef that is just as good as the original two. You don't need to shuffle the ingredients around first.
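"Averaging the two recipes" is literally a layerwise linear interpolation of the weights, with no permutation applied first. An illustrative sketch with random stand-in weights (the function names and shapes here are invented for the example):

```python
import numpy as np

# Two independently "trained" models with the same architecture.
# Random weights stand in for Chef A and Chef B.
def init_model(seed):
    r = np.random.default_rng(seed)
    return {"W1": r.normal(size=(16, 4)), "W2": r.normal(size=(3, 16))}

model_a = init_model(10)
model_b = init_model(20)

# Merge by interpolating each layer's weights directly.
# alpha = 0.5 is the plain average; no neuron shuffling beforehand.
def merge(m_a, m_b, alpha=0.5):
    return {k: (1 - alpha) * m_a[k] + alpha * m_b[k] for k in m_a}

merged = merge(model_a, model_b)
assert np.allclose(merged["W1"], (model_a["W1"] + model_b["W1"]) / 2)
```

The paper's claim is that when the hidden width is large enough, this naive average sits in a low-loss region connecting the two original models.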

2. The "Silent Neurons" Secret (Why it works)

Why does a bigger kitchen fix the problem? The paper introduces a concept called LEWC (Layerwise Exponentially Weighted Connectivity).

  • The Analogy: Think of the neurons in the network as light switches in a giant hallway.
    • In a small hallway, if Chef A turns on Switch #1 and Chef B turns on Switch #2, and you mix them, you might get a short circuit.
    • In a huge hallway (a wide model), the switches are so spread out that Chef A's "on" switches and Chef B's "on" switches rarely overlap. They are like two groups of people standing in a massive stadium; they are so far apart that they don't bump into each other.
  • The Magic: Because they don't overlap, when you mix them, the "on" switches from Chef A stay "on," and the "on" switches from Chef B stay "on." They don't cancel each other out. The final result is a perfect blend of both chefs' work.
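The non-overlapping switches can be made concrete with ReLU units. In this toy illustration (which assumes, for simplicity, that the merged layer's pre-activation is the average of the two originals), each model activates a disjoint set of units, so averaging loses nothing except overall magnitude:

```python
import numpy as np

width = 10
# Pre-activations of one layer in two models. Each model only "turns on"
# its own disjoint set of units -- the non-overlapping switches.
h_a = np.zeros(width); h_a[:5] = [2.0, 1.5, 3.0, 0.5, 1.0]
h_b = np.zeros(width); h_b[5:] = [1.0, 2.5, 0.5, 2.0, 1.5]

relu = lambda z: np.maximum(z, 0.0)

# Output of the merged layer vs. the average of the two original outputs.
merged_out = relu((h_a + h_b) / 2)
avg_of_outs = (relu(h_a) + relu(h_b)) / 2

# Because the active units never overlap, nothing cancels: both chefs'
# signals survive in the merged layer, just scaled down by 1/2.
assert np.allclose(merged_out, avg_of_outs)
```

That factor-of-1/2 attenuation at every layer is exactly the "quiet signal" that the next section's volume knob fixes.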

3. The "Volume Knob" (Softmax Temperature)

There is one small catch. When you mix two wide models, the final signal can get a little quiet (the numbers get smaller).

  • The Fix: It's like turning up the volume knob on a stereo. The paper shows that if you adjust the "temperature" (a mathematical volume control applied to the output logits before the softmax), the signal becomes loud and clear again, and the merged model's performance matches the originals.
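Temperature scaling is a one-line fix: divide the logits by a temperature below 1 before the softmax to sharpen the output distribution. A self-contained sketch, where the merged model's attenuation is mimicked by simply halving the logits:

```python
import numpy as np

def softmax(z, temperature=1.0):
    # Dividing logits by a temperature < 1 sharpens the distribution --
    # the "volume knob" that restores a merged model's quiet logits.
    z = z / temperature
    z = z - z.max()          # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

# Logits of an original model vs. the "quieter" logits of a merged model
# (here: simply halved, to mimic the attenuation described above).
logits = np.array([4.0, 1.0, 0.5])
merged_logits = logits / 2

p_orig = softmax(logits)
p_flat = softmax(merged_logits)               # flatter, less confident
p_fixed = softmax(merged_logits, temperature=0.5)

# With temperature 0.5, the halved logits are rescaled back exactly.
assert np.allclose(p_orig, p_fixed)
```

In this idealized case the right temperature exactly undoes the attenuation; in the paper the temperature is calibrated rather than known in advance.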

4. The "Low-Rank" Requirement

The paper also discovered that this only works if the chefs aren't trying to use every single tool in the kitchen.

  • The Analogy: If the chefs are using every single tool in a 10,000-piece toolbox, they will clash. But if they are "lazy" and only use a small, specific set of tools (a low-rank structure), they leave plenty of empty space for the other chef to work.
  • The Lesson: The training process (specifically something called "weight decay") naturally forces the chefs to be "lazy" and use fewer tools, which makes this mixing magic possible.
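"Using few tools" corresponds to the weight matrices having low effective rank, which can be checked with a singular value decomposition. A sketch with a synthetic rank-4 matrix standing in for a trained layer:

```python
import numpy as np

rng = np.random.default_rng(2)

# A width-256 layer whose weights are secretly rank 4: the "lazy chef"
# who only uses a handful of tools despite a huge toolbox.
U = rng.normal(size=(256, 4))
V = rng.normal(size=(4, 256))
W = U @ V

# Singular values reveal how many directions the layer actually uses.
singular = np.linalg.svd(W, compute_uv=False)
effective_rank = int((singular > 1e-8 * singular[0]).sum())

# Only 4 of the 256 available dimensions carry any energy.
assert effective_rank == 4
```

In real trained networks the spectrum decays gradually rather than cutting off sharply, but regularization such as weight decay pushes it toward this low-rank regime, leaving the "empty space" that makes merging work.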

Summary

The Big Takeaway:
For a long time, we thought merging two AI models was like trying to merge two different languages—you needed a translator (permutation) and a huge dictionary (width).

This paper proves that if you just make the dictionary big enough, the two languages naturally become compatible. You don't need a translator anymore. You just need to turn up the volume slightly at the end.

Why does this matter?

  • Simpler AI: We don't need complex algorithms to shuffle model parts around.
  • Better Merging: We can combine different AI models (like merging a model trained on cats with one trained on dogs) much more easily to create a smarter, more robust AI.
  • Efficiency: It saves us time and computing power by skipping the difficult "search for the right permutation" step.

In short: Bigger models are not just smarter; they are more cooperative. They naturally find a way to work together without needing a middleman.
