A Step Toward Federated Pretraining of Multimodal Large Language Models

This paper introduces Fed-CMP, a pioneering framework for federated pretraining of Multimodal Large Language Models that addresses parameter interference and gradient oscillations through Canonical Reliability-Aware Aggregation and Orthogonality-Preserved Momentum to enable efficient collaborative training of cross-modal projectors.

Baochen Xiong, Yifan Xu, Xiaoshan Yang, Yaguang Song, Yaowei Wang, Changsheng Xu

Published 2026-03-31

Imagine you are trying to teach a brilliant, super-smart robot (a Multimodal Large Language Model, or MLLM) to understand the world. Right now, this robot is great at reading text and looking at pictures, but it's hitting a wall.

The Problem: The "Data Famine"

Think of the robot's brain as a student. To learn, it needs to read millions of books and look at millions of photos.

  • The Issue: All the "good" public books and photos are already used up. The student has read everything available in the public library.
  • The Hidden Treasure: Meanwhile, there are billions more photos and stories locked away in private places: your phone, your doctor's office, your bank's servers. These are "data silos."
  • The Rule: You can't just grab all these private files and put them in one big pile (centralized training) because of privacy laws. It's like trying to read everyone's diary without their permission.

The Old Solution: Federated Learning (The "Group Project")

To solve this, scientists use Federated Learning. Imagine a teacher who doesn't collect the students' diaries. Instead, the teacher sends a "study guide" to every student. Each student studies their own private diary at home, takes notes, and sends only the notes back to the teacher. The teacher combines the notes to make a better guide for the next round.
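The "group project" above is the classic federated averaging loop. Here is a minimal toy sketch of one round (the linear model, function names, and learning rate are illustrative, not from the paper):

```python
import numpy as np

def local_update(global_weights, local_data, lr=0.1):
    """Each client refines the global model on its own private data.
    The 'model' here is a toy linear layer with a squared-error gradient."""
    w = global_weights.copy()
    for x, y in local_data:
        pred = w @ x
        grad = (pred - y) * x
        w -= lr * grad
    return w

def federated_round(global_weights, clients):
    """The server ('teacher') averages the clients' updated weights
    ('notes'), weighting each client by how much data it has -- this is
    standard FedAvg. No raw data ever leaves a client."""
    sizes = np.array([len(data) for data in clients], dtype=float)
    updates = np.stack([local_update(global_weights, data) for data in clients])
    return (sizes[:, None] * updates).sum(axis=0) / sizes.sum()
```

Note that the server only ever sees each client's weights, never the `(x, y)` pairs themselves; that is the entire privacy argument of the scheme.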

But here's the catch:
Most research has only used this method for fine-tuning (polishing a robot that already knows how to think). No one has successfully applied it to the pre-training phase (teaching the robot how to think in the first place). Why? Because pre-training is messy.

  1. The "Confused Translator" Problem: Different students (clients) have very different diaries. One student only has pictures of cats; another only has pictures of cars. When they send their notes back, they are trying to translate "Cat" and "Car" into the robot's language in completely different ways. If you just mix their notes together, the robot gets confused and starts speaking gibberish.
  2. The "One-Shot" Problem: In this new setup, students only get to read their private diary once. They can't go back and re-read a page to fix a mistake. This causes the robot's learning to wobble back and forth (oscillate), forgetting what it learned yesterday because today's data is so different.

The New Solution: Fed-CMP (The "Smart Group Project")

The authors of this paper propose a new framework called Fed-CMP. Think of it as a super-smart teacher who knows exactly how to manage this chaotic group project. They use two clever tricks:

Trick 1: The "Universal Translator" (Canonical Reliability-Aware Aggregation)

Instead of just averaging the students' notes (which causes confusion), the teacher creates a Universal Translator.

  • Imagine the teacher says, "Let's all agree that 'Cat' means 'Furry Animal' and 'Car' means 'Fast Machine'." This is the Shared Alignment Basis.
  • Now, when a student sends their notes, they don't send the whole definition. They just send a small note saying, "I have 5 extra examples of Furry Animals" or "I have 2 extra examples of Fast Machines." These are the Client-Specific Coefficients.
  • The Safety Check: The teacher also checks who is reliable. If a student is sending notes that are messy or don't match the Universal Translator, the teacher ignores them or gives them less weight. This stops the "confused translator" problem.
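Stripped of the analogy, the idea is: clients describe their update as a few coefficients over a shared basis, and the server down-weights clients whose updates the basis cannot explain. The sketch below is one plausible reading; the function names, the least-squares projection, and the residual-based reliability score are illustrative assumptions, not the paper's exact formulas:

```python
import numpy as np

def client_coefficients(update, basis):
    """Express a client's full update as coefficients over the shared
    alignment basis (columns of `basis` are orthonormal directions).
    Only these small coefficients are sent, not the whole update."""
    return basis.T @ update

def reliability(coeffs, basis, update):
    """Hypothetical reliability score: updates the shared basis cannot
    reconstruct (large residual) get a low weight -- the 'safety check'."""
    residual = np.linalg.norm(update - basis @ coeffs)
    return 1.0 / (1.0 + residual)

def aggregate(updates, basis):
    """Reliability-weighted average in coefficient space, then map the
    averaged coefficients back into weight space."""
    coeffs = [client_coefficients(u, basis) for u in updates]
    weights = np.array([reliability(c, basis, u) for c, u in zip(coeffs, updates)])
    weights /= weights.sum()
    avg_coeffs = sum(w * c for w, c in zip(weights, coeffs))
    return basis @ avg_coeffs
```

Because every client's contribution is expressed in the same basis before averaging, a "cat" client and a "car" client can no longer overwrite each other's translation conventions.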

Trick 2: The "Steady Hand" (Orthogonality-Preserved Momentum)

Remember the "One-Shot" problem where the robot wobbles because it can't re-read?

  • Imagine the robot is walking a tightrope. Every time a new student sends notes, it tries to take a giant step. Because the notes are so different, the robot stumbles.
  • The teacher gives the robot a Steady Hand (Momentum). Instead of reacting wildly to every single new note, the robot looks at the average direction everyone has been walking over time.
  • The Twist: The teacher ensures the robot doesn't lose its balance while doing this. They use a special mathematical trick (Orthogonal Projection) to make sure the robot's "sense of direction" stays perfectly straight, even as it smooths out the bumps. This prevents the robot from forgetting what it learned earlier (Catastrophic Forgetting).
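In code, "momentum plus a straightness guarantee" might look like the sketch below. This is one plausible interpretation in the spirit of orthogonal-gradient methods (the conflict test, the projection rule, and the hyperparameters are assumptions for illustration; the paper's exact projection may differ):

```python
import numpy as np

def orthogonal_momentum_step(weights, update, momentum, beta=0.9, lr=0.05):
    """Smooth round-to-round updates with momentum, but first strip the
    component of the new update that points *against* the accumulated
    direction, so earlier progress is not undone (the 'steady hand')."""
    norm = np.linalg.norm(momentum)
    if norm > 1e-12:
        direction = momentum / norm
        conflict = update @ direction
        if conflict < 0:  # only remove the opposing component
            update = update - conflict * direction
    momentum = beta * momentum + (1 - beta) * update
    return weights - lr * momentum, momentum
```

The key property: even when a new client's "notes" point backwards, the projected update is at worst orthogonal to the accumulated direction, so the model never reverses along what it has already learned.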

The Result

The authors tested Fed-CMP on four different "group projects": federated benchmarks with different kinds of messy, non-identical client data.

  • The Result: Fed-CMP converged faster and scored higher than all the older federated methods. It successfully taught the robot to understand the world using private data without the server ever seeing that data itself.

Why This Matters

This is a big step forward. It means we can build super-intelligent robots that learn from everyone's private data (your photos, medical records, etc.) without ever violating your privacy. It unlocks a massive amount of knowledge that was previously locked away, helping AI become smarter and more useful for everyone.