A Step Toward Federated Pretraining of Multimodal Large Language Models

This paper introduces Fed-CMP, a pioneering framework for federated pretraining of Multimodal Large Language Models that addresses parameter interference and gradient oscillations through Canonical Reliability-Aware Aggregation and Orthogonality-Preserved Momentum to enable efficient collaborative training of cross-modal projectors.

Baochen Xiong, Yifan Xu, Xiaoshan Yang, Yaguang Song, Yaowei Wang, Changsheng Xu

Published 2026-03-31

Imagine you are trying to teach a brilliant, super-smart robot (a Multimodal Large Language Model, or MLLM) to understand the world. Right now, this robot is great at reading text and looking at pictures, but it's hitting a wall.

The Problem: The "Data Famine"

Think of the robot's brain as a student. To learn, it needs to read millions of books and look at millions of photos.

  • The Issue: All the "good" public books and photos are already used up. The student has read everything available in the public library.
  • The Hidden Treasure: Meanwhile, there are billions more photos and stories locked away in private places: your phone, your doctor's office, your bank's servers. These are "data silos."
  • The Rule: You can't just grab all these private files and put them in one big pile (centralized training) because of privacy laws. It's like trying to read everyone's diary without their permission.

The Old Solution: Federated Learning (The "Group Project")

To solve this, scientists use Federated Learning. Imagine a teacher who doesn't collect the students' diaries. Instead, the teacher sends a "study guide" to every student. Each student studies their own private diary at home, takes notes, and sends only the notes back to the teacher. The teacher combines the notes to make a better guide for the next round.
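The "group project" above is the classic federated averaging loop. Here is a minimal toy sketch of one round (the linear model, function names, and learning rate are illustrative, not from the paper):

```python
import numpy as np

def local_update(global_weights, local_data, lr=0.1):
    """Each client refines the global model on its own private data.
    The 'model' here is a toy linear layer with a squared-error gradient."""
    w = global_weights.copy()
    for x, y in local_data:
        pred = w @ x
        grad = (pred - y) * x
        w -= lr * grad
    return w

def federated_round(global_weights, clients):
    """The server ('teacher') averages the clients' updated weights
    ('notes'), weighting each client by how much data it has -- this is
    standard FedAvg. No raw data ever leaves a client."""
    sizes = np.array([len(data) for data in clients], dtype=float)
    updates = np.stack([local_update(global_weights, data) for data in clients])
    return (sizes[:, None] * updates).sum(axis=0) / sizes.sum()
```

Note that the server only ever sees each client's weights, never the `(x, y)` pairs themselves; that is the entire privacy argument of the scheme.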

But here's the catch:
Most research has only used this method for fine-tuning (polishing a robot that already knows how to think). No one has successfully applied it to the pre-training phase (teaching the robot how to think in the first place). Why? Because pre-training is messy.

  1. The "Confused Translator" Problem: Different students (clients) have very different diaries. One student only has pictures of cats; another only has pictures of cars. When they send their notes back, they are trying to translate "Cat" and "Car" into the robot's language in completely different ways. If you just mix their notes together, the robot gets confused and starts speaking gibberish.
  2. The "One-Shot" Problem: In this new setup, students only get to read their private diary once. They can't go back and re-read a page to fix a mistake. This causes the robot's learning to wobble back and forth (oscillate), forgetting what it learned yesterday because today's data is so different.

The New Solution: Fed-CMP (The "Smart Group Project")

The authors of this paper propose a new framework called Fed-CMP. Think of it as a super-smart teacher who knows exactly how to manage this chaotic group project. They use two clever tricks:

Trick 1: The "Universal Translator" (Canonical Reliability-Aware Aggregation)

Instead of just averaging the students' notes (which causes confusion), the teacher creates a Universal Translator.

  • Imagine the teacher says, "Let's all agree that 'Cat' means 'Furry Animal' and 'Car' means 'Fast Machine'." This is the Shared Alignment Basis.
  • Now, when a student sends their notes, they don't send the whole definition. They just send a small note saying, "I have 5 extra examples of Furry Animals" or "I have 2 extra examples of Fast Machines." These are the Client-Specific Coefficients.
  • The Safety Check: The teacher also checks who is reliable. If a student is sending notes that are messy or don't match the Universal Translator, the teacher ignores them or gives them less weight. This stops the "confused translator" problem.
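Stripped of the analogy, the idea is: clients describe their update as a few coefficients over a shared basis, and the server down-weights clients whose updates the basis cannot explain. The sketch below is one plausible reading; the function names, the least-squares projection, and the residual-based reliability score are illustrative assumptions, not the paper's exact formulas:

```python
import numpy as np

def client_coefficients(update, basis):
    """Express a client's full update as coefficients over the shared
    alignment basis (columns of `basis` are orthonormal directions).
    Only these small coefficients are sent, not the whole update."""
    return basis.T @ update

def reliability(coeffs, basis, update):
    """Hypothetical reliability score: updates the shared basis cannot
    reconstruct (large residual) get a low weight -- the 'safety check'."""
    residual = np.linalg.norm(update - basis @ coeffs)
    return 1.0 / (1.0 + residual)

def aggregate(updates, basis):
    """Reliability-weighted average in coefficient space, then map the
    averaged coefficients back into weight space."""
    coeffs = [client_coefficients(u, basis) for u in updates]
    weights = np.array([reliability(c, basis, u) for c, u in zip(coeffs, updates)])
    weights /= weights.sum()
    avg_coeffs = sum(w * c for w, c in zip(weights, coeffs))
    return basis @ avg_coeffs
```

Because every client's contribution is expressed in the same basis before averaging, a "cat" client and a "car" client can no longer overwrite each other's translation conventions.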

Trick 2: The "Steady Hand" (Orthogonality-Preserved Momentum)

Remember the "One-Shot" problem where the robot wobbles because it can't re-read?

  • Imagine the robot is walking a tightrope. Every time a new student sends notes, it tries to take a giant step. Because the notes are so different, the robot stumbles.
  • The teacher gives the robot a Steady Hand (Momentum). Instead of reacting wildly to every single new note, the robot looks at the average direction everyone has been walking over time.
  • The Twist: The teacher ensures the robot doesn't lose its balance while doing this. They use a special mathematical trick (Orthogonal Projection) to make sure the robot's "sense of direction" stays perfectly straight, even as it smooths out the bumps. This prevents the robot from forgetting what it learned earlier (Catastrophic Forgetting).
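In code, "momentum plus a straightness guarantee" might look like the sketch below. This is one plausible interpretation in the spirit of orthogonal-gradient methods (the conflict test, the projection rule, and the hyperparameters are assumptions for illustration; the paper's exact projection may differ):

```python
import numpy as np

def orthogonal_momentum_step(weights, update, momentum, beta=0.9, lr=0.05):
    """Smooth round-to-round updates with momentum, but first strip the
    component of the new update that points *against* the accumulated
    direction, so earlier progress is not undone (the 'steady hand')."""
    norm = np.linalg.norm(momentum)
    if norm > 1e-12:
        direction = momentum / norm
        conflict = update @ direction
        if conflict < 0:  # only remove the opposing component
            update = update - conflict * direction
    momentum = beta * momentum + (1 - beta) * update
    return weights - lr * momentum, momentum
```

The key property: even when a new client's "notes" point backwards, the projected update is at worst orthogonal to the accumulated direction, so the model never reverses along what it has already learned.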

The Result

The authors tested Fed-CMP on four different "group projects": federated benchmarks with different kinds of messy, non-identical client data.

  • The Result: Fed-CMP converged faster and scored higher than all the older federated methods. It successfully taught the robot to understand the world using private data without the server ever seeing that data itself.

Why This Matters

This is a big step forward. It means we can build super-intelligent robots that learn from everyone's private data (your photos, medical records, etc.) without ever violating your privacy. It unlocks a massive amount of knowledge that was previously locked away, helping AI become smarter and more useful for everyone.