Imagine a world where everyone has a unique, personal AI assistant. Some assistants live on powerful supercomputers, while others run on small, energy-efficient phones. Some are experts at analyzing medical scans, while others are great at writing poetry or understanding fashion trends.
Currently, if these assistants want to learn from each other, they face two big problems:
- The "Language Barrier" (Model Heterogeneity): They speak different "languages" (different software architectures). A model built for a phone can't easily share its brain with a model built for a supercomputer because their internal structures don't match.
- The "Different Interests" Problem (Data Heterogeneity): They are learning totally different things. If you try to force a fashion expert and a medical expert to merge their brains into one single "average" brain, both end up confused and bad at their jobs.
This paper introduces FedMosaic, a new system that solves both problems, allowing these diverse AI assistants to collaborate without ever sharing their private data. Here is how it works, using simple analogies:
1. The Problem: The "One-Size-Fits-All" Failure
Traditional methods try to take all the AI models, smash them together, and average them out.
- Analogy: Imagine trying to make a smoothie by blending a heavy stone, a feather, and a glass of water. The result is a useless mess.
- Reality: When AI models trained on different tasks (like medical vs. fashion) are averaged, they interfere with each other. The "medical" knowledge cancels out the "fashion" knowledge, and everyone gets worse.
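The "smoothie" failure can be seen in a three-line toy example. This is a minimal sketch with made-up numbers (not anything from the paper): when two clients' updates point in roughly opposite directions in parameter space, naive averaging nearly cancels them both.

```python
import numpy as np

# Toy illustration: two "expert" weight updates that point in
# nearly opposite directions in parameter space.
medical_update = np.array([ 1.0, -0.8,  0.5])
fashion_update = np.array([-0.9,  0.7, -0.6])

# Naive federated averaging (FedAvg-style) blends the two updates.
averaged = (medical_update + fashion_update) / 2

print(averaged)                        # each component is nearly cancelled
print(np.linalg.norm(averaged))       # far smaller than either update alone
print(np.linalg.norm(medical_update))
```

The averaged update is close to zero, so neither the medical nor the fashion client gets a useful model back.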
2. The Solution: FedMosaic
The authors propose a system called FedMosaic (like a mosaic, where many differently colored tiles come together to form one picture). It has two main tools to fix the problems:
Tool A: The "Smart Matchmaker" (RELA)
The Problem: How do we decide who should share knowledge with whom?
The Solution: Instead of forcing everyone to talk to everyone, the system acts like a smart matchmaker.
- How it works: Before sharing, the system checks the "gradients" (which are like the AI's internal notes on what it's learning). It asks, "Is Client A learning something similar to Client B?"
- The Analogy: Imagine a library. If you are studying Biology, you don't want to borrow books from the Cooking section just because they happen to be in the same building.
- The Result: The system creates a customized global model for each client. If you are a fashion AI, you only get advice from other fashion AIs; if you are a medical AI, you get advice from other medical AIs. This prevents the "smoothie" problem.
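The matchmaking idea can be sketched in a few lines. This is a minimal illustration of similarity-weighted aggregation, not the paper's actual RELA algorithm; the client names, gradient values, and the softmax weighting scheme are my own assumptions.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two gradient vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical per-client gradient summaries: two "fashion" clients whose
# gradients roughly agree, and one "medical" client pointing elsewhere.
grads = {
    "fashion_1": np.array([ 1.0,  0.9, -0.2]),
    "fashion_2": np.array([ 0.9,  1.0, -0.1]),
    "medical_1": np.array([-0.8,  0.1,  1.0]),
}

def aggregation_weights(client, grads, tau=0.5):
    """Weight each peer by gradient similarity (softmax with temperature tau),
    so every client gets its own personalized mixture of peers."""
    sims = {c: cosine(grads[client], g) for c, g in grads.items()}
    exps = {c: np.exp(s / tau) for c, s in sims.items()}
    total = sum(exps.values())
    return {c: e / total for c, e in exps.items()}

w = aggregation_weights("fashion_1", grads)
# fashion_1 leans heavily on itself and fashion_2, barely on medical_1
print(w)
```

The fashion client's personalized mixture gives the medical client only a tiny weight, so dissimilar knowledge never gets blended in.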
Tool B: The "Universal Translator" (Co-LoRA)
The Problem: Even if two AIs want to share, they might be built differently (e.g., one has 1 billion parameters, the other has 3 billion). They can't just swap their brains because the "slots" for the information don't line up.
The Solution: They introduce Co-LoRA (Collaborative Low-Rank Adaptation).
- How it works: Instead of trying to swap the whole brain, they only swap tiny, specific "notebooks" (modules) that are the same size for everyone, regardless of how big their brain is.
- The Analogy: Imagine two people trying to share a secret. One is a giant, the other is a dwarf. They can't swap their entire bodies. But, they can both carry a standard-sized notepad (the Co-LoRA module). The giant writes his secret on the notepad, and the dwarf reads it. The notepad is small enough for the dwarf to carry and simple enough for the giant to write on. They can share knowledge without needing to be the same size.
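One way to picture the "standard-sized notepad" in code. This is a toy construction under my own assumptions (a shared fixed-size core sandwiched between private, model-sized maps), not the paper's exact Co-LoRA design; the class and variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
RANK = 4  # the shared "notepad" size, identical for every client

class CoLoRAClient:
    """Toy sketch: private low-rank maps sized to each client's own model,
    plus a fixed-size core that any two clients can exchange and average,
    no matter how big their base models are."""
    def __init__(self, hidden_dim):
        self.A = rng.normal(size=(RANK, hidden_dim)) * 0.1   # private, model-sized
        self.B = rng.normal(size=(hidden_dim, RANK)) * 0.1   # private, model-sized
        self.core = rng.normal(size=(RANK, RANK))            # shared, fixed-size

    def delta_weights(self):
        # The low-rank update applied to this client's own weight matrix
        return self.B @ self.core @ self.A

phone = CoLoRAClient(hidden_dim=64)     # small on-device model
server = CoLoRAClient(hidden_dim=512)   # large datacenter model

# Knowledge sharing: only the fixed-size cores are exchanged and averaged
shared = (phone.core + server.core) / 2
phone.core, server.core = shared, shared

print(phone.delta_weights().shape)    # fits the small model
print(server.delta_weights().shape)   # fits the large model
```

The 4x4 core plays the role of the notepad: the giant and the dwarf each translate it into an update that fits their own body.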
3. The New Playground: DRAKE
To prove their idea works, the authors built a new testing ground called DRAKE.
- The Analogy: Previous tests were like a classroom where every student had the same textbook but different colored pens. DRAKE is a giant, chaotic festival where:
  - Some students have giant tablets, others have tiny phones.
  - Some are learning to identify cats, others are learning to translate ancient languages, and others are learning to diagnose diseases.
  - The tasks change over time (like a festival where the music genre switches every hour).
- Why it matters: This mimics the real world much better than previous tests, proving that FedMosaic works in messy, realistic scenarios.
The Results
When the authors tested FedMosaic on this chaotic festival (the DRAKE benchmark):
- Personalization: Each AI got better at its own specific job (e.g., the fashion AI got better at fashion).
- Generalization: They also got better at other jobs they hadn't seen before, because they learned how to learn from their neighbors.
- Efficiency: They did this without sending massive amounts of data or requiring everyone to have the same hardware.
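The efficiency point comes down to simple arithmetic: low-rank adapter modules are orders of magnitude smaller than full models. The numbers below are illustrative values of my own, not figures from the paper.

```python
# Toy communication-cost comparison (illustrative numbers only):
# sharing a full 1B-parameter model vs. a small low-rank adapter.
hidden, layers, rank = 2048, 24, 8

full_model_params = 1_000_000_000
# One low-rank pair (A: rank x hidden, B: hidden x rank) per adapted layer
adapter_params = layers * 2 * rank * hidden

print(adapter_params)                      # 786432 parameters to upload
print(full_model_params / adapter_params)  # roughly 1,270x smaller
```

Even with generous settings, each round of sharing moves well under a thousandth of the full model's weight.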
Summary
FedMosaic is like a global networking event for AI assistants.
- It uses a Smart Matchmaker to ensure you only talk to people who speak your "language" (similar tasks).
- It uses a Universal Translator (Co-LoRA) so that a tiny phone AI can share secrets with a giant supercomputer AI, even if they are built differently.
- It proves that by working together this way, everyone becomes smarter, faster, and more personal, all while keeping their private data safe at home.