pFedMMA: Personalized Federated Fine-Tuning with Multi-Modal Adapter for Vision-Language Models

The paper proposes pFedMMA, a personalized federated learning framework that utilizes multi-modal adapters with a globally shared projection to achieve state-of-the-art trade-offs between personalization and generalization for Vision-Language Models while maintaining communication efficiency.

Sajjad Ghiasvand, Mahnoosh Alizadeh, Ramtin Pedarsani

Published 2026-03-02

Imagine you have a super-smart, world-class chef (the Vision-Language Model, like CLIP) who has learned to cook by tasting millions of dishes from every culture on Earth. This chef is amazing at recognizing a "tomato" or a "dog" just by looking at a picture and reading a description.

However, there's a problem: Privacy and Personalization.

In the real world, we can't just send all our private family photos and medical records to this central chef's kitchen. That's where Federated Learning comes in. Instead of sending the data, we send the chef's notes back and forth. But here's the catch: every family (or "client") has different tastes. One family loves spicy food; another hates cilantro. If the chef tries to make one "perfect" dish for everyone, it ends up tasting bland to everyone.

The Problem with Current Solutions

Previous attempts to solve this were like giving every family a different set of sticky notes (called "Prompts") to tell the chef what to do.

  • The Issue: Some families wrote notes so specific to their own kitchen that the chef forgot how to cook for anyone else. If you asked the chef to cook a dish for a stranger (a "new" class), the chef would get confused because the notes were too weird.
  • The Result: Great at cooking for your family, terrible at cooking for anyone else.

The Solution: pFedMMA (The "Universal Adapter" System)

The authors of this paper propose a new system called pFedMMA. Instead of just sticky notes, they give the chef a set of customizable kitchen gadgets (Adapters) that fit right onto the chef's existing tools.

Here is how it works, using a simple analogy:

1. The Three-Part Gadget

Imagine the gadget has three parts:

  • The Local Handle (Down-Projection): This is unique to your family. It adjusts the gadget to fit your specific kitchen counter height and your favorite spice rack. This part is never shared. It stays private so you can customize your experience perfectly.
  • The Universal Core (Shared Projection): This is the "brain" of the gadget. It's a small, universal connector that helps the gadget understand the difference between a "cat" and a "dog" in a way that makes sense to everyone. This part is shared with the central chef.
  • The Local Spout (Up-Projection): This is another unique part that pours the final sauce exactly how your family likes it. Like the handle, this stays private.
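Dropping the analogy for a moment, the three-part gadget corresponds to a bottleneck adapter whose middle projection is the only shared piece. Here is a minimal numpy sketch of that forward pass. All names and dimensions (`d_model`, `d_down`, `d_mid`), the ReLU, and the residual connection are illustrative assumptions, not details confirmed by the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_down, d_mid = 512, 64, 64  # assumed, illustrative dimensions

class MultiModalAdapter:
    """Hypothetical sketch of a three-part adapter: two local projections
    around one globally shared core."""
    def __init__(self, rng):
        # Local down-projection (the "handle"): stays private on each client.
        self.W_down = rng.standard_normal((d_model, d_down)) * 0.02
        # Shared projection (the "universal core"): the only part sent to the server.
        self.W_shared = rng.standard_normal((d_down, d_mid)) * 0.02
        # Local up-projection (the "spout"): also stays private.
        self.W_up = rng.standard_normal((d_mid, d_model)) * 0.02

    def forward(self, x):
        h = np.maximum(x @ self.W_down, 0.0)  # down-project + ReLU (assumed nonlinearity)
        h = h @ self.W_shared                 # pass through the shared core
        return x + h @ self.W_up              # up-project and add back residually

adapter = MultiModalAdapter(rng)
x = rng.standard_normal((4, d_model))  # a batch of 4 feature vectors
y = adapter.forward(x)
print(y.shape)  # (4, 512)
```

The key design point is that the input and output shapes match the frozen backbone's features, so the adapter can be slotted into an existing encoder without retraining it.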

2. How They Learn Together

In this new system:

  1. Local Training: Every family tweaks their own Handles and Spouts to learn their specific tastes. They do this in their own kitchens without telling anyone else what they are doing.
  2. Global Sharing: Once in a while, they send only the Universal Core to the central chef.
  3. The Magic: The chef mixes all these "Universal Cores" together to create a super-smart, shared understanding of the world. Then, the chef sends this improved core back to everyone.
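The share-and-mix loop above can be sketched as FedAvg-style weighted averaging applied only to the shared core, while the local parts never leave the clients. The client count, dataset sizes, and weighting scheme below are assumptions for illustration, not the paper's exact protocol:

```python
import numpy as np

rng = np.random.default_rng(1)
d_down, d_mid = 64, 64  # assumed core dimensions

# After local training, each client holds its own copy of the shared core
# (random stand-ins here for locally trained weights).
client_cores = [rng.standard_normal((d_down, d_mid)) for _ in range(5)]
client_sizes = [120, 80, 200, 50, 150]  # hypothetical local dataset sizes

# Server step: weighted average over ONLY the shared cores.
total = sum(client_sizes)
global_core = sum(n / total * W for n, W in zip(client_sizes, client_cores))

# Broadcast step: every client overwrites its core with the global one,
# keeping its private down/up projections untouched.
for i in range(len(client_cores)):
    client_cores[i] = global_core.copy()

print(global_core.shape)  # (64, 64)
```

Because only the core crosses the network, each round's payload is a small matrix rather than a full model checkpoint.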

3. Why It's Better

  • Best of Both Worlds: Because you have your own unique handles, the system is great at personalizing for your specific data (Personalization). Because you all share the same "Universal Core," the system learns a common language that works for new things it hasn't seen before (Generalization).
  • Efficient: Since they only send the tiny "Universal Core" back and forth, it's like sending a postcard instead of a whole library. It saves a huge amount of bandwidth and time.
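To make the postcard-versus-library claim concrete, a quick back-of-the-envelope count shows what fraction of the adapter's parameters actually travels each round. The dimensions are the same illustrative assumptions as above, not numbers from the paper:

```python
# Illustrative parameter counts (dimensions are assumptions, not from the paper).
d_model, d_down, d_mid = 512, 64, 64

full_adapter = d_model * d_down + d_down * d_mid + d_mid * d_model  # all three parts
shared_only = d_down * d_mid                                        # what is transmitted

print(full_adapter, shared_only)                       # 69632 4096
print(f"fraction sent: {shared_only / full_adapter:.1%}")  # fraction sent: 5.9%
```

Under these toy dimensions, each round transmits under 6% of the adapter's parameters, and a vanishingly small fraction of the frozen backbone, which is never sent at all.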

The Real-World Result

The paper tested this on 11 different datasets (like recognizing flowers, pets, food, and textures).

  • Old Methods: Either knew your family's taste perfectly but failed with strangers, or knew how to cook for everyone but tasted bland to your family.
  • pFedMMA: It found the perfect balance. It learned your family's specific preferences and still knew how to cook a delicious meal for a stranger walking through the door.

In a Nutshell

pFedMMA is like giving every user a custom-fitted suit (the local parts) that is tailored to their body, but all the suits are made using the same high-quality fabric pattern (the shared part) that ensures they all look good and fit the general style of the world. It allows AI to be both deeply personal and broadly smart, without needing to share private data.
