Replacing Parameters with Preferences: Federated Alignment of Heterogeneous Vision-Language Models

This paper proposes MoR, a federated alignment framework that replaces parameter exchange with preference-based learning using a Mixture-of-Rewards mechanism and GRPO to effectively align heterogeneous Vision-Language Models while preserving data privacy and accommodating diverse client constraints.

Shule Lu, Yujing Wang, Hainan Zhang, Xiaoshan Yang, Hongwei Zheng, Yongxin Tong, Changsheng Xu, Zhiming Zheng

Published 2026-03-06

The Big Picture: The "Secret Recipe" Problem

Imagine you want to train a super-smart robot chef (a Vision-Language Model) to cook amazing meals. This robot needs to learn from thousands of different recipes and cooking styles.

However, there's a problem: The best recipes are locked in the private kitchens of different people (hospitals, banks, private chefs). They can't share their actual ingredients or secret recipes (data) because of privacy laws.

The Old Way (Federated Learning):
In the past, to solve this, everyone would send their entire cookbook (the model's parameters) to a central hub to be mixed together.

  • The Problem: This is like trying to mix 100 different cookbooks into one giant book. If one person has a tiny, messy notebook (a weak computer or a bad model), it drags the whole book down. Sending entire cookbooks is also heavy and slow, and if someone steals a page, they might reverse-engineer the original recipes.

The New Way (MoR - The Paper's Solution):
Instead of sharing the cookbooks, everyone shares feedback on how good a dish tastes.

  • The Analogy: Imagine the central hub sends out a "test dish" to everyone.
    • The Hospital Chef says, "This is great, but it needs to be more sterile and precise."
    • The Art Chef says, "This is beautiful, but the colors are wrong."
    • The Budget Chef says, "This is too expensive to make."
  • They don't send their secret recipes. They just send a score or a preference (e.g., "I prefer the sterile version").

The paper proposes a system called MoR (Mixture-of-Rewards) that acts like a Smart Tasting Panel.


How MoR Works: The "Smart Tasting Panel"

The system has three main parts, which we can think of as a restaurant management team:

1. The Local Critics (Client Reward Models)

Each client (hospital, bank, etc.) trains their own local "Critic."

  • What they do: They taste the robot's test dishes and give a score based on their specific needs.
  • Why it's good: The Hospital Critic knows exactly what "medical accuracy" looks like. The Art Critic knows "visual beauty." They don't need to share their private data; they just learn what they like.
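In code, a local critic boils down to a model trained only on on-site preference pairs, so raw data never leaves the client. Here is a minimal NumPy sketch, assuming a linear reward model trained with a Bradley-Terry-style pairwise loss; the feature vectors, dimensions, and learning rate are illustrative, not from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def pairwise_loss_grad(w, x_pref, x_rej):
    # Bradley-Terry style: P(preferred beats rejected) = sigmoid(w·x_pref - w·x_rej)
    margin = x_pref @ w - x_rej @ w
    p = 1.0 / (1.0 + np.exp(-margin))
    grad = -(1.0 - p) * (x_pref - x_rej)  # gradient of -log P w.r.t. w
    return grad

# Hypothetical local data: each (preferred, rejected) pair of answer
# feature vectors stays on the client; only scores will ever be shared.
dim = 8
w = np.zeros(dim)
pairs = [(rng.normal(size=dim) + 1.0, rng.normal(size=dim)) for _ in range(200)]

lr = 0.1
for _ in range(50):
    for x_pref, x_rej in pairs:
        w -= lr * pairwise_loss_grad(w, x_pref, x_rej)

# The trained critic should score the preferred answers higher on average.
scores_pref = np.mean([x @ w for x, _ in pairs])
scores_rej = np.mean([x @ w for _, x in pairs])
print(scores_pref > scores_rej)  # expect True
```

The same recipe applies whether the "features" come from a hospital's radiology answers or a bank's document QA: the critic only learns what its owner prefers.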

2. The Smart Waiter (The Routing Network)

This is the magic part. In the old days, the central hub would just take the average of all the critics' scores (like averaging a 1-star review with a 5-star review). That's bad because it confuses the robot.

  • What MoR does: It uses a Smart Waiter (a routing network).
  • How it works: When the robot tries to answer a question about a medical X-ray, the Smart Waiter looks at the question and says, "Hey, this is medical! Let's listen to the Hospital Critic and ignore the Art Critic."
  • If the question is about OCR (reading text on a sign), the Waiter says, "Let's listen to the Text Expert."
  • The Result: The robot gets the right advice for the right situation, without mixing up incompatible opinions.
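The Smart Waiter can be sketched as a small gating network: it looks at features of the query, produces one weight per critic via a softmax, and mixes the critics' scores with those weights. Below is a toy NumPy sketch where the gate matrix is hand-crafted to stand in for a learned routing network (the critic names and feature layout are illustrative assumptions):

```python
import numpy as np

def softmax(z):
    z = z - z.max()  # numerical stability
    e = np.exp(z)
    return e / e.sum()

def mixture_of_rewards(query_feat, gate_W, critic_scores):
    """Weight each client critic's score by a query-dependent gate."""
    weights = softmax(gate_W @ query_feat)  # one weight per critic
    return float(weights @ critic_scores), weights

# Hypothetical gate over 3 critics (medical, art, OCR) and 4 query features.
gate_W = np.array([
    [5.0, 0.0, 0.0, 0.0],   # medical critic fires on feature 0
    [0.0, 5.0, 0.0, 0.0],   # art critic fires on feature 1
    [0.0, 0.0, 5.0, 0.0],   # OCR critic fires on feature 2
])

medical_query = np.array([1.0, 0.0, 0.0, 0.2])
scores = np.array([0.9, 0.1, 0.4])  # each critic's score for one answer

reward, weights = mixture_of_rewards(medical_query, gate_W, scores)
print(weights.argmax())  # the medical critic dominates for this query
```

Because the weights depend on the query, a disagreement between critics (0.9 vs 0.1 here) no longer averages out to a muddled signal; the relevant expert's opinion carries the reward.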

3. The Head Chef (The Base Model)

The robot (the Head Chef) listens to the Smart Waiter's advice. Using a reinforcement learning method called GRPO, it adjusts its cooking to make the dishes that the specific expert critic likes best. Over time, the robot becomes a master at everything because it knows exactly who to listen to for every specific task.
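The abstract names GRPO as the optimization method. Its core trick is to sample a group of answers to the same prompt, score each with the (mixture) reward, and normalize each reward against the group's own mean and standard deviation. A minimal sketch of that advantage computation, with illustrative reward values:

```python
import numpy as np

def grpo_advantages(group_rewards):
    """Group-relative advantages in the GRPO style: normalize each sampled
    answer's reward against its own group's mean and std."""
    r = np.asarray(group_rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)

# Hypothetical mixture rewards for 4 sampled answers to one prompt.
rewards = [0.9, 0.2, 0.5, 0.4]
adv = grpo_advantages(rewards)
print(adv.round(2))
# Answers scored above the group mean get positive advantage and are
# reinforced; those below get negative advantage and are discouraged.
```

Note that advantages are relative within a group, so no separate value network is needed; the mixture reward alone steers the policy update.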


Why Is This Better? (The Benefits)

  1. Privacy is King: No one ever sees the raw data (the secret ingredients). They only share "I liked this, I didn't like that."
  2. No "One Size Fits All" Failures: In the old method, if you had a weak computer or a bad model in the mix, it dragged everyone down (the "bucket effect," where the weakest link sets the limit). In MoR, the Smart Waiter simply ignores the weak critic for that specific task.
  3. Speed and Efficiency: Sending a "score" or a "preference" is like sending a text message. Sending the whole model is like shipping a truckload of bricks. MoR is much faster and cheaper.
  4. Adaptability: The Smart Waiter learns on the fly. If the robot starts making a new kind of mistake, the Waiter learns to switch critics faster to fix it.

The "Aha!" Moment

The paper argues that we are currently stuck in an era where we try to merge everyone's brains (parameters) to make a better AI. But in a world of privacy and different needs, we should instead merge everyone's opinions (preferences).

Think of it like this:

  • Old Way: Trying to merge the DNA of a fish, a bird, and a human to make a super-creature. It's messy and often fails.
  • MoR Way: Having a fish, a bird, and a human stand in a room. When you need to swim, you ask the fish. When you need to fly, you ask the bird. When you need to talk, you ask the human. You don't merge them; you route the question to the right expert.

Conclusion

MoR is a new framework that lets different organizations train a powerful AI together without sharing their private data. It uses a "Smart Waiter" to decide which expert's opinion matters most for each specific question. This makes the AI smarter, safer, and more respectful of privacy, especially when dealing with complex tasks like medical diagnosis or financial analysis.
