Replacing Parameters with Preferences: Federated Alignment of Heterogeneous Vision-Language Models

This paper proposes MoR, a federated alignment framework that replaces parameter exchange with preference-based learning using a Mixture-of-Rewards mechanism and GRPO to effectively align heterogeneous Vision-Language Models while preserving data privacy and accommodating diverse client constraints.

Shule Lu, Yujing Wang, Hainan Zhang, Xiaoshan Yang, Hongwei Zheng, Yongxin Tong, Changsheng Xu, Zhiming Zheng

Published 2026-03-06

The Big Picture: The "Secret Recipe" Problem

Imagine you want to train a super-smart robot chef (a Vision-Language Model) to cook amazing meals. This robot needs to learn from thousands of different recipes and cooking styles.

However, there's a problem: The best recipes are locked in the private kitchens of different people (hospitals, banks, private chefs). They can't share their actual ingredients or secret recipes (data) because of privacy laws.

The Old Way (Federated Learning):
In the past, to solve this, everyone would send their entire cookbook (the model's parameters) to a central hub to be mixed together.

  • The Problem: This is like trying to mix 100 different cookbooks into one giant book. If one person has a tiny, messy notebook (a weak computer or a bad model), it drags the whole book down. Sending entire cookbooks is also heavy and slow, and if someone steals a page, they might reverse-engineer the original recipes.

The New Way (MoR - The Paper's Solution):
Instead of sharing the cookbooks, everyone shares feedback on how good a dish tastes.

  • The Analogy: Imagine the central hub sends out a "test dish" to everyone.
    • The Hospital Chef says, "This is great, but it needs to be more sterile and precise."
    • The Art Chef says, "This is beautiful, but the colors are wrong."
    • The Budget Chef says, "This is too expensive to make."
  • They don't send their secret recipes. They just send a score or a preference (e.g., "I prefer the sterile version").

The paper proposes a system called MoR (Mixture-of-Rewards) that acts like a Smart Tasting Panel.


How MoR Works: The "Smart Tasting Panel"

The system has three main parts, which we can think of as a restaurant management team:

1. The Local Critics (Client Reward Models)

Each client (hospital, bank, etc.) trains their own local "Critic."

  • What they do: They taste the robot's test dishes and give a score based on their specific needs.
  • Why it's good: The Hospital Critic knows exactly what "medical accuracy" looks like. The Art Critic knows "visual beauty." They don't need to share their private data; they just learn what they like.
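In code, a local critic boils down to a model trained only on on-site preference pairs, so raw data never leaves the client. Here is a minimal NumPy sketch, assuming a linear reward model trained with a Bradley-Terry-style pairwise loss; the feature vectors, dimensions, and learning rate are illustrative, not from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def pairwise_loss_grad(w, x_pref, x_rej):
    # Bradley-Terry style: P(preferred beats rejected) = sigmoid(w·x_pref - w·x_rej)
    margin = x_pref @ w - x_rej @ w
    p = 1.0 / (1.0 + np.exp(-margin))
    grad = -(1.0 - p) * (x_pref - x_rej)  # gradient of -log P w.r.t. w
    return grad

# Hypothetical local data: each (preferred, rejected) pair of answer
# feature vectors stays on the client; only scores will ever be shared.
dim = 8
w = np.zeros(dim)
pairs = [(rng.normal(size=dim) + 1.0, rng.normal(size=dim)) for _ in range(200)]

lr = 0.1
for _ in range(50):
    for x_pref, x_rej in pairs:
        w -= lr * pairwise_loss_grad(w, x_pref, x_rej)

# The trained critic should score the preferred answers higher on average.
scores_pref = np.mean([x @ w for x, _ in pairs])
scores_rej = np.mean([x @ w for _, x in pairs])
print(scores_pref > scores_rej)  # expect True
```

The same recipe applies whether the "features" come from a hospital's radiology answers or a bank's document QA: the critic only learns what its owner prefers.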

2. The Smart Waiter (The Routing Network)

This is the magic part. In the old days, the central hub would just take the average of all the critics' scores (like averaging a 1-star review with a 5-star review). That's bad because it confuses the robot.

  • What MoR does: It uses a Smart Waiter (a routing network).
  • How it works: When the robot tries to answer a question about a medical X-ray, the Smart Waiter looks at the question and says, "Hey, this is medical! Let's listen to the Hospital Critic and ignore the Art Critic."
  • If the question is about OCR (reading text on a sign), the Waiter says, "Let's listen to the Text Expert."
  • The Result: The robot gets the right advice for the right situation, without mixing up incompatible opinions.
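The Smart Waiter can be sketched as a small gating network: it looks at features of the query, produces one weight per critic via a softmax, and mixes the critics' scores with those weights. Below is a toy NumPy sketch where the gate matrix is hand-crafted to stand in for a learned routing network (the critic names and feature layout are illustrative assumptions):

```python
import numpy as np

def softmax(z):
    z = z - z.max()  # numerical stability
    e = np.exp(z)
    return e / e.sum()

def mixture_of_rewards(query_feat, gate_W, critic_scores):
    """Weight each client critic's score by a query-dependent gate."""
    weights = softmax(gate_W @ query_feat)  # one weight per critic
    return float(weights @ critic_scores), weights

# Hypothetical gate over 3 critics (medical, art, OCR) and 4 query features.
gate_W = np.array([
    [5.0, 0.0, 0.0, 0.0],   # medical critic fires on feature 0
    [0.0, 5.0, 0.0, 0.0],   # art critic fires on feature 1
    [0.0, 0.0, 5.0, 0.0],   # OCR critic fires on feature 2
])

medical_query = np.array([1.0, 0.0, 0.0, 0.2])
scores = np.array([0.9, 0.1, 0.4])  # each critic's score for one answer

reward, weights = mixture_of_rewards(medical_query, gate_W, scores)
print(weights.argmax())  # the medical critic dominates for this query
```

Because the weights depend on the query, a disagreement between critics (0.9 vs 0.1 here) no longer averages out to a muddled signal; the relevant expert's opinion carries the reward.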

3. The Head Chef (The Base Model)

The robot (the Head Chef) listens to the Smart Waiter's advice. Using a reinforcement learning method called GRPO, it adjusts its cooking to make the dishes that the specific expert critic likes best. Over time, the robot becomes a master at everything because it knows exactly who to listen to for every specific task.
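The abstract names GRPO as the optimization method. Its core trick is to sample a group of answers to the same prompt, score each with the (mixture) reward, and normalize each reward against the group's own mean and standard deviation. A minimal sketch of that advantage computation, with illustrative reward values:

```python
import numpy as np

def grpo_advantages(group_rewards):
    """Group-relative advantages in the GRPO style: normalize each sampled
    answer's reward against its own group's mean and std."""
    r = np.asarray(group_rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)

# Hypothetical mixture rewards for 4 sampled answers to one prompt.
rewards = [0.9, 0.2, 0.5, 0.4]
adv = grpo_advantages(rewards)
print(adv.round(2))
# Answers scored above the group mean get positive advantage and are
# reinforced; those below get negative advantage and are discouraged.
```

Note that advantages are relative within a group, so no separate value network is needed; the mixture reward alone steers the policy update.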


Why Is This Better? (The Benefits)

  1. Privacy is King: No one ever sees the raw data (the secret ingredients). They only share "I liked this, I didn't like that."
  2. No "One Size Fits All" Failures: In the old method, if you had a weak computer or a bad model in the mix, it dragged everyone down (the "bucket effect," where the weakest link sets the limit). In MoR, the Smart Waiter simply ignores the weak critic for that specific task.
  3. Speed and Efficiency: Sending a "score" or a "preference" is like sending a text message. Sending the whole model is like shipping a truckload of bricks. MoR is much faster and cheaper.
  4. Adaptability: The Smart Waiter learns on the fly. If the robot starts making a new kind of mistake, the Waiter learns to switch critics faster to fix it.

The "Aha!" Moment

The paper argues that we are currently stuck in an era where we try to merge everyone's brains (parameters) to make a better AI. But in a world of privacy and different needs, we should instead merge everyone's opinions (preferences).

Think of it like this:

  • Old Way: Trying to merge the DNA of a fish, a bird, and a human to make a super-creature. It's messy and often fails.
  • MoR Way: Having a fish, a bird, and a human stand in a room. When you need to swim, you ask the fish. When you need to fly, you ask the bird. When you need to talk, you ask the human. You don't merge them; you route the question to the right expert.

Conclusion

MoR is a new framework that lets different organizations train a powerful AI together without sharing their private data. It uses a "Smart Waiter" to decide which expert's opinion matters most for each specific question. This makes the AI smarter, safer, and more respectful of privacy, especially when dealing with complex tasks like medical diagnosis or financial analysis.
