FedBPrompt: Federated Domain Generalization Person Re-Identification via Body Distribution Aware Visual Prompts

Imagine you are trying to teach a group of security guards (the AI model) how to recognize a specific person walking through a busy city, but you can't bring all the guards to the same training room. Instead, they are scattered across different cities, and each city has its own unique rules, lighting, and camera angles. This is the world of Federated Learning: training an AI without ever moving the private data from the local cameras.

The specific task here is Person Re-Identification (ReID): finding "Person A" in Camera 1, then finding "Person A" again in Camera 2, even if the background, lighting, or viewing angle has changed.

The problem is that when these guards train together from afar, they get confused. Here is the paper's solution, explained simply:

The Two Big Problems

The paper identifies two main reasons why the AI gets confused in this "remote training" scenario:

The "Distracted Guard" (Background Noise):
Imagine a guard is looking for a person in a red shirt. But in one city, the background is also full of red walls and red cars. The AI gets distracted by the background and thinks, "Oh, that red wall must be the person!" It loses focus on the actual human.
The "Broken Puzzle" (Viewpoint Mismatch):
Imagine trying to recognize a friend by looking at a photo of their face, but in another photo, you only see their back. If the AI tries to match a face to a back, it fails. In the real world, cameras are at different heights and angles. The AI sees the "head" in one photo and the "feet" in another, and it doesn't know how to put those pieces together to say, "That's the same person."

The Solution: FedBPrompt

The authors propose a new system called FedBPrompt. Think of this as giving the security guards a set of specialized "sticky notes" (called Visual Prompts) that they can stick onto the camera feed to help them focus.

Instead of retraining the entire massive brain of the AI (which is slow and expensive to send over the internet), they only train these tiny, lightweight sticky notes.

Here is how the "Sticky Notes" work:

1. The "Body Part" Notes (Alignment)

To fix the "Broken Puzzle" problem, the system uses three specific sticky notes:

One note says: "Look at the Head/Shoulders."
One note says: "Look at the Torso."
One note says: "Look at the Legs."

These notes force the AI to pay attention to specific body parts, regardless of the angle. Even if the camera is tilted, the "Head" note knows to look at the top, and the "Legs" note knows to look at the bottom. This lets the AI compare head with head and legs with legs across cameras, so it can tell that two photos show the same person even when they were taken from different angles.

2. The "Whole Person" Note (Focus)

To fix the "Distracted Guard" problem, there is one giant sticky note that says: "Ignore the background! Look at the WHOLE person!"

This note helps the AI ignore the red walls, the trees, or the cars. It tells the AI, "Don't get distracted by the scenery; focus entirely on the human shape."

The Magic Trick: "Freezing the Brain"

Usually, to update an AI, you have to send the entire "brain" (which is huge: roughly 86 million numbers in this case) back and forth between the city guards and the main office. This is slow and uses a lot of internet bandwidth.

The authors came up with a clever trick called the Prompt-based Fine-Tuning Strategy (PFTS):

They take the main AI brain and freeze it (lock it in place so it can't change).
They only send the tiny sticky notes (the prompts) back and forth.
The Result: Instead of sending about 86 million numbers, they only send about 0.46 million, less than 1% of the full model. It's like sending a 1-page memo instead of a whole encyclopedia. This makes each round of communication far cheaper, while accuracy stays comparable to updating the whole brain.

Why This Matters

Privacy: The actual photos of people never leave the local cameras.
Speed: Because they only send tiny updates, each round of communication is much lighter and quicker.
Accuracy: By forcing the AI to look at body parts and ignore backgrounds, it becomes much better at finding the right person, even in chaotic, crowded, or weirdly angled environments.

In short: The paper teaches a group of remote security guards to ignore the background noise and focus on specific body parts using tiny, efficient "mental notes," allowing them to recognize people accurately without needing to share private photos or heavy data files.

1. Problem Statement

The paper addresses Federated Domain Generalization for Person Re-Identification (FedDG-ReID). This task aims to train a global model across multiple decentralized clients (e.g., different camera networks) without sharing raw data, while ensuring the model generalizes to unseen target domains.

The authors identify two critical challenges exacerbated by client heterogeneity in this setting:

Background-Induced Defocusing: Vision Transformers (ViT), the current mainstream backbone, rely on global attention mechanisms. In FedDG-ReID, clients often have diverse and dominant backgrounds. The model tends to attend to these irrelevant background features rather than the pedestrian, leading to false matches between different individuals.
Viewpoint-Induced Misalignment: Clients capture pedestrians from diverse viewpoints. This causes severe misalignment of body parts (e.g., head, torso, legs) across different clients. Standard global attention fails to align these parts, drastically reducing feature similarity for the same individual and causing mismatches.

Additionally, updating full ViT models in a federated setting incurs prohibitive communication costs, making standard training inefficient for resource-constrained environments.

2. Methodology: FedBPrompt

The authors propose FedBPrompt, a framework integrating two core components: the Body Distribution Aware Visual Prompts Mechanism (BAPM) and the Prompt-based Fine-Tuning Strategy (PFTS).

A. Body Distribution Aware Visual Prompts Mechanism (BAPM)

BAPM introduces learnable visual prompts into the ViT architecture to explicitly guide attention toward pedestrian-centric regions. It partitions the prompt set into two functionally distinct groups:

Body Part Alignment Prompts ($P_{upper}$, $P_{mid}$, $P_{lower}$):
- Purpose: To tackle viewpoint-induced misalignment.
- Mechanism: These prompts are assigned to specific spatial regions of the image (upper, middle, and lower body).
- Constrained Attention: A specialized attention mask is applied. Part-specific prompts can only attend to image patches within their corresponding spatial region. This forces the model to learn robust, part-level features regardless of the global viewpoint.
Holistic Full Body Prompts ($P_{Full}$):
- Purpose: To tackle background-induced defocusing.
- Mechanism: These prompts attend to the entire image globally.
- Interaction: Crucially, all prompts (both part-specific and holistic) can attend to each other via self-attention. This allows the model to learn structured, part-level features while maintaining a coherent global context, ensuring consistency across diverse client viewpoints.
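
The constrained attention described above can be sketched as a boolean mask. This is a minimal illustration, not the authors' implementation: the token order ([CLS], part prompts, holistic prompts, patches), the split of the patch grid into three equal horizontal bands, and the names `build_bapm_mask`, `n_part`, and `n_full` are all assumptions made here. Only the part prompts are restricted; every other token attends globally, and all prompts can see each other.

```python
from typing import List


def build_bapm_mask(grid_h: int, grid_w: int,
                    n_part: int = 1, n_full: int = 1) -> List[List[bool]]:
    """Return mask[q][k] == True where query token q may attend to key token k.

    Assumed token order (hypothetical, for illustration only):
    [CLS] + upper prompts + mid prompts + lower prompts + full prompts + patches.
    """
    n_prompts = 3 * n_part + n_full
    n_patches = grid_h * grid_w
    first_patch = 1 + n_prompts          # index of the first patch token
    n_tokens = first_patch + n_patches

    # Start fully permissive: CLS, holistic prompts and patches attend globally,
    # and every prompt can attend to every other prompt (and, by assumption, to CLS).
    mask = [[True] * n_tokens for _ in range(n_tokens)]

    def band_of(row: int) -> int:
        # Assumed partition: three roughly equal horizontal bands
        # (0 = upper, 1 = mid, 2 = lower).
        return min(3 * row // grid_h, 2)

    # Restrict each part prompt to the patches inside its own band.
    for group in range(3):
        for j in range(n_part):
            q = 1 + group * n_part + j
            for p in range(n_patches):
                row = p // grid_w
                mask[q][first_patch + p] = (band_of(row) == group)
    return mask


mask = build_bapm_mask(grid_h=6, grid_w=2)
```

With a 6×2 patch grid and one prompt per group, the upper prompt (token 1) may attend to the four patches in the top two rows plus the [CLS] and prompt tokens. In an actual ViT layer, the `False` entries would typically be turned into large negative values added to the attention logits before the softmax.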

B. Prompt-based Fine-Tuning Strategy (PFTS)

To mitigate the high communication overhead of federated learning with large ViT models:

Frozen Backbone: The pre-trained ViT backbone parameters ($\Theta_b$) are frozen on all clients and the server.
Lightweight Updates: The lightweight prompt parameters ($\Theta_p$) are randomly initialized on the clients, and they are the only parameters updated during local training.
Communication Efficiency: Only the prompt updates are transmitted to the server for aggregation. This reduces the communication volume to less than 1% of the full model size (~0.46M parameters vs. ~86M for the full model, i.e. about 0.5%).
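
One way to picture the PFTS exchange is a server that averages only the prompt parameters. The sketch below assumes a FedAvg-style average weighted by client sample counts, which the description above does not specify; `aggregate_prompts` and `comm_fraction` are hypothetical helper names, and prompts are flattened to plain lists of floats for brevity.

```python
from typing import Dict, List


def aggregate_prompts(client_prompts: List[Dict[str, List[float]]],
                      client_sizes: List[int]) -> Dict[str, List[float]]:
    """Sample-weighted average of prompt parameters only.

    The frozen backbone is never transmitted, so it does not appear here.
    Weighting by client dataset size is an assumption (FedAvg-style).
    """
    total = sum(client_sizes)
    averaged: Dict[str, List[float]] = {}
    for name in client_prompts[0]:
        acc = [0.0] * len(client_prompts[0][name])
        for prompts, size in zip(client_prompts, client_sizes):
            weight = size / total
            for i, value in enumerate(prompts[name]):
                acc[i] += weight * value
        averaged[name] = acc
    return averaged


def comm_fraction(prompt_params: float, full_params: float) -> float:
    """Fraction of the full model that is sent when only prompts are shared."""
    return prompt_params / full_params


# Two toy clients holding 1 and 3 samples respectively.
global_prompts = aggregate_prompts(
    [{"P_full": [0.0, 4.0]}, {"P_full": [4.0, 0.0]}],
    client_sizes=[1, 3],
)
fraction = comm_fraction(0.46e6, 86e6)  # parameter counts quoted in the text
```

Here `global_prompts["P_full"]` is `[3.0, 1.0]`, and `fraction` comes out at roughly 0.005, which is where the "over 99% reduction" figure comes from.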

3. Key Contributions

FedBPrompt Framework: A novel approach for FedDG-ReID that uses learnable visual prompts to explicitly guide Transformer attention, effectively mitigating background bias and viewpoint misalignment.
BAPM Mechanism: A structured prompting design that separates prompts into "Part Alignment" and "Holistic Full Body" groups. By enforcing constrained local attention for parts and global attention for the whole, it achieves robust feature alignment and background suppression.
PFTS Strategy: A communication-efficient training protocol that freezes the heavy ViT backbone and updates only lightweight prompts, reducing communication costs by over 99% while maintaining high performance.
Plug-and-Play Integration: Both BAPM and PFTS are designed to be easily integrated into existing ViT-based FedDG-ReID frameworks.

4. Experimental Results

The method was evaluated on four ReID datasets (CUHK02, CUHK03, Market1501, MSMT17) under two protocols: Leave-One-Out (cross-domain, testing on a dataset held out from training) and Source-Domain Performance (in-domain).
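
The Leave-One-Out protocol can be enumerated in a few lines. This sketch assumes that each source dataset acts as one federated client, which is a common setup but is not stated explicitly above; `leave_one_out_splits` is a hypothetical helper name.

```python
from typing import List, Tuple

DATASETS = ["CUHK02", "CUHK03", "Market1501", "MSMT17"]


def leave_one_out_splits(datasets: List[str]) -> List[Tuple[List[str], str]]:
    """Return (source_clients, unseen_target) pairs, one per held-out dataset."""
    splits = []
    for target in datasets:
        sources = [d for d in datasets if d != target]
        splits.append((sources, target))
    return splits


splits = leave_one_out_splits(DATASETS)
for sources, target in splits:
    print(" + ".join(sources), "->", target)
```

The last split trains on CUHK02, CUHK03, and Market1501 and tests on MSMT17, which corresponds to the "M+C2+C3 $\to$ MS" task.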

Performance Gains:
- On the challenging "M+C2+C3 $\to$ MS" task, BAPM improved the strong baseline SSCU by 3.4% in mAP and 5.8% in Rank-1.
- On weaker baselines such as FedProx, the gains were larger: 13.9% in mAP and 13.3% in Rank-1.
- The full-parameter strategy outperformed the state-of-the-art (SSCU) by an average of 3.3% mAP and 4.9% Rank-1 across all scenarios.
Communication Efficiency: PFTS achieved comparable performance to full-parameter training while communicating only ~0.46M parameters (vs. ~86M), validating its efficiency.
Ablation Studies:
- Removing either the Part Alignment Prompts or the Holistic Prompts resulted in performance drops, indicating that both components contribute.
- The combination of both groups yielded the best results, supporting the claim that structural guidance is important for handling misalignment.
Visualization:
- Attention Maps: Unlike baselines that scatter attention across backgrounds, FedBPrompt concentrates attention on the pedestrian body. Part prompts successfully localized specific body regions even under severe cropping and occlusion.
- Feature Space (t-SNE): Features learned by FedBPrompt showed significantly higher intra-domain compactness and inter-domain separability compared to baselines, particularly for the MSMT17 domain.
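
For readers unfamiliar with the two metrics reported above, the following is a simplified sketch of how Rank-1 and mAP are computed from ranked gallery lists. It deliberately omits the same-camera filtering that standard ReID evaluation usually applies, and `rank1_and_map` is a hypothetical helper name rather than code from the paper.

```python
from typing import List, Tuple


def rank1_and_map(ranked_ids: List[List[str]],
                  query_ids: List[str]) -> Tuple[float, float]:
    """Compute Rank-1 accuracy and mean Average Precision.

    ranked_ids[i] holds the gallery identity labels sorted from most to
    least similar to query i. Same-camera filtering is omitted (simplification).
    """
    hits = 0
    average_precisions = []
    for ranking, qid in zip(ranked_ids, query_ids):
        # Rank-1: is the top-ranked gallery image the right identity?
        if ranking and ranking[0] == qid:
            hits += 1
        # AP: mean of the precision values at each correct match position.
        n_correct = 0
        precisions = []
        for position, gid in enumerate(ranking, start=1):
            if gid == qid:
                n_correct += 1
                precisions.append(n_correct / position)
        average_precisions.append(
            sum(precisions) / len(precisions) if precisions else 0.0
        )
    n_queries = len(query_ids)
    return hits / n_queries, sum(average_precisions) / n_queries


rank1, mean_ap = rank1_and_map(
    ranked_ids=[["a", "b", "a"], ["a", "b", "b"]],
    query_ids=["a", "b"],
)
```

In this toy case the first query is matched at rank 1 and the second is not, so `rank1` is 0.5, while `mean_ap` is 17/24 (about 0.708) because mAP also rewards correct matches further down the list.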

5. Significance

This work makes a significant contribution to the field of Federated Learning and Computer Vision by:

Tackling the "ViT Limitation" in FL: It targets a specific weakness of ViTs (global attention being distracted by backgrounds) in heterogeneous federated environments, a problem often overlooked in standard DG-ReID.
Bridging Privacy and Performance: It demonstrates that high-performance domain generalization is achievable without centralizing data, while substantially reducing the communication bottleneck that usually plagues ViT-based federated training.
Providing a Generalizable Solution: The proposed mechanisms (BAPM and PFTS) are not tied to a specific dataset but offer a flexible architecture for future federated vision tasks involving distribution shifts.

The code is publicly available, facilitating further research in federated person re-identification.