Imagine you have a massive, incredibly smart library (a Large Language Model) that knows everything about the world. But you want to teach it a very specific skill, like writing legal contracts or diagnosing rare diseases. You can't just rewrite the whole library because it's too huge and expensive. Instead, you use a clever trick called LoRA (Low-Rank Adaptation).
Think of LoRA like adding a small, specialized notebook to the library. Instead of rewriting the books, you just write new notes in this small notebook that tell the library how to handle specific tasks.
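If code helps, here is the LoRA trick in a few lines of numpy (a generic sketch with made-up shapes, not anything specific to this paper):

```python
import numpy as np

rng = np.random.default_rng(0)

# Sketch: freeze the big weight W and learn only two thin factors,
# B and A, so the adapted layer computes x @ (W + B @ A).
d, r = 8, 2                        # r << d is what makes LoRA cheap
W = rng.normal(size=(d, d))        # frozen pretrained weight ("the library")
B = np.zeros((d, r))               # trainable factors ("the notebook")
A = rng.normal(size=(r, d))

delta = B @ A                      # the low-rank update
print(delta.shape == W.shape)      # True: same shape as W...
print(B.size + A.size < W.size)    # True: ...but far fewer numbers to train
```

With d = 8 and r = 2 the notebook holds 32 numbers versus 64 in the library; at real model sizes (d in the thousands, r under 64) the gap is enormous.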
The Problem: The "Two-Notebook" Chaos
In the real world, the data needed to teach that skill is scattered across many different people's computers (clients), and they can't share their private data (like medical records or personal chats) with a central boss (the server). This setup is called Federated Learning.
The old way of doing this with LoRA was like asking everyone to send two separate notebooks (let's call them Notebook A and Notebook B) to the boss.
- The Mixing Error: The boss tries to combine everyone's Notebook A's into one big "Master A" and everyone's Notebook B's into one big "Master B." Then, the boss multiplies them together.
- The Analogy: Imagine asking 100 people to mix their own secret sauces. If you pour all the "salt" jars into one pile and all the "pepper" jars into another, then combine the two piles, it tastes different than if each person had mixed their own salt and pepper first. In short: the average of the products is not the product of the averages, so the final result is noisy and inaccurate.
- The Drift Problem: To fix the mixing error, some researchers had everyone send the combined result of the two notebooks instead. But then the boss has to break that big result back into two small notebooks before sending it back.
- The Analogy: It's like taking a smoothie and trying to separate it back into exactly the original strawberries and bananas. There are a million ways to do it, and every time you do it, you might get slightly different strawberries. Over time, the "strawberries" change so much that the recipe stops working. This is called decomposition drift.
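Both failure modes are easy to see in a toy numpy experiment (illustrative only — the shapes and client count are made up, not the paper's setup):

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 4, 2

# (1) The mixing error: averaging A's and B's separately, then
# multiplying, is NOT the same as averaging the products B_i @ A_i.
clients = [(rng.normal(size=(d, r)), rng.normal(size=(r, d))) for _ in range(3)]
avg_of_products = np.mean([B @ A for B, A in clients], axis=0)
product_of_avgs = (np.mean([B for B, _ in clients], axis=0)
                   @ np.mean([A for _, A in clients], axis=0))
mixing_error = np.linalg.norm(avg_of_products - product_of_avgs)
print(mixing_error > 1e-6)          # True: the naive average is biased

# (2) Decomposition drift: splitting a product back into two factors
# is ambiguous — any invertible M gives equally valid "strawberries".
B, A = clients[0]
M = rng.normal(size=(r, r))          # almost surely invertible
B2, A2 = B @ M, np.linalg.inv(M) @ A
print(np.allclose(B @ A, B2 @ A2))   # True: same smoothie...
print(not np.allclose(B, B2))        # True: ...different strawberries
```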
The Solution: FLoRG (The Single "Blueprint" Approach)
The authors of this paper propose a new method called FLoRG. They realized that instead of sending two notebooks, everyone should just send one single "Blueprint" (a Gram matrix).
Here is how FLoRG works, using a creative analogy:
1. The Shared Frame (The Semi-Orthogonal Basis)
Imagine everyone is building a house. Instead of everyone bringing their own random bricks, they all agree to use the same pre-built steel frame (Matrices L and R). This frame is rigid and shared by everyone.
- The Innovation: Instead of sending two separate sets of instructions (A and B), everyone just sends a single sheet of paper called a Gram Matrix. Think of this as a "relationship map" that describes how the parts of the notebook fit together inside the shared frame.
- Why it's better: When the boss collects these "relationship maps," they can simply add them up, and the result is exact. There is no "salt vs. pepper" mixing error because the aggregation is linear and clean. It's like adding up the total weight of ingredients rather than trying to guess the flavor profile.
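Here is a toy sketch of that idea, under the assumed simplified form update = L @ G @ R.T with shared frames L and R (inspired by the description above; FLoRG's exact formulation may differ):

```python
import numpy as np

rng = np.random.default_rng(2)

# Assumed setup: every client expresses its update as L @ G_i @ R.T
# with SHARED semi-orthogonal frames L and R; only the small
# "relationship map" G_i differs per client.
d, r = 6, 2
L = np.linalg.qr(rng.normal(size=(d, r)))[0]   # shared frame (orthonormal columns)
R = np.linalg.qr(rng.normal(size=(d, r)))[0]

Gs = [rng.normal(size=(r, r)) for _ in range(3)]   # per-client maps
updates = [L @ G @ R.T for G in Gs]                # full per-client updates

# Averaging the small G's and expanding once gives EXACTLY the
# average of the full updates — the aggregation is linear, so
# nothing is lost or mixed up.
aggregated = L @ np.mean(Gs, axis=0) @ R.T
print(np.allclose(aggregated, np.mean(updates, axis=0)))   # True
```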
2. The "Procrustes" Alignment (The Magic Mirror)
Even with the perfect "relationship map," when the boss breaks it down to send it back to the clients, there's still a risk of the "strawberries" changing (the drift problem).
- The Analogy: Imagine you have a photo of a person (the previous round's notebook). The boss creates a new photo from the data, but the new photo is slightly rotated or stretched. If you just use the new photo, the person looks weird compared to the old one.
- The Fix: FLoRG uses a technique called Procrustes Alignment. Think of this as a magic mirror that rotates and stretches the new photo just enough so that it matches the shape of the old photo perfectly, without changing the actual content (the Gram matrix).
- The Result: The "person" in the photo looks exactly the same as they did yesterday, just with new information added. This prevents the "drift" and keeps the learning stable.
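The "magic mirror" is the classic orthogonal Procrustes problem, solved with a single SVD. Here is a toy numpy version (the standard textbook solution; the paper's exact variant may differ):

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy setup: the new factor spans the same content as last round's
# factor, but arrives rotated ("the photo is slightly turned").
d, r = 6, 2
prev = rng.normal(size=(d, r))                      # last round's factor
Q_true = np.linalg.qr(rng.normal(size=(r, r)))[0]   # hidden rotation
new = prev @ Q_true                                 # same content, rotated

# Orthogonal Procrustes: find the Q (with Q orthogonal) minimizing
# ||new @ Q - prev||_F. The solution is Q = U @ Vt from the SVD
# of new.T @ prev.
U, _, Vt = np.linalg.svd(new.T @ prev)
Q = U @ Vt

aligned = new @ Q
print(np.allclose(aligned, prev))   # True: the drift is rotated away
```

Because Q is orthogonal, the alignment only rotates/reflects the factor; it never changes what the factors represent, which is exactly what keeps the learning stable round after round.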
Why This Matters (The Results)
The paper shows that this new method is a game-changer:
- Smarter Learning: It learns the new task better than the old methods because it doesn't make math errors or get confused by "drifting" instructions.
- Super Fast Communication: Because clients only send one matrix instead of two, and the math is simpler, the amount of data sent over the internet is drastically reduced. The paper claims it can reduce communication costs by up to 2,041 times.
- Analogy: It's like switching from mailing a heavy, double-wrapped package to sending a single, lightweight postcard.
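The arithmetic behind that claim is easy to sanity-check. With assumed sizes (d = 4096, r = 4 — illustrative numbers, not taken from the paper), the per-layer savings look like this:

```python
# A LoRA layer on a d x d weight with rank r uploads A (r x d) plus
# B (d x r), i.e. 2*d*r numbers per round. An r x r Gram matrix
# uploads only r*r numbers.
d, r = 4096, 4
two_notebooks = 2 * d * r          # the old "two notebooks" upload
one_blueprint = r * r              # the single "blueprint" upload
print(two_notebooks // one_blueprint)   # 2048x smaller per layer
```

The exact factor depends on the model's layer sizes and the chosen rank, which is why the paper reports "up to" 2,041 times.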
Summary
FLoRG is like fixing a chaotic group project in which everyone was sending two confusing, mismatched files:
- Old Way: Send two files, mix them up, get confused, and drift apart.
- FLoRG Way: Everyone agrees on a shared frame, sends one simple "relationship map," and uses a magic mirror to keep everything aligned.
The result is a smarter, faster, and more efficient way to teach AI models new skills without anyone having to share their private data.