ZorBA: Zeroth-order Federated Fine-tuning of LLMs with Heterogeneous Block Activation

This paper proposes ZorBA, a zeroth-order federated fine-tuning framework for large language models that reduces VRAM usage and communication overhead through heterogeneous block activation and shared random seeds, while optimizing convergence via a novel lexicographic algorithm.

Chuiyang Meng, Ming Tang, Vincent W. S. Wong

Published 2026-03-06

Imagine you have a massive, incredibly complex library of knowledge (a Large Language Model, or LLM) that you want to teach a new, specific skill, like writing poetry in the style of Shakespeare.

In the old days, to teach this library, you'd need a giant, super-expensive supercomputer (a server with huge VRAM) to do all the heavy lifting. But what if you wanted to teach this library using thousands of regular laptops or phones scattered around the world, without anyone ever sharing their private notes? That's Federated Learning.

However, there's a problem: these "regular" devices don't have enough memory (VRAM) to hold the whole library plus the "homework notes" (gradients) needed to learn. And shipping all that homework back and forth to the central teacher is slow and clogs the network.

Enter ZorBA (Zeroth-order Federated Fine-tuning with Heterogeneous Block Activation). Think of ZorBA as a clever, resourceful study group leader who figures out how to teach this giant library using only small, underpowered devices.

Here is how ZorBA works, broken down into simple concepts:

1. The "No-Notes" Trick (Zeroth-Order Optimization)

Usually, to learn something, you need to write down exactly why you got an answer wrong (calculating gradients via backpropagation). This requires a lot of memory.

  • The ZorBA Way: Instead of writing down the "why," ZorBA uses a "guess and check" method. It asks the model: "What happens if I nudge this tiny part of the library slightly?" and "What happens if I nudge it the other way?" By comparing the results of these two guesses, it figures out the direction to improve without needing to store the complex "why" notes.
  • Analogy: Imagine trying to find the bottom of a dark valley. The old way is to map the entire slope (requires a big map/memory). ZorBA just takes two small steps forward and backward to see which way is downhill. It's simpler and needs less memory.
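The two-step "guess and check" idea above can be sketched in a few lines of Python. This is a toy illustration on a simple quadratic loss, not the paper's LLM setup; the function name `zo_step` and the hyperparameters are our own choices:

```python
import random

def zo_step(w, loss_fn, seed, mu=1e-3, lr=1e-2):
    """One zeroth-order update: nudge the weights forward and backward
    along a random direction, compare the two losses, and step downhill.
    No backpropagation, so no gradient tensors need to be stored."""
    rng = random.Random(seed)
    z = [rng.gauss(0.0, 1.0) for _ in w]                  # random direction
    w_plus = [wi + mu * zi for wi, zi in zip(w, z)]       # nudge one way
    w_minus = [wi - mu * zi for wi, zi in zip(w, z)]      # nudge the other way
    proj_grad = (loss_fn(w_plus) - loss_fn(w_minus)) / (2 * mu)  # slope estimate
    return [wi - lr * proj_grad * zi for wi, zi in zip(w, z)]

# Toy example: walk a 4-dimensional point toward the minimum of ||w - 3||^2.
loss = lambda w: sum((wi - 3.0) ** 2 for wi in w)
w = [0.0] * 4
for step in range(3000):
    w = zo_step(w, loss, seed=step)
# w ends up close to [3.0, 3.0, 3.0, 3.0]
```

Notice that the only per-step scratch state is one random direction and two scalar losses, which is where the memory savings come from.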

2. The "Specialized Team" (Heterogeneous Block Activation)

The library is made of thousands of chapters (called "blocks"). If every student tries to read and update every chapter, their small laptops will crash from memory overload.

  • The ZorBA Way: The central teacher (Server) looks at each student's laptop.
    • Student A has a powerful laptop? "You read Chapters 1 through 10."
    • Student B has a weak laptop? "You just read Chapters 1, 5, and 9."
    • Student C has a tiny phone? "You just read Chapter 3."
  • The Magic: Even though everyone is working on different parts, the teacher combines their insights to update the whole library. This ensures no one's computer crashes, and the group learns faster because everyone is focusing on what they can handle.
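In code, "reading only your chapters" means a client perturbs only its assigned blocks while the rest stay frozen. The sketch below is a simplified illustration under our own naming (`client_zo_update`, blocks as plain lists), not the paper's implementation:

```python
import random

def client_zo_update(blocks, active_ids, loss_fn, seed, mu=1e-3, lr=1e-2):
    """One client-side zeroth-order step that perturbs ONLY the blocks
    assigned to this client; every other block stays frozen, so peak
    memory scales with the active blocks, not the whole model."""
    rng = random.Random(seed)
    z = {b: [rng.gauss(0.0, 1.0) for _ in blocks[b]] for b in active_ids}

    def perturbed(sign):
        return {b: [p + sign * mu * zi for p, zi in zip(ps, z[b])]
                   if b in z else ps
                for b, ps in blocks.items()}

    proj_grad = (loss_fn(perturbed(+1)) - loss_fn(perturbed(-1))) / (2 * mu)
    return {b: [p - lr * proj_grad * zi for p, zi in zip(ps, z[b])]
               if b in z else list(ps)
            for b, ps in blocks.items()}

# A weak device is assigned only "b1"; "b2" must come back untouched.
blocks = {"b1": [0.0, 0.0], "b2": [5.0, 5.0]}
loss = lambda m: sum((p - 3.0) ** 2 for ps in m.values() for p in ps)
updated = client_zo_update(blocks, {"b1"}, loss, seed=0)
```

The server then merges each client's updated blocks, so different devices end up contributing different chapters of the same library.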

3. The "Secret Handshake" (Shared Random Seeds)

Usually, to coordinate the "guess and check" steps, the teacher has to send huge lists of random numbers to every student. This creates a traffic jam on the internet.

  • The ZorBA Way: The teacher and all students agree on a single "Secret Handshake" (a shared random seed) at the start.
  • The Magic: Because they all have the same "seed," they can independently generate the exact same list of random numbers. The teacher doesn't need to send the list; they just say, "Use Seed #42." The students instantly know what numbers to use. This saves a massive amount of data transmission.
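The seed trick is easy to demonstrate: both sides regenerate the same random direction from the same seed, so the only things that ever travel over the network are a seed and one scalar. (The `proj_grad` value below is a made-up illustrative number, not a real measurement.)

```python
import random

def make_direction(seed, dim):
    """Both server and client call this with the same seed and get
    bit-identical random directions, with no vector transmitted."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]

dim = 8
seed = 42                      # "Use Seed #42" is all the teacher sends

# Client side: regenerate the direction, run "guess and check",
# and upload ONE scalar instead of a length-dim vector.
z_client = make_direction(seed, dim)
proj_grad = 0.137              # illustrative scalar the client computed

# Server side: regenerate the identical direction from the same seed...
z_server = make_direction(seed, dim)
assert z_server == z_client    # bit-identical, nothing was transmitted
# ...and reconstruct the full update from just (seed, proj_grad).
lr = 1e-2
update = [-lr * proj_grad * zi for zi in z_server]
```

For an LLM with billions of parameters, replacing a parameter-sized vector with a seed plus a scalar is where the bandwidth savings come from.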

4. The "Smart Scheduler" (The Optimization Algorithm)

The hardest part is deciding who reads which chapters. If you give too many chapters to a weak laptop, it crashes. If you give too few to a strong laptop, the group learns slowly.

  • The ZorBA Way: The paper introduces a mathematical "scheduler" (an algorithm) that acts like a master chef. It calculates the perfect menu for every student:
    • "You have 4GB of memory? Here are 3 chapters."
    • "You have 12GB? Here are 8 chapters."
    • "But make sure Chapter 5 is covered by at least three people so we don't miss anything."
  • The Result: The group learns as fast as possible without anyone's computer exploding.
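To make the scheduling problem concrete, here is a deliberately simplified greedy sketch: blocks are handed to the clients with the most spare memory until each block is covered enough times. This is only an illustration of the constraints involved; the paper's actual scheduler is a lexicographic optimization algorithm, and the memory costs and coverage target below are invented:

```python
def assign_blocks(mem_gb, num_blocks, cost_gb=1.0, min_coverage=2):
    """Greedy sketch: for each block, the clients with the most free
    memory take it first, until the block is covered min_coverage times
    or no one has room left."""
    free = dict(mem_gb)                      # remaining memory per client
    plan = {c: [] for c in mem_gb}           # block IDs assigned per client
    for b in range(num_blocks):
        covered = 0
        for c in sorted(free, key=free.get, reverse=True):
            if covered >= min_coverage:
                break
            if free[c] >= cost_gb:           # never exceed a device's budget
                plan[c].append(b)
                free[c] -= cost_gb
                covered += 1
    return plan

# Hypothetical fleet: a phone, a laptop, and a workstation.
clients = {"phone": 2.0, "laptop": 6.0, "workstation": 12.0}
plan = assign_blocks(clients, num_blocks=6)
```

Even this naive version shows the two competing constraints: no client is assigned more than its memory allows, while every block still gets enough readers.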

Why is this a big deal?

The paper tested ZorBA against other methods and found:

  • Memory Savings: It reduced the memory needed on devices by up to 62%. That's like turning a supercomputer task into something a gaming laptop can handle.
  • Speed: It learned faster than other "guess and check" methods because it assigned the right tasks to the right people.
  • Efficiency: It barely used any internet bandwidth because of the "Secret Handshake" trick.

In a nutshell: ZorBA is a smart, collaborative way to fine-tune giant AI models using thousands of small, weak devices. It splits the work based on what each device can handle, uses a clever "guess and check" method to save memory, and uses a shared secret seed to save internet data. It turns an impossible task into a manageable group project.
