Scalar Federated Learning for Linear Quadratic Regulator

Imagine you are the captain of a massive fleet of drones. Your goal is to teach all of them the perfect flight pattern to save the most battery and avoid obstacles. This is a classic "Linear Quadratic Regulator" (LQR) problem—a fancy way of saying "find the best control rule."

The problem? You don't know the exact equations that govern each drone, so you can't just solve the math on paper. You have to learn by doing: send the drones out, let them wobble (or crash a few times), measure the results, and then adjust their rules. This is called "model-free learning."

Here is the catch:

The Fleet is Huge: You have hundreds of drones.
The Data is Heavy: To figure out the perfect rule, each drone needs to send you a massive "instruction manual" (a huge list of numbers) back to the central server.
The Bandwidth is Tiny: Your radio connection is slow. Sending these huge manuals from hundreds of drones would clog the network instantly.
The Cost is Real: Every time a drone flies a test pattern, it burns battery and risks crashing. You don't want to waste these expensive "test flights."

The Old Way: FedLQR

Previously, researchers tried to solve this by having every drone send its full, heavy instruction manual back to the server.

Pros: The server gets a very clear picture.
Cons: It chokes the network. If you have 100 drones, the server has to process 100 huge files. It's like trying to download a 4K movie from 100 different friends at the same time on a dial-up connection.

The New Way: SCALARFEDLQR

The authors of this paper, Mohammadreza Rostami and colleagues, came up with a clever trick called SCALARFEDLQR.

Here is how it works, using a simple analogy:

The Analogy: The "Blindfolded Hiker" and the "Compass"

Imagine each drone is a hiker trying to find the bottom of a valley (the best policy).

The Old Way: Every hiker sends a detailed map of the entire terrain back to the base camp. This takes forever to transmit.
The New Way: Instead of sending a map, each hiker picks a random direction (like "North-East") using a seed that the base camp can replay.
1. The hiker takes a tiny step in that direction and feels the slope.
2. They don't send a map. They just send back one single number: "How steep is the ground in that direction?" (This is the scalar projection).
3. They also send a tiny "seed" (a password) so the base camp can regenerate exactly the random direction they used.

Why is this magic?

Tiny Messages: Instead of sending a 100-page map, the drone sends a single postcard with one number on it. This reduces the data load from "O(d)" (huge) to "O(1)" (tiny), regardless of how complex the drone is.
The Magic of Numbers: The server receives thousands of these "one-number" messages. Because the directions were random but known (thanks to the seed), the server can mathematically reconstruct the average direction of the slope. It's like listening to a thousand people whispering "up" or "down" in random directions; if you average them out, you get a very accurate sense of which way the hill actually goes.
The "More the Merrier" Effect: Here is the coolest part. Usually, in math, throwing away information makes things worse. But here, the more drones you have, the better it gets.
- If you have 10 drones, the random noise might be a bit messy.
- If you have 1,000 drones, most of the random noise cancels out. The server gets a much clearer picture of the best direction to go, even though it only received tiny, one-number messages.

The Results

The paper proves two big things:

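Safety: Even though the drones are sending tiny, incomplete messages, the math guarantees that, as long as the step size is chosen carefully and the estimation error stays small enough, the shared control rule never becomes one that destabilizes any drone. The fleet stays in the "safe zone" the whole time.
Speed: Round for round, the fleet learns about as fast as the old method, but uses a fraction of the radio bandwidth.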

The Bottom Line

Think of SCALARFEDLQR as a way to coordinate a massive army of robots without clogging the communication lines. Instead of shouting detailed battle plans, each soldier just whispers a single number into a walkie-talkie. When you combine thousands of those whispers, you get a near-optimal strategy, saving battery, time, and bandwidth.

It turns a communication bottleneck into a non-issue, allowing us to control huge fleets of robots (like drone swarms or self-driving cars) efficiently and safely.

1. Problem Formulation

The paper addresses the challenge of model-free policy optimization for a fleet of heterogeneous agents governed by Linear Quadratic Regulator (LQR) dynamics.

System Model: A network of $M$ agents, each with distinct but similar discrete-time linear time-invariant (LTI) dynamics ( $x^{(n)}_{t+1} = A^{(n)}x^{(n)}_t + B^{(n)}u^{(n)}_t$ ). The system matrices $(A^{(n)}, B^{(n)})$ are unknown.
Objective: To cooperatively learn a single common policy gain $K$ that minimizes the average LQR cost across all agents: $J_{avg}(K) = \frac{1}{M}\sum_{n=1}^M J^{(n)}(K)$ .
Constraints & Bottlenecks:
1. Communication Overhead: Standard Federated LQR (FedLQR) requires agents to transmit full gradient matrices of dimension $d = n_u \times n_x$ . This results in $O(d)$ uplink cost per agent and $O(Md)$ total server cost, which is prohibitive for large fleets or high-dimensional systems.
2. Sample Inefficiency: Model-free learning requires zeroth-order (ZO) gradient estimation via trajectory rollouts, which is expensive in physical systems (e.g., drones, power grids).
3. Stability: Under heterogeneous dynamics, a policy stabilizing one agent may destabilize another. The algorithm must ensure all iterates remain within the common stabilizing set $S = \bigcap S^{(n)}$ .
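The objective and the stability constraint above can be made concrete with a small NumPy sketch. Everything here is illustrative rather than taken from the paper: the cost matrices `Q` and `R`, the control convention $u = -Kx$, the unit initial-state covariance, the two toy systems, and the generic two-point zeroth-order estimator are all assumptions. In a truly model-free setting the cost would come from trajectory rollouts; the exact Lyapunov-based cost below only stands in for those rollouts.

```python
import numpy as np

def lqr_cost(A, B, K, Q, R):
    """Cost J(K) = trace(P) for u = -K x with unit initial-state covariance.

    Returns inf when A - B K is not Schur stable (K lies outside S^(n)).
    """
    A_cl = A - B @ K
    if np.max(np.abs(np.linalg.eigvals(A_cl))) >= 1.0:
        return np.inf
    n = A.shape[0]
    Q_K = Q + K.T @ R @ K
    # Solve the Lyapunov equation P = Q_K + A_cl^T P A_cl by vectorization.
    lhs = np.eye(n * n) - np.kron(A_cl.T, A_cl.T)
    P = np.linalg.solve(lhs, Q_K.reshape(-1)).reshape(n, n)
    return float(np.trace(P))

def average_cost(systems, K, Q, R):
    """J_avg(K): mean of the per-agent costs (inf if any agent is unstable)."""
    return float(np.mean([lqr_cost(A, B, K, Q, R) for A, B in systems]))

def in_common_stabilizing_set(systems, K):
    """True when K stabilizes every agent, i.e. K is in the intersection S."""
    return all(np.max(np.abs(np.linalg.eigvals(A - B @ K))) < 1.0
               for A, B in systems)

def two_point_zo_gradient(cost_fn, K, radius, num_samples, rng):
    """Generic two-point smoothing estimator (an assumed stand-in)."""
    d = K.size
    grad = np.zeros_like(K)
    for _ in range(num_samples):
        U = rng.standard_normal(K.shape)
        U /= np.linalg.norm(U)  # random direction on the unit sphere
        diff = cost_fn(K + radius * U) - cost_fn(K - radius * U)
        grad += (d / (2.0 * radius)) * diff * U
    return grad / num_samples

# Two hypothetical agents with similar but distinct dynamics.
rng = np.random.default_rng(0)
A1 = np.array([[0.9, 0.1], [0.0, 0.8]])
A2 = A1 + 0.05 * np.eye(2)
B = np.eye(2)
systems = [(A1, B), (A2, B)]
Q, R = np.eye(2), np.eye(2)
K0 = 0.5 * np.eye(2)

print(in_common_stabilizing_set(systems, K0))
print(average_cost(systems, K0, Q, R))
g_hat = two_point_zo_gradient(lambda K: lqr_cost(A1, B, K, Q, R), K0, 0.01, 50, rng)
print(g_hat.shape)
```

Returning an infinite cost for a destabilizing gain makes the stability constraint visible: any update that leaves $S$ for even one agent drives $J_{avg}$ to infinity.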

2. Methodology: SCALARFEDLQR

The authors propose SCALARFEDLQR, a communication-efficient federated algorithm that replaces full gradient transmission with scalar projections.

Core Mechanism:
1. Local Estimation: Each agent $n$ computes a local zeroth-order gradient estimate $\tilde{g}_{t,n}$ using trajectory rollouts.
2. Scalar Projection: Instead of sending $\tilde{g}_{t,n} \in \mathbb{R}^d$ , the agent draws a random Rademacher direction vector $v_{t,n} \in \{-1, +1\}^d$ from a pseudorandom seed $\xi_{t,n}$ that the server can replay. It computes the scalar projection $r^n_t = \langle v_{t,n}, \tilde{g}_{t,n} \rangle$ .
3. Transmission: The agent uploads only the scalar value $r^n_t$ and the seed $\xi_{t,n}$ to the server.
4. Server Aggregation: The server regenerates the same vectors $v_{t,n}$ using the received seeds and reconstructs a global descent direction:
  $\bar{g}_t = \frac{d}{M} \sum_{n=1}^M r^n_t v_{t,n}$
5. Update: The server updates the global policy: $K_{t+1} = K_t - \eta \bar{g}_t$ .
Communication Complexity:
- Per-agent uplink: Reduced from $O(d)$ to $O(1)$ (one scalar + one seed).
- Total server cost: Reduced from $O(Md)$ to $O(M)$ .
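The projection and seed-based reconstruction steps can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the seed scheme (one integer per agent), the noise level, and the step size `eta` are hypothetical. One scaling assumption matters: the $d/M$ factor in the reconstruction formula is unbiased for unit-norm directions, so the sketch scales the $\pm 1$ entries by $1/\sqrt{d}$. With raw $\pm 1$ entries, $E[vv^\top] = I$ and a plain $1/M$ average would already be unbiased.

```python
import numpy as np

def direction_from_seed(seed, d):
    """Regenerate a unit-norm Rademacher direction from an integer seed."""
    rng = np.random.default_rng(seed)
    signs = rng.integers(0, 2, size=d) * 2 - 1  # entries in {-1, +1}
    return signs / np.sqrt(d)

def agent_message(grad_estimate, seed):
    """Agent side: upload one scalar and one seed instead of d numbers."""
    v = direction_from_seed(seed, grad_estimate.size)
    r = float(v @ grad_estimate.reshape(-1))
    return r, seed

def server_aggregate(messages, shape):
    """Server side: replay each direction and form (d / M) * sum_n r_n v_n."""
    d = int(np.prod(shape))
    total = np.zeros(d)
    for r, seed in messages:
        total += r * direction_from_seed(seed, d)
    return (d / len(messages)) * total.reshape(shape)

# Hypothetical round: n_u = n_x = 3 (d = 9) and M = 2000 agents whose local
# estimates are a shared gradient plus small noise.
rng = np.random.default_rng(1)
shape, M = (3, 3), 2000
true_grad = rng.standard_normal(shape)
messages = [
    agent_message(true_grad + 0.01 * rng.standard_normal(shape), seed=n)
    for n in range(M)
]
g_bar = server_aggregate(messages, shape)

eta = 0.05                      # assumed step size
K = np.zeros(shape)
K_next = K - eta * g_bar        # global policy update
rel_err = np.linalg.norm(g_bar - true_grad) / np.linalg.norm(true_grad)
print(rel_err)
```

Each message is a `(float, int)` pair regardless of `d`, which is the $O(1)$ uplink cost; the server does $O(d)$ work per message only to replay the direction locally.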

3. Key Contributions & Theoretical Results

The paper provides rigorous theoretical guarantees regarding stability and convergence under standard regularity conditions (Local Smoothness and Polyak-Łojasiewicz (PL) condition).

Stability Guarantee (Theorem 1):
The authors prove that if the stepsize $\eta$ is chosen appropriately and the total gradient error (sum of zeroth-order estimation error and scalar projection reconstruction error) is bounded relative to the true gradient norm, all iterates remain within the common stabilizing set $S$ . This ensures the learned policy never destabilizes any agent in the fleet.
Linear Convergence (Theorem 2):
Under the PL condition, the algorithm achieves linear convergence to the optimal average cost. The convergence rate is:
$J_{avg}(K_t) - J^*_{avg} \leq \left( 1 - \frac{\mu_c(1-\beta)^2}{L_c(1+\beta)^2} \right)^t (J_{avg}(K_0) - J^*_{avg})$
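where $\beta$ represents the relative error bound, and $\mu_c$ and $L_c$ denote the PL and local smoothness constants from the regularity conditions above.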
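To see how the relative error $\beta$ affects the rate, the contraction factor in the bound can be evaluated directly. The constants below ($\mu_c = 1$, $L_c = 10$) are made-up values chosen only for illustration, and the helper names are hypothetical.

```python
import math

def contraction_factor(mu_c, L_c, beta):
    """Per-round factor 1 - mu_c (1 - beta)^2 / (L_c (1 + beta)^2)."""
    if not 0.0 <= beta < 1.0:
        raise ValueError("beta must lie in [0, 1)")
    return 1.0 - (mu_c * (1.0 - beta) ** 2) / (L_c * (1.0 + beta) ** 2)

def rounds_to_tolerance(mu_c, L_c, beta, ratio):
    """Smallest t such that factor**t <= ratio (fraction of the initial gap)."""
    rho = contraction_factor(mu_c, L_c, beta)
    return math.ceil(math.log(ratio) / math.log(rho))

for beta in (0.0, 0.1, 0.5):
    rho = contraction_factor(1.0, 10.0, beta)
    t = rounds_to_tolerance(1.0, 10.0, beta, 1e-3)
    print(f"beta={beta:.1f}  factor={rho:.4f}  rounds to 1e-3 gap: {t}")
```

With these illustrative constants, an error-free gradient ($\beta = 0$) gives a factor of 0.9, and the factor moves toward 1 as $\beta$ grows, so more rounds are needed for the same accuracy.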
The "Large-Scale Advantage":
A critical theoretical insight is that the projection-induced error diminishes as the number of agents $M$ increases.
- The error bound scales roughly with $\sqrt{d/M}$ .
- Implication: Larger fleets allow for a smaller $\beta$ , which permits larger step sizes and faster convergence. Thus, SCALARFEDLQR becomes more efficient and stable as the fleet size grows, effectively decoupling communication cost from system dimension.
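The $\sqrt{d/M}$ trend can be checked with a short Monte Carlo experiment. This sketch isolates the projection error by giving every agent the same noise-free gradient, which is a simplification and not the paper's heterogeneous setting. It also assumes unit-norm Rademacher directions (entries $\pm 1/\sqrt{d}$), so that the $d/M$ reconstruction is unbiased. Under those assumptions the mean-squared relative error works out to $(d-1)/M$.

```python
import numpy as np

def relative_reconstruction_error(g, M, rng):
    """Relative error of (d / M) * sum_n <v_n, g> v_n for one random draw."""
    d = g.size
    V = rng.choice([-1.0, 1.0], size=(M, d)) / np.sqrt(d)  # one row per agent
    r = V @ g                        # the M uploaded scalars
    g_bar = (d / M) * (V.T @ r)      # server-side reconstruction
    return np.linalg.norm(g_bar - g) / np.linalg.norm(g)

rng = np.random.default_rng(0)
d, trials = 9, 100
g = rng.standard_normal(d)

rms_error = {}
for M in (10, 100, 1000, 10000):
    errs = [relative_reconstruction_error(g, M, rng) for _ in range(trials)]
    rms_error[M] = float(np.sqrt(np.mean(np.square(errs))))
    predicted = np.sqrt((d - 1) / M)
    print(f"M={M:6d}  rms error={rms_error[M]:.4f}  sqrt((d-1)/M)={predicted:.4f}")
```

Each tenfold increase in $M$ should shrink the error by roughly $\sqrt{10}$, which is the sense in which a larger fleet permits a smaller $\beta$.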

4. Numerical Results

The authors evaluated SCALARFEDLQR against standard FedLQR using simulated heterogeneous LTI systems ( $n_x=3, n_u=3$ ).

Performance vs. Rounds: When measured by communication rounds, SCALARFEDLQR achieves performance comparable to FedLQR, demonstrating that scalar aggregation preserves the essential learning dynamics.
Performance vs. Communication Cost:
- Under a fixed bit budget (e.g., $6 \times 10^5$ bits), SCALARFEDLQR significantly outperforms FedLQR.
- Low Heterogeneity: SCALARFEDLQR achieved 54.2% cost recovery vs. 29.1% for FedLQR.
- High Heterogeneity: SCALARFEDLQR achieved 30.7% recovery vs. 13.6% for FedLQR.
Conclusion: The method drastically reduces communication overhead while maintaining robustness to system heterogeneity.
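A back-of-the-envelope calculation shows why a fixed bit budget favors scalar uploads. The fleet size ($M = 10$), the 32-bit floats, and the 32-bit seed below are assumptions made for illustration; the summary above does not state the paper's exact bit accounting. Only $d = n_u \times n_x = 9$ and the $6 \times 10^5$-bit budget come from the text.

```python
def uplink_bits_fedlqr(num_agents, d, bits_per_float=32):
    """Bits per round when every agent uploads a full d-dimensional gradient."""
    return num_agents * d * bits_per_float

def uplink_bits_scalar(num_agents, bits_per_float=32, bits_per_seed=32):
    """Bits per round when every agent uploads one scalar and one seed."""
    return num_agents * (bits_per_float + bits_per_seed)

def rounds_within_budget(budget_bits, bits_per_round):
    """Number of complete rounds that fit in the budget."""
    return budget_bits // bits_per_round

budget = 600_000          # 6e5 bits, as in the experiment
M, d = 10, 3 * 3          # M is hypothetical; d = n_u * n_x = 9

fed_rounds = rounds_within_budget(budget, uplink_bits_fedlqr(M, d))
scalar_rounds = rounds_within_budget(budget, uplink_bits_scalar(M))
print(fed_rounds, scalar_rounds)  # 208 937
```

Under these assumptions the scalar scheme fits about 4.5 times as many rounds into the same budget, and the ratio grows linearly with $d$ because the scalar message size does not depend on it.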

5. Significance

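Scalability: Eases the dimension-driven communication bottleneck in federated control by making per-agent communication cost independent of the policy dimension ( $d$ ). The reconstruction error still grows with $d$ (roughly as $\sqrt{d/M}$ ), but a larger fleet offsets it.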
Privacy: The scalar projection mechanism inherently provides a degree of structural privacy, as full gradient matrices (which could reveal local dynamics) are never transmitted.
Practicality: Addresses the physical cost of data collection in real-world systems (e.g., battery usage in drones) by enabling efficient learning with minimal data transmission.
Theoretical Novelty: Establishes that in federated ZO-LQR, increasing the fleet size not only reduces sampling burden but also improves gradient recovery accuracy and convergence speed, a counter-intuitive benefit of large-scale collaboration.

In summary, SCALARFEDLQR offers a theoretically grounded, communication-efficient framework for deploying model-free control in large-scale, heterogeneous multi-agent systems, ensuring stability and fast convergence while minimizing bandwidth usage.
