Quantized Online LQR

The Big Picture: A Remote Pilot and a Local Co-Pilot

Imagine you are flying a very complex airplane (like a Boeing 747), but you are doing it from a control tower miles away. You can't see the plane directly; you only get tiny, blurry snapshots of its position sent to you over a very slow, narrow radio channel.

The Problem:
In the past, to control the plane, the pilot in the tower would have to send a command ("Turn left 5 degrees") every single second. Because the radio is slow, this takes up a lot of bandwidth. Also, because the radio is fuzzy, the commands get distorted, making the plane wobble and fly inefficiently. If the plane's physics change (e.g., it gets heavier or the wind changes), the pilot has no idea how to adjust because they don't know the plane's current "personality."

The New Idea:
This paper proposes a smarter way to fly. Instead of sending tiny, blurry snapshots of the plane's position every second, the plane (the "plant") does the heavy lifting locally.

The Plane's Job: The plane has a super-computer on board. It watches itself, figures out exactly how it's flying, and learns its own physics (how it responds to the rudder, the engines, etc.).
The Tower's Job: The tower knows the goal (fly efficiently, save fuel, stay safe) but doesn't know the plane's current physics.
The Handshake: The plane sends a summary of what it learned about its own physics to the tower. The tower uses this summary to calculate the perfect flight plan (the "policy") and sends that plan back to the plane.
The Execution: The plane executes the plan itself. Since the plane knows its exact position and the plan, it flies perfectly.

The Catch: The summary the plane sends must be tiny because the radio is slow. The paper asks: How small can we make this summary while still flying perfectly?

The Core Discovery: "Logarithmic" vs. "Linear"

The authors discovered a fundamental rule about how much data you need to send to learn and control a system.

The Old Way (Sending Raw Data): If you try to send the plane's position every second, the total amount of data grows linearly with time. It's like trying to describe a movie by sending a photo of every single frame. The file size gets huge.
The New Way (Sending "Updates"): The paper proves that you only need to send the changes in what you've learned.
- Imagine you are learning a new language. At first, you make many mistakes and need to send long explanations. But as you get better, you only need to send tiny corrections ("No, it's this word, not that one").
- The paper shows that the total amount of data needed to control the plane near-optimally grows only logarithmically with flight time.
- Analogy: If you fly for 100 hours, you might need 100 bits of data. If you fly for 10,000 hours, you don't need 10,000 bits; under logarithmic growth, roughly 200 would do. The "cost" of learning slows down drastically.
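The arithmetic behind this analogy can be checked with a toy model. The sketch below assumes, purely for illustration, that the bit budget is proportional to the logarithm of the flight time, calibrated so that 100 hours costs 100 bits. The function names and the calibration are hypothetical and are not taken from the paper.

```python
import math

def log_budget(hours, ref_hours=100, ref_bits=100):
    """Toy logarithmic budget, calibrated so ref_hours costs ref_bits (illustrative only)."""
    return ref_bits * math.log(hours) / math.log(ref_hours)

def linear_budget(hours, bits_per_hour=1):
    """Budget that grows in proportion to flight time."""
    return bits_per_hour * hours

# Compare the two growth laws at increasing flight times.
for hours in (100, 10_000, 1_000_000):
    print(hours, round(log_budget(hours)), linear_budget(hours))
```

Multiplying the flight time by 100 adds a fixed increment to the logarithmic budget, while the linear budget is multiplied by 100.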

The Secret Sauce: The "Smart Ruler"

The hardest part of this paper is the math behind how to compress the data.

The authors realized that learning a system isn't uniform. Some parts of the system are easy to figure out quickly (like the weight of the plane), while others are tricky and take a long time to learn (like how the wind affects the tail).

The Mistake of the "One-Size-Fits-All" Ruler: If you use a standard ruler to measure everything, you have to make the ruler very precise to catch the tricky parts. This wastes space on the easy parts.
The "Smart Ruler" (Two-Scale Quantization): The authors invented a special measuring tool that has two speeds:
1. Fast Speed: For the easy-to-learn parts, it sends tiny, quick updates.
2. Slow Speed: For the tricky parts, it sends slightly larger, slower updates.
By mixing these two speeds, they ensure they never send too much data, but they never miss a critical detail. This allows the plane to learn perfectly without clogging the radio.

The "Safety Net"

There is a risk: What if the plane sends a summary that is slightly wrong, and the tower calculates a flight plan that crashes the plane?

The paper includes a "Safety Net" phase:

The Burn-in: At the very start, the plane uses a simple, safe, pre-programmed flight mode (like a training wheels mode) while it gathers data.
The Trigger: Once the plane is statistically confident that it understands its own physics, it flips a switch. It sends a "Safe" signal to the tower.
The Handoff: The tower then starts sending the complex, optimized flight plans. If the plane ever starts to wobble too much, it instantly reverts to the simple "training wheels" mode.

Why This Matters

This research solves a major problem for the future of technology: The Internet of Things (IoT) and Autonomous Systems.

Battery Life: Drones, satellites, and self-driving cars often run on batteries. Sending huge amounts of data drains batteries fast. This method saves energy.
Bandwidth: In remote areas (like deep oceans or space), internet is slow. This method allows complex robots to work perfectly even with terrible internet connections.
Privacy: Instead of sending raw video or sensor data (which might reveal secrets), the robot only sends a mathematical summary of what it learned.

Summary in One Sentence

The paper proves that a robot can learn to control itself near-optimally while talking to a remote brain using only a small, slowly growing amount of data: it sends "updates on what it learned" rather than "raw snapshots of the world," and uses a two-speed compression trick to stay safe and efficient.

1. Problem Statement

The paper addresses the Online Linear-Quadratic Regulation (LQR) problem with unknown system dynamics under communication rate constraints.

Context: In standard online LQR, a plant observes its state and interacts with a controller to minimize a quadratic cost function. The system dynamics ( $x_{t+1} = Ax_t + Bu_t + w_t$ ) are initially unknown and must be learned online.
The Bottleneck: Classical networked control schemes typically quantize the raw plant state $x_t$ at every time step. This requires $O(T)$ total bits over a horizon $T$ and injects persistent quantization noise, which fundamentally limits control performance (regret).
The Setting: The authors propose a specific architecture with information asymmetry:
- The Plant: Has direct access to the state $x_t$ and can compute Ordinary Least Squares (OLS) estimates of the dynamics ( $\hat{A}, \hat{B}$ ). It has limited uplink bandwidth.
- The Controller: Has access to the cost matrices ( $R_x, R_u$ ) but not the state. It has an unconstrained downlink.
The Goal: Instead of transmitting raw states, the plant transmits learned dynamics estimates to the controller. The controller computes the optimal policy $K_t$ and sends it back. The plant then applies the control locally. The objective is to achieve the optimal regret scaling of $\tilde{O}(\sqrt{T})$ while minimizing the total uplink communication bits $B(T)$ .
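The division of labor described above can be sketched for a scalar plant. This is a minimal illustration under stated assumptions, not the paper's implementation: the plant parameters, the stabilizing gain `K0`, the unit noise levels, and the function names are all hypothetical, and `rx`, `ru` stand in for scalar versions of $R_x$, $R_u$. The plant side fits $(\hat{A}, \hat{B})$ by OLS, and the controller side plugs the estimate into a standard Riccati iteration to obtain a certainty-equivalent gain.

```python
import random

def simulate_and_fit(a, b, k0, steps, seed=0):
    """Plant side: run x' = a*x + b*u + w with u = k0*x + exploration noise,
    then return the OLS estimate (a_hat, b_hat)."""
    rng = random.Random(seed)
    x = 0.0
    sxx = sxu = suu = sxy = suy = 0.0      # sums for the 2x2 normal equations
    for _ in range(steps):
        u = k0 * x + rng.gauss(0.0, 1.0)   # stabilizing feedback plus exploration
        x_next = a * x + b * u + rng.gauss(0.0, 1.0)
        sxx += x * x
        sxu += x * u
        suu += u * u
        sxy += x * x_next
        suy += u * x_next
        x = x_next
    det = sxx * suu - sxu * sxu
    a_hat = (suu * sxy - sxu * suy) / det
    b_hat = (sxx * suy - sxu * sxy) / det
    return a_hat, b_hat

def ce_gain(a, b, rx=1.0, ru=1.0, iters=500):
    """Controller side: scalar Riccati iteration, returning k for the policy u = k*x."""
    p = rx
    for _ in range(iters):
        p = rx + a * a * p - (a * b * p) ** 2 / (ru + b * b * p)
    return -(a * b * p) / (ru + b * b * p)

A_TRUE, B_TRUE, K0 = 1.2, 1.0, -0.7        # hypothetical unstable plant, stabilizing K0
a_hat, b_hat = simulate_and_fit(A_TRUE, B_TRUE, K0, steps=2000)
k = ce_gain(a_hat, b_hat)
print(a_hat, b_hat, k, abs(A_TRUE + B_TRUE * k))
```

Only `a_hat` and `b_hat` would need to cross the uplink in this picture; the state `x` never leaves the plant.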

2. Methodology

The paper introduces the Quantized Certainty-Equivalent LQR (QCE-LQR) algorithm. The development has three parts:

A. Information-Theoretic Lower Bound (Converse)

The authors first prove a fundamental limit on communication. They show that to achieve a regret of $O(T^\alpha)$ for $\alpha \in [1/2, 1)$ , the system must transmit at least $\Omega(\log T)$ bits.

This implies that even with perfect knowledge of dynamics, the communication cost cannot be constant; it must grow logarithmically with the horizon to maintain sub-linear regret.
Combined with the matching $O(\log T)$ upper bound achieved by QCE-LQR, this establishes that a logarithmic communication budget is both necessary and sufficient for optimal performance.

B. The QCE-LQR Algorithm

The algorithm operates in doubling epochs ( $k=1, 2, \dots$ ) and consists of two phases:

Pre-Safe Phase (Burn-in):
- The plant uses a known stabilizing controller $K_0$ with exploration noise.
- It computes OLS estimates of the dynamics.
- Once the estimates satisfy a statistical safety condition (based on confidence bounds), the system transitions to the "safe" phase.
- Initialization: The plant transmits the initial OLS estimate using Elias Gamma coding (absolute initialization) to establish a shared baseline model between the plant and controller.
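Elias gamma coding is a standard prefix-free code for positive integers of unknown size, which suits an initial estimate whose magnitude is not known in advance. The sketch below shows the code itself. How the paper maps real-valued estimates to integers is not stated here, so the grid step, the zigzag mapping for signed values, and the example entries are assumptions made for illustration.

```python
def elias_gamma_encode(n):
    """Encode a positive integer as an Elias gamma bit string."""
    if n < 1:
        raise ValueError("Elias gamma is defined for integers >= 1")
    binary = bin(n)[2:]                       # binary digits, leading 1 included
    return "0" * (len(binary) - 1) + binary   # unary length prefix, then the digits

def elias_gamma_decode(bits):
    """Decode one codeword from the front of a bit string; return (value, rest)."""
    zeros = 0
    while bits[zeros] == "0":
        zeros += 1
    end = 2 * zeros + 1
    return int(bits[zeros:end], 2), bits[end:]

def to_positive(z):
    """Zigzag map (an assumption): 0->1, -1->2, 1->3, -2->4, ..."""
    return 2 * z + 1 if z >= 0 else -2 * z

def from_positive(n):
    """Inverse of to_positive."""
    return (n - 1) // 2 if n % 2 else -(n // 2)

step = 0.01                      # hypothetical grid step known to both sides
estimate = [1.23, -0.48]         # hypothetical entries of the initial OLS estimate
indices = [round(v / step) for v in estimate]
message = "".join(elias_gamma_encode(to_positive(i)) for i in indices)

decoded, rest = [], message
while rest:
    value, rest = elias_gamma_decode(rest)
    decoded.append(from_positive(value))
print(message, decoded)
```

A codeword for $n$ uses $2\lfloor \log_2 n \rfloor + 1$ bits, so the cost of the absolute initialization grows only logarithmically with the magnitude of the quantized entries.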
Post-Safe Phase (Tracking):
- Differential Quantization: Instead of sending full estimates, the plant sends the innovation (difference) between the current OLS estimate and the previously shared estimate.
- Two-Scale Adaptive Quantization: This is the core technical contribution. The authors observe that OLS estimation error decays at two different rates depending on the parameter subspace:
  - Slow Rate ( $\tau^{-1/4}$ ): For the $d_x d_u$-dimensional subspace.
  - Fast Rate ( $\tau^{-1/2}$ ): For the $d_x^2$-dimensional subspace.
- A standard single-scale quantizer would be forced to track the slow rate, inflating the regret. QCE-LQR uses a mixed-scale schedule ( $s_k = c_{slow}\tau^{-1/4} + c_{fast}\tau^{-1/2}$ ) to match these decay rates.
- Adaptive Multiplier: To handle transient errors before the asymptotic rates kick in, the algorithm uses a dynamic multiplier $m_k$ (encoded via Elias Gamma) to expand the quantization radius temporarily, preventing overflow without increasing the asymptotic bit rate.
- Safety Projection: The controller projects the received quantized estimates onto a "safe set" to guarantee closed-loop stability before computing the new policy $K_t$ .
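The tracking steps above can be sketched for a single scalar parameter. This is a toy under stated assumptions rather than the QCE-LQR protocol: the uniform codebook, the rule for choosing the multiplier, the constants, and the synthetic "OLS estimate" are all hypothetical, and the subspace-dependent rates and the safety projection are omitted. It only illustrates how differential quantization with the mixed-scale schedule keeps both copies of the estimate in sync at a fixed number of bits per doubling epoch.

```python
import math

def scale(tau, c_slow=1.0, c_fast=1.0):
    """Mixed-scale schedule s = c_slow * tau^(-1/4) + c_fast * tau^(-1/2)."""
    return c_slow * tau ** -0.25 + c_fast * tau ** -0.5

def gamma_len(m):
    """Length in bits of the Elias gamma codeword for a positive integer m."""
    return 2 * (m.bit_length() - 1) + 1

def encode(estimate, shared, tau, rho=0.25):
    """Plant side: quantize the innovation; return (multiplier, index, bits used)."""
    s = scale(tau)
    innovation = estimate - shared
    m = max(1, math.ceil(abs(innovation) / s))   # widen the radius only on overflow
    step = rho * m * s                           # rho plays the role of the resolution
    index = round(innovation / step)             # lies in [-ceil(1/rho), ceil(1/rho)]
    levels = math.ceil(1 / rho)
    index_bits = math.ceil(math.log2(2 * levels + 1))
    return m, index, gamma_len(m) + index_bits

def decode(shared, m, index, tau, rho=0.25):
    """Either side: update the shared estimate from the transmitted symbols."""
    return shared + index * rho * m * scale(tau)

theta = 0.8                        # hypothetical true parameter
plant_copy = controller_copy = 0.0 # stands in for the shared baseline
total_bits = epochs = 0
tau = 1
while tau <= 10_000:               # doubling epochs: tau = 1, 2, 4, ...
    estimate = theta + 0.5 * tau ** -0.5         # synthetic stand-in for an OLS estimate
    m, index, bits = encode(estimate, plant_copy, tau)
    controller_copy = decode(controller_copy, m, index, tau)
    plant_copy = decode(plant_copy, m, index, tau)  # plant mirrors the controller
    total_bits += bits
    epochs += 1
    tau *= 2
print(epochs, total_bits, abs(controller_copy - estimate))
```

Because the number of doubling epochs up to horizon $T$ is about $\log_2 T$, and each epoch here costs a bounded number of bits, the total in this toy grows logarithmically in $T$.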

C. Regret Analysis

The authors prove that the quantization error introduces "inflation factors" ( $Q_{slow}$ and $Q_{fast}$ ) into the regret bound. As the codebook resolution $\varrho \to 0$ , these factors vanish, recovering the unquantized $\tilde{O}(\sqrt{T})$ regret. Crucially, the two-scale design ensures that the dominant dimension term ( $d_x d_u$ ) scales with $\sqrt{T}$ , while the secondary term ( $d_x^2$ ) is quarantined into the lower-order $\log T$ term.

3. Key Contributions

Fundamental Lower Bound: Proved that $\Omega(\log T)$ bits are necessary to achieve sub-linear regret in online LQR, even if the system knows the true dynamics. This sets a theoretical floor for communication-constrained adaptive control.
QCE-LQR Algorithm: Designed an algorithm that achieves $\tilde{O}(\sqrt{T})$ regret using only $O(\log T)$ total bits. It is the first scheme to match the optimal regret scaling under such strict rate constraints.
Two-Scale Quantization: Introduced a novel adaptive quantization protocol that accounts for the anisotropic convergence rates of OLS estimators. This prevents the "slow" estimation errors from dominating the communication cost and regret.
Explicit Trade-off: Derived a precise regret bound showing how quantization resolution ( $\varrho$ ) affects performance. The bound includes inflation factors that vanish as resolution increases, providing a smooth transition from quantized to unquantized performance.

4. Results

The authors validated their theory through numerical experiments on four benchmark systems:

Scalar Unstable Plant ( $d_s=2$ )
Double Integrator ( $d_s=6$ )
Inverted Pendulum ( $d_s=6$ )
Boeing 747 Lateral Model ( $d_s=24$ )
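The listed values of $d_s$ are consistent with counting the unknown entries of $A$ and $B$, that is $d_s = d_x^2 + d_x d_u$. This reading, and the state and input dimensions used below, are assumptions inferred from the numbers rather than stated in the text.

```python
def parameter_count(dx, du):
    """Number of unknown entries in A (dx by dx) and B (dx by du)."""
    return dx * dx + dx * du

# Assumed (dx, du) for each benchmark; not given in the summary.
ASSUMED_DIMS = {
    "Scalar Unstable Plant": (1, 1),
    "Double Integrator": (2, 1),
    "Inverted Pendulum": (2, 1),
    "Boeing 747 Lateral Model": (4, 2),
}

counts = {name: parameter_count(dx, du) for name, (dx, du) in ASSUMED_DIMS.items()}
for name, ds in counts.items():
    print(name, ds)
```

Under these assumed dimensions the counts reproduce the listed values 2, 6, 6, and 24.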

Key Findings:

Regret Performance: Over a horizon of $T = 10{,}000$ , the Practical QCE-LQR achieved regret comparable to (and in some cases better than) the unquantized Certainty Equivalent controller.
Communication Efficiency: The total bits transmitted scaled logarithmically with time and linearly with system dimension ( $d_s$ ).
- Scalar system: ~123 bits.
- Boeing 747: ~819 bits.
Overhead: The overhead from quantization itself was negligible. For the Boeing 747, regret was ~27.5% higher than the unquantized baseline in some trials, but this gap stems primarily from the delayed "safe trigger" in high-dimensional spaces rather than from quantization.
Bit Growth: The total bits grew as $\Theta(\log T)$ , confirming the theoretical prediction and breaking the $O(T)$ barrier of classical state quantization.
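To put the reported totals in perspective, they can be compared with a naive state-quantization baseline. The baseline below is an assumption for illustration only: it sends every state coordinate at every step with a fixed 8 bits per coordinate, and the state dimensions (1 and 4) are inferred rather than stated. Only the totals of 123 and 819 bits come from the results above.

```python
def state_quantization_bits(horizon, dx, bits_per_coord=8):
    """Bits used by a naive scheme that sends the full state at every step."""
    return horizon * dx * bits_per_coord

T = 10_000
# name: (assumed state dimension, reported total uplink bits)
REPORTED = {"scalar": (1, 123), "boeing747": (4, 819)}

savings = {}
for name, (dx, bits) in REPORTED.items():
    baseline = state_quantization_bits(T, dx)
    savings[name] = baseline / bits
    print(name, baseline, bits, round(savings[name]))
```

Under these assumptions the baseline uses several hundred times more bits than the reported totals, and the gap widens as $T$ grows because the baseline is linear in $T$.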

5. Significance

This work bridges a critical gap between adaptive control theory and networked control systems.

Theoretical Impact: It resolves the question of how much communication is strictly necessary for optimal learning in control. It proves that learning dynamics is far more communication-efficient than transmitting states.
Practical Impact: The results are highly relevant for IoT, Edge Computing, and Cloud Robotics, where devices have limited battery power (uplink) but can leverage powerful cloud servers (downlink). The proposed method allows these devices to learn complex dynamics and optimize control policies while transmitting only a tiny fraction of the data required by traditional methods.
Scalability: Because the total bit count grows only linearly in the number of parameters $d_s$ and logarithmically in $T$ , the communication burden stays manageable as system dimension increases, which makes the approach plausible for large-scale systems such as aircraft or power grids.

In summary, the paper demonstrates that transmitting learned models rather than raw states, combined with adaptive multi-scale quantization, allows for optimal control performance with minimal communication, fundamentally changing the design paradigm for rate-limited adaptive control systems.