Imagine you are trying to teach a robot to drive a car, but the robot has a very strange problem: it can't see the road directly. Instead, it only sees a noisy, distorted, extremely high-resolution video feed from a camera that captures everything—the road, the trees, the clouds, and even the birds flying by.
The robot needs to figure out two things to drive safely:
- Where is the car actually going? (The "State")
- What should I do next to avoid crashing? (The "Control")
This paper is about teaching the robot how to build a mental map (a "latent model") of the world from that blurry video feed, so it can drive perfectly without needing to know the exact physics of the car or the road beforehand.
Here is the breakdown of their solution, explained with everyday analogies.
1. The Problem: The "Blind" Driver
In the real world, robots often get overwhelmed by too much data. If you try to teach a robot to drive by showing it every single pixel of the video, it gets confused by irrelevant details (like a bird flying by).
- The Old Way (Reconstruction): Previous methods tried to teach the robot to rebuild the video perfectly. "If I see a tree here, I should be able to draw that tree."
- The Flaw: The robot wastes energy learning about the birds and the clouds, which don't help it drive. It's like studying the entire encyclopedia just to learn how to change a tire.
- The New Way (Cost-Driven): This paper suggests a smarter approach: Don't try to see the world; try to predict the score.
- The Analogy: Imagine playing a video game. You don't need to know the texture of every tree to win; you just need to know: "If I turn left, I get 10 points. If I hit a wall, I lose 100 points."
- The robot learns a mental model by asking: "What sequence of actions will lead to the best score (lowest cost)?" It ignores the birds and focuses only on the things that affect the score.
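The contrast between the two objectives can be shown on a toy problem. This is a minimal sketch (not the paper's actual algorithm): the observation has one coordinate that determines the cost (the "road") and one pure distractor (the "bird"), and a cost-driven fit simply learns to ignore the distractor. All names and numbers here are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "observations": first coordinate is the true state (it drives the cost),
# second is an irrelevant distractor (a bird flying by).
n = 500
state = rng.normal(size=n)
bird = rng.normal(size=n)
obs = np.stack([state, bird], axis=1)

# The cost depends only on the state, never on the bird.
cost = state ** 2

# Cost-driven objective: fit weights that predict the cost from the
# observation features. (A reconstruction objective would instead have
# to model the bird coordinate too, wasting capacity on it.)
features = obs ** 2
w, *_ = np.linalg.lstsq(features, cost, rcond=None)

print(np.round(w, 3))  # weight on the state feature, weight on the bird feature
```

The fitted weight on the distractor comes out essentially zero: predicting the score automatically discards what the score does not depend on.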
2. The Two Methods: "Explicit" vs. "Implicit"
The authors test two different ways to teach the robot this "score-predicting" skill.
Method A: The "Explicit" Map Maker (CoReL-E)
This method is like a student who draws a map step-by-step.
- The robot watches the video and guesses the car's position.
- It then tries to predict: "If I am here and I turn the wheel, where will I be next?"
- It checks its prediction against reality. If it's wrong, it fixes the map.
- The Result: It builds a very clear, step-by-step understanding of how the car moves.
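The "check the prediction against reality, fix the map" loop is, at its core, regression on observed transitions. Here is a minimal sketch of that idea for a toy linear system (the matrices, noise level, and horizon are invented for illustration; the paper's setting is more general):

```python
import numpy as np

rng = np.random.default_rng(1)

# True (unknown to the robot) dynamics: x_next = A x + B u + noise.
A_true = np.array([[0.9, 0.1],
                   [0.0, 0.8]])
B_true = np.array([[0.0],
                   [1.0]])

T = 2000
x = np.zeros(2)
X, U, Xn = [], [], []
for _ in range(T):
    u = rng.normal(size=1)                               # exploratory "steering"
    x_next = A_true @ x + B_true @ u + 0.01 * rng.normal(size=2)
    X.append(x); U.append(u); Xn.append(x_next)
    x = x_next

# Explicit map-making: least-squares fit of [A B] from the transitions.
Z = np.hstack([np.array(X), np.array(U)])                # regressors [x_t, u_t]
theta, *_ = np.linalg.lstsq(Z, np.array(Xn), rcond=None)
A_hat, B_hat = theta[:2].T, theta[2:].T

print(np.round(A_hat, 2))
print(np.round(B_hat, 2))
```

With enough varied driving, the recovered map matches the true dynamics almost exactly.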
Method B: The "Implicit" Dreamer (CoReL-I / MuZero Style)
This method is inspired by MuZero, DeepMind's AI that mastered Chess, Go, and Atari games without ever being told their rules. This is the "cool" method.
- The robot doesn't try to draw a map of "where I am."
- Instead, it plays a game of "What if?" in its head. It asks: "If I do this action, what will my score be in the future?"
- It learns the rules of the game (how the car moves) purely by trying to predict the future score accurately.
- The Analogy: Think of a chess grandmaster. They don't necessarily visualize the exact coordinates of every piece on a grid. They have a "feeling" for the board state that predicts who will win. They learn the consequences of moves, not just the geometry of the board.
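The "what if?" game can also be made concrete with a toy sketch. This is an illustration of the implicit idea only, not the paper's algorithm: the robot never sees the hidden state, only the costs, and it selects the internal model whose rolled-out cost predictions best match the observed future costs. All quantities here are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hidden scalar dynamics x_next = a * x, with cost c = x**2.
a_true = 0.7
x0 = rng.normal(size=200)
# Observed costs 1, 2, and 3 steps into the future.
future_costs = [(a_true ** k * x0) ** 2 for k in range(1, 4)]

def cost_prediction_error(a_hat):
    """How badly a candidate internal model predicts future scores."""
    err = 0.0
    for k in range(1, 4):
        pred = (a_hat ** k * x0) ** 2
        err += np.sum((pred - future_costs[k - 1]) ** 2)
    return err

# Pick the model that predicts the score best. (Note: -0.7 would predict
# the same costs — a hint of the coordinate ambiguity discussed next.)
candidates = np.linspace(0.0, 1.0, 1001)
a_hat = min(candidates, key=cost_prediction_error)
print(a_hat)
```

The dynamics parameter is recovered purely from score prediction, with no state labels at all.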
3. The Big Challenge: The "Coordinate Misalignment"
Here is the tricky part the authors discovered.
Imagine you are teaching a robot to recognize a "red ball."
- Scenario 1: You show it a red ball. It learns "Red = Ball."
- Scenario 2: You show it the same ball, but rotated 90 degrees. It learns "Red (rotated) = Ball."
In the "Implicit" method (Method B), the robot becomes great at predicting the score, but its internal picture of the car's state may be a rotated or stretched version of the true one. It thinks in a different "coordinate system" than the one you are using.
- The Metaphor: It's like two people speaking the same language but using different dialects. They can understand the meaning (the cost/score), but they can't agree on the direction (the specific state coordinates).
- The Fix: The authors invented a mathematical "translator" (an alignment matrix) to make sure the robot's internal map matches the real world's map, even if it learned the rules implicitly.
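The "translator" idea can be sketched in a few lines. This is a simplified stand-in for the paper's alignment construction: if the robot's internal states are the true states expressed in a different basis, an invertible matrix fit by least squares translates between the two dialects. The specific matrix below is a made-up example.

```python
import numpy as np

rng = np.random.default_rng(3)

# The robot's learned latent states z equal the true states x
# expressed in a different basis: z = S_true @ x (the "dialect").
S_true = np.array([[0.0, 1.0],
                   [-1.0, 0.5]])          # hypothetical, invertible
X = rng.normal(size=(500, 2))             # true-coordinate states
Z = X @ S_true.T                          # robot's internal states

# The "translator": least-squares fit of S such that z ≈ S x.
M, *_ = np.linalg.lstsq(X, Z, rcond=None)
S_hat = M.T

# Translating back recovers the true coordinates exactly.
X_recovered = Z @ np.linalg.inv(S_hat).T
print(np.round(S_hat, 3))
```

The meaning (the scores) was never in question; the alignment matrix just fixes the directions.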
4. The "Magic" Ingredient: Persistence of Excitation
To prove their math works, the authors had to solve a problem about data correlation.
- The Problem: If you watch a car drive for 10 seconds, the view at second 5 is very similar to the view at second 6. In math, this is called "correlated data." Usually, math hates correlated data because it makes it hard to learn new things.
- The Solution: The authors proved that even though the data is correlated, if you wait long enough and look at the "bumps" in the data (the noise), the robot eventually sees enough variety to learn the rules perfectly.
- The Analogy: Imagine trying to learn the wind pattern by watching a single leaf flutter. At first, the leaf just flutters randomly. But if you watch it for a long time, you start to see the pattern of the wind, even though every second looks slightly different. They proved mathematically that the robot will eventually "see" the wind pattern clearly enough to drive perfectly.
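Persistence of excitation has a simple numerical signature: even though consecutive samples are correlated, the accumulated "information matrix" of the data keeps growing in every direction. The following sketch (toy system and numbers invented for illustration) checks that its smallest eigenvalue, divided by time, stays bounded away from zero:

```python
import numpy as np

rng = np.random.default_rng(4)

# Correlated data: each sample is a slow drift of the last one plus
# fresh noise — the fluttering leaf.
A = np.array([[0.95, 0.1],
              [0.0, 0.9]])
T = 5000
x = np.zeros(2)
G = np.zeros((2, 2))          # Gram (information) matrix of the data
min_eigs = []
for t in range(1, T + 1):
    x = A @ x + rng.normal(size=2)   # the noise keeps exciting the system
    G += np.outer(x, x)
    if t in (100, 1000, 5000):
        min_eigs.append(np.linalg.eigvalsh(G)[0] / t)

# If this ratio stays away from zero, the data is "persistently exciting":
# the wind pattern eventually shows in every direction.
print(np.round(min_eigs, 2))
```

Despite the step-to-step correlation, every direction of the state space keeps accumulating information at a steady rate, which is exactly what the learning guarantee needs.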
5. The Bottom Line
This paper proves that you don't need to know the physics of the world to control it.
If you have a robot that can:
- Watch a video feed.
- Predict the "score" (cost) of future actions.
- Ignore the irrelevant background noise.
...then you can mathematically guarantee that the robot will learn to drive (or control any system) almost as well as an expert, even if it starts with zero knowledge of the system.
In short: They took a complex, high-dimensional problem (controlling a robot with a blurry camera) and showed that a simple strategy—"Predict the future score, and the rest will follow"—is not just a good guess, but a mathematically proven path to success.