Personalized Multi-Agent Average Reward TD-Learning via Joint Linear Approximation

Imagine a world where you have a team of robot vacuum cleaners. Each robot lives in a different house.

Robot A lives in a tiny, cluttered apartment with lots of furniture and cats running around.
Robot B lives in a huge, empty mansion with long hallways and no obstacles.
Robot C lives in a house with a weird, circular floor plan.

If each robot tries to learn how to clean its house alone, it will take a very long time. It has to bump into every wall and learn every corner from scratch.

If they all try to learn exactly the same way (ignoring their unique houses), they will fail. Robot A will get confused by the empty mansion, and Robot B will get lost in the clutter.

This paper proposes a "Goldilocks" solution: Personalized Team Learning.

The Big Idea: "The Shared Brain and Local Hands"

The authors suggest that while every house is different, there is a common underlying structure to how cleaning works. Maybe it's the basic physics of how a vacuum moves, or the general concept of "avoiding walls."

They propose a system where the robots share a "Shared Brain" (a common set of rules) but keep their own "Local Hands" (specific adjustments for their unique house).

The Shared Brain (Subspace): This is the part they all agree on. It learns the general "vibe" of cleaning. It's like a universal language of movement.
The Local Hands (Heads): This is the part that is unique to each robot. It learns the specific quirks of Robot A's cat or Robot B's long hallway.

How They Do It: The "Group Study Session"

The paper introduces an algorithm called PMAAR-TD. Think of it as a group study session for these robots:

The Problem: Usually, when robots share information, they get confused by "noise." Robot A's data about cats might mess up Robot B's learning about empty halls. This is called "misaligned signals."
The Solution: The algorithm uses a clever trick called Joint Linear Approximation.
- Imagine the robots are trying to draw a map.
- Instead of drawing the whole map from scratch every time, they agree to draw a skeleton (the shared subspace) together.
- Then, each robot just adds its own flesh and clothing (the local head) on top of that skeleton.
- By separating the "skeleton" from the "clothing," they can filter out the noise. If Robot A's data looks weird, the system knows it's just a "clothing" issue, not a "skeleton" issue.

Why This is a Big Deal (The Magic)

The paper proves mathematically that this method is efficient, in two specific ways.

Linear Speedup: If you have 10 robots, each one needs roughly a tenth of the experience that a robot working alone would need to reach the same accuracy. It's like having 10 people solve a puzzle together; they finish about 10 times quicker because they share the pieces they've already found.
Single-Timescale: Usually, in these systems, you have to wait for the "Shared Brain" to settle down before you can update the "Local Hands." This is slow. The authors' method updates both at the same speed, like a synchronized dance, making it much faster and more stable.

The "Secret Sauce" (Technical Metaphors)

The paper mentions some tricky math, but here's the simple version:

The "Principal Angle" Problem: Imagine trying to align two different maps. If the maps are slightly rotated, it's hard to tell if the difference is because the terrain changed or just because the map is tilted. The authors developed a way to measure this "tilt" (principal angle) and correct it instantly, ensuring the robots don't get lost in their own confusion.
The "Noise" Filter: Each robot learns from one continuous run through its own house, so each observation depends on the one before it (Markovian sampling) instead of being a fresh, independent sample. That makes the data "noisy" (like static on a radio). The algorithm acts like a high-tech noise-canceling headphone, filtering out the static so the robots can hear the true signal of how to clean.

Real-World Impact

This isn't just about vacuum cleaners. This logic applies to:

Self-driving cars: Cars in New York (traffic, pedestrians) vs. cars in rural Texas (open roads, deer). They can share the "rules of the road" but keep their "driving style" for their specific city.
Personalized Medicine: Doctors can share general knowledge about how diseases work, but tailor the treatment plan to the specific genetics of each patient.
Recommendation Systems: Netflix can learn what movies are generally popular (the shared brain) but keep a specific profile for your weird taste in horror movies (the local head).

The Bottom Line

This paper solves the dilemma of "To share or not to share?" in a world where everyone is different.

Don't share? You learn too slowly.
Share everything? You get confused and learn the wrong things.
Share the structure, keep the details? You get the best of both worlds: fast learning and personalized accuracy.

The authors have built a mathematical framework that lets a team of diverse agents collaborate without losing their individuality, proving that together, they are not just smarter, but significantly faster.

Here is a detailed technical summary of the paper "Personalized Multi-Agent Average Reward TD-Learning via Joint Linear Approximation."

1. Problem Statement

The paper addresses the challenge of Personalized Multi-Agent Reinforcement Learning (MARL) in heterogeneous environments.

Context: A collection of $K$ agents operates in different local environments (e.g., different household layouts for robots or traffic patterns for autonomous vehicles). Each agent $k$ has its own Markov Decision Process (MDP) with a unique transition kernel $P^k$ and reward function, but they share a common policy $\pi$ .
Goal: Agents must collaboratively learn their respective average-reward value functions ($V^k$) and average rewards ($J^k$) without collapsing to a single "averaged" model that fits no individual agent well.
Core Assumption: While environments differ, the optimal weight vectors $\{z_{k,*}\}$ for the agents' value functions (under a shared linear feature representation $\phi$ ) lie within a low-dimensional linear subspace of dimension $r$ (where $r \ll d$ , and $d$ is the feature dimension).
Challenge: Standard multi-agent RL often fails here due to "misaligned" learning signals caused by environmental heterogeneity. Conversely, training agents independently ignores shared structures, leading to redundant computation and poor sample efficiency. The paper specifically tackles the average-reward setting (which is more challenging than discounted settings because there is no discount factor to make the Bellman operator a contraction) under Markovian sampling (non-i.i.d. data).
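
For reference, the average-reward quantities behind $J^k$ and $V^k$ can be written out as follows. These are the standard textbook definitions, given here as an assumption about the paper's setup and not quoted from it; $r^k$ denotes agent $k$'s reward function, $\eta_k$ its running estimate of $J^k$, and the TD error is shown in its one-step form.

$$J^k = \lim_{T \to \infty} \frac{1}{T}\, \mathbb{E}\left[\sum_{t=0}^{T-1} r^k(s_t, a_t)\right], \qquad V^k(s) = \mathbb{E}\left[\sum_{t=0}^{\infty} \left(r^k(s_t, a_t) - J^k\right) \,\middle|\, s_0 = s\right]$$

$$\delta_t^k = r^k(s_t, a_t) - \eta_k + V^k(s_{t+1}) - V^k(s_t)$$

Because rewards are centered by $J^k$ and not discounted, $V^k$ is only defined up to an additive constant, which is one reason the average-reward analysis is more delicate.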

2. Methodology: PMAAR-TD

The authors propose PMAAR-TD (Personalized Multi-Agent Average Reward TD-learning), a cooperative algorithm based on joint linear approximation.

Algorithmic Structure

The value function for agent $k$ is approximated as $V^k(s) \approx \phi(s)^\top B \omega_k$ , where:

$B \in \mathbb{R}^{d \times r}$ is a common subspace matrix shared by all agents.
$\omega_k \in \mathbb{R}^r$ is an agent-specific head (local parameter).

The algorithm operates in a single-timescale setting (unlike many prior works that use two-timescale updates), updating $B$ and $\omega_k$ simultaneously.
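
The factorization above can be illustrated with a minimal NumPy sketch. The dimensions, the random features, and the helper name `value_estimate` are hypothetical choices for illustration only; the sketch shows the shape of the parameterization, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, K = 8, 2, 5  # feature dim, subspace dim, number of agents (illustrative values)

# Shared orthonormal basis B (d x r), built here from a QR factorization.
B, _ = np.linalg.qr(rng.standard_normal((d, r)))

# One r-dimensional head per agent (row k is omega_k).
heads = rng.standard_normal((K, r))

def value_estimate(phi_s, B, omega_k):
    """Approximate V^k(s) as phi(s)^T B omega_k (hypothetical helper)."""
    return float(phi_s @ B @ omega_k)

phi_s = rng.standard_normal(d)  # hypothetical feature vector for a single state
values = [value_estimate(phi_s, B, heads[k]) for k in range(K)]

# Stacking the full weight vectors z_k = B omega_k gives a matrix of rank at most r.
Z = heads @ B.T  # shape (K, d)
rank_Z = np.linalg.matrix_rank(Z)

# Parameter counts: shared factorization vs. an independent d-dim vector per agent.
params_joint = d * r + K * r
params_independent = K * d
```

With these illustrative sizes the shared factorization stores 26 numbers where independent learning would store 40, and the gap widens as $K$ grows with $r \ll d$.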

Key Algorithmic Components

Local TD(L) Updates: Each agent performs $L$ steps of local TD updates to estimate the value function and average reward.
Joint Estimation: Agents update their local heads ($\omega_k$) and the global subspace ($B$) using the TD error $\delta_{t,L}^k$.
- Local Head Update: $\omega_k$ is updated via a perturbed stochastic gradient step, followed by a projection onto a convex ball to ensure boundedness.
- Subspace Update: The global matrix $B$ is updated using a projected innovation. Crucially, the update direction is projected onto the orthogonal complement of the current subspace ( $B_{t,\perp}$ ) to prevent the subspace from collapsing or drifting incorrectly.
QR Decomposition: After aggregating updates from all agents, a QR decomposition is applied to $B_{t+1}$ to enforce orthonormality. This structural constraint is vital for controlling the principal angle distance between the estimated subspace and the true optimal subspace.
Reward Estimation: Each agent independently updates its local average reward estimator $\eta_k$ .
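
The components above can be sketched as one simplified round in NumPy. This is an illustrative sketch under stated assumptions, not the authors' exact update rules: it uses a one-step TD error (L = 1), omits the perturbation in the head step, and the names `project_ball` and `pmaar_td_round`, the step sizes `alpha` and `beta`, and the ball `radius` are all hypothetical.

```python
import numpy as np

def project_ball(w, radius):
    """Project w onto the Euclidean ball of the given radius."""
    norm = np.linalg.norm(w)
    return w if norm <= radius else w * (radius / norm)

def pmaar_td_round(B, heads, etas, samples, alpha=0.05, beta=0.05, radius=10.0):
    """One simplified round; samples[k] = (phi_s, reward, phi_next) for agent k.

    Assumptions (not from the source): one-step TD error, no perturbation,
    a single step size alpha shared by heads and subspace.
    """
    d, _ = B.shape
    K = len(samples)
    P_perp = np.eye(d) - B @ B.T  # projector onto the orthogonal complement of span(B)
    new_heads = np.empty_like(heads)
    new_etas = np.empty_like(etas)
    B_step = np.zeros_like(B)
    for k, (phi_s, reward, phi_next) in enumerate(samples):
        w = heads[k]
        # Average-reward TD error with V^k(s) = phi(s)^T B w.
        delta = reward - etas[k] + phi_next @ B @ w - phi_s @ B @ w
        # Local head: semi-gradient step, then projection onto a ball.
        new_heads[k] = project_ball(w + alpha * delta * (B.T @ phi_s), radius)
        # Subspace: innovation projected onto the orthogonal complement of B.
        B_step += P_perp @ np.outer(delta * phi_s, w)
        # Local average-reward estimate.
        new_etas[k] = etas[k] + beta * (reward - etas[k])
    # Aggregate over agents, then restore orthonormal columns with QR.
    B_new, _ = np.linalg.qr(B + (alpha / K) * B_step)
    return B_new, new_heads, new_etas

# Usage on synthetic data (all values are made up for illustration).
rng = np.random.default_rng(0)
d, r, K = 6, 2, 4
B0, _ = np.linalg.qr(rng.standard_normal((d, r)))
heads0 = rng.standard_normal((K, r))
etas0 = np.zeros(K)
samples = [(rng.standard_normal(d), float(rng.standard_normal()), rng.standard_normal(d))
           for _ in range(K)]
B1, heads1, etas1 = pmaar_td_round(B0, heads0, etas0, samples)
```

Note that the heads and the subspace are updated in the same round with the same step size, which is the single-timescale idea; a two-timescale variant would use step sizes of different orders for the two updates.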

3. Key Contributions & Technical Innovations

A. Convergence Guarantees with Linear Speedup

The paper establishes finite-time convergence rates for the joint estimation errors of the subspace and local heads.

Rate: The error decays at a rate of $\tilde{O}\left(\frac{1}{\sqrt{TK}}\right)$ , where $T$ is the number of iterations and $K$ is the number of agents.
Significance: This demonstrates a linear speedup with respect to the number of agents, meaning the sample complexity per agent decreases as $1/K$.
Single-Timescale: Unlike previous works (e.g., Xiong et al., 2025) that rely on two-timescale dynamics (separating the speed of subspace and head updates), this work achieves convergence under a single-timescale setting. This is more practical and robust, avoiding the need for strict separation of step sizes.
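
To see why the rate above amounts to a linear speedup, here is a short worked calculation. It ignores constants and logarithmic factors, so the numbers are orders of magnitude only; the target accuracy $\epsilon$ is a symbol introduced here for illustration.

$$\frac{1}{\sqrt{TK}} \le \epsilon \quad\Longleftrightarrow\quad TK \ge \frac{1}{\epsilon^2} \quad\Longleftrightarrow\quad T \ge \frac{1}{K\,\epsilon^2}$$

For example, with $\epsilon = 0.01$ a single agent ($K = 1$) needs on the order of $T = 10^4$ iterations, while each of $K = 10$ cooperating agents needs on the order of $T = 10^3$.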

B. Handling Markovian Sampling and Heterogeneity

The analysis overcomes significant technical hurdles:

Markovian Noise: The authors handle the non-i.i.d. nature of TD learning data (Markovian sampling) without assuming i.i.d. samples.
Error Coupling: In heterogeneous settings, the error dynamics of the subspace ( $B$ ) and local heads ( $\omega_k$ ) are tightly coupled. The authors develop a unified Lyapunov function to analyze these coupled errors.
Principal Angle Analysis: A major technical novelty is the derivation of a lower bound on local head errors based on the principal angle distance between the estimated and true subspaces. They prove that if the subspace is misaligned, the local heads cannot converge arbitrarily fast, effectively linking the two error sources.
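
As an illustration of the quantity involved, the sketch below computes one common form of the principal angle distance between two subspaces with orthonormal bases, namely the spectral norm of $(I - \hat{B}\hat{B}^\top) B_*$, which equals the sine of the largest principal angle. The source does not spell out its exact definition, so this choice, and the function name `principal_angle_distance`, are assumptions.

```python
import numpy as np

def principal_angle_distance(B_hat, B_star):
    """Spectral norm of (I - B_hat B_hat^T) B_star for orthonormal B_hat, B_star.

    This equals the sine of the largest principal angle between the two subspaces.
    """
    d = B_hat.shape[0]
    residual = (np.eye(d) - B_hat @ B_hat.T) @ B_star
    return float(np.linalg.norm(residual, 2))

# Reference subspace: span of the first two coordinate axes in R^4.
B_star = np.eye(4)[:, :2]

# Same subspace expressed in a rotated basis: the distance should be zero.
theta = 0.3
rot = np.array([[np.cos(theta), -np.sin(theta)],
                [np.sin(theta),  np.cos(theta)]])
same = principal_angle_distance(B_star @ rot, B_star)

# Tilt the second basis vector out of the subspace by an angle of 0.1 radians.
phi = 0.1
B_tilt = np.array([[1.0, 0.0],
                   [0.0, np.cos(phi)],
                   [0.0, np.sin(phi)],
                   [0.0, 0.0]])
tilted = principal_angle_distance(B_tilt, B_star)  # equals sin(0.1)
```

The first case shows why a subspace distance is used in place of a plain matrix difference: a rotated basis for the same subspace differs entry by entry from `B_star` yet has distance zero, whereas a genuine tilt registers as the sine of the tilt angle.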

C. Comparison with Existing Methods

The paper distinguishes itself from:

Standard Federated RL: Which often trains a single common policy (failing in high heterogeneity).
Two-Timescale PFL: Which requires complex step-size separation assumptions that may not hold for standard polynomial step sizes.
Independent Learning: Which ignores shared structures and suffers from high sample complexity.

4. Experimental Results

The authors validate PMAAR-TD on both prediction and control tasks (Acrobot and CartPole environments).

Prediction Tasks (Fixed Policy):
- PMAAR-TD significantly outperforms Single TD (independent learning) in convergence speed.
- It avoids the suboptimal convergence of FedTD-Uniform (which averages all parameters), demonstrating superior adaptability to heterogeneous environments.
Control Tasks (Actor-Critic):
- In a "mirrored environment" setup (where optimal actions are adversarial between agents), PMAAR-TD successfully learns personalized policies while leveraging shared features.
- Convergence Speed: PMAAR-TD converges faster than Single AC and FedAC-Uniform.
- Stability: The method exhibits lower variance across different random seeds compared to baselines.
- Single vs. Two-Timescale: Experiments confirm that the proposed single-timescale approach converges significantly faster than a two-timescale baseline.

5. Significance and Impact

Theoretical Advancement: This work provides one of the first rigorous finite-time convergence analyses for personalized MARL in the average-reward setting under Markovian sampling. It resolves the "curse of multi-agents" in heterogeneous settings by leveraging shared low-dimensional structures.
Practical Applicability: The single-timescale design makes the algorithm easier to implement and tune compared to two-timescale methods, which are sensitive to step-size ratios.
Generalization: The analytical techniques (specifically the handling of coupled heterogeneous dynamics and principal angle distances) offer a framework that can inspire future research in multi-task learning and federated reinforcement learning where data heterogeneity is a primary concern.

In summary, the paper bridges the gap between personalized federated learning and multi-agent reinforcement learning, proving that agents can collaboratively and efficiently learn distinct, personalized value functions by exploiting a shared low-dimensional linear structure, even in highly heterogeneous environments under Markovian (non-i.i.d.) sampling.