Reinforcing the World's Edge: A Continual Learning Problem in the Multi-Agent-World Boundary

This paper argues that in decentralized multi-agent reinforcement learning, peer-policy updates destabilize the agent-world boundary and erode the invariant cores of an agent's decision-making. It therefore frames the challenge as a continual learning problem driven by boundary drift rather than by exogenous task switches.

Dane Malenfant

Published 2026-03-10

The Big Idea: Where Do "You" End and "The World" Begin?

Imagine you are playing a video game. To learn how to win, you need to figure out which moves work. In a standard video game (a single-player game), the rules are fixed. The enemies don't learn; the doors don't change their locks. If you find a secret path that leads to the treasure, that path will work every single time you play. You can memorize it, and it becomes a permanent part of your "winning strategy."

The Paper's Core Question: What happens when the "world" you are playing in isn't static? What happens if the world is actually another player who is also learning and changing their mind?

This paper argues that when you play with other learning agents (like in a multiplayer game), the line between "You" and "The World" gets blurry and starts to drift. This makes it incredibly hard to keep using the same winning strategies from one game to the next.


The Analogy: The "Winning Recipe"

To understand the paper, let's use the analogy of a Recipe.

1. The Single-Agent World (The Static Kitchen)

Imagine you are a chef in a kitchen where the ingredients and the oven never change.

  • The Goal: Bake a perfect cake.
  • The "Invariant Core": After baking 100 cakes, you realize that every single successful cake required three specific steps: Mix Flour → Add Eggs → Bake.
  • The Result: You write these three steps down as your "Invariant Core." No matter how many times you bake, this recipe works. You can reuse it forever because the kitchen (the world) is stable.

2. The Multi-Agent World (The Cooperative Kitchen)

Now, imagine you are baking in a kitchen with a partner. You are the "Focal Agent," and your partner is the "Peer Agent."

  • The Twist: Your partner is also learning.
    • Episode 1: Your partner is slow. To get the cake done, you must mix the flour yourself. The "Invariant Core" for this game is: You Mix → You Add Eggs → You Bake.
    • Episode 2: Your partner gets faster and smarter. Now, they decide to mix the flour for you! Suddenly, the step "You Mix" is no longer necessary for a successful cake. In fact, if you try to mix it, you might get in their way.
    • The Result: The "Invariant Core" from Episode 1 (Mixing) has vanished. The recipe that worked yesterday is useless today.
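The two kitchens can be sketched in a few lines of Python. This is a toy illustration, not the paper's formalism: we treat the "invariant core" as simply the set of steps shared by every successful episode, and all the step names are made up for the analogy.

```python
def invariant_core(successful_episodes):
    """Steps shared by every successful episode (a toy 'invariant core')."""
    core = set(successful_episodes[0])
    for episode in successful_episodes[1:]:
        core &= set(episode)  # keep only steps present in every success
    return core

# Static kitchen: every success contains the same three steps.
static = [
    ["mix_flour", "add_eggs", "bake", "hum_a_tune"],
    ["preheat_oven", "mix_flour", "add_eggs", "bake"],
]
print(sorted(invariant_core(static)))
# ['add_eggs', 'bake', 'mix_flour']

# Cooperative kitchen: the partner learns to mix, so "you_mix" drops out.
cooperative = [
    ["you_mix", "you_add_eggs", "you_bake"],  # Episode 1: slow partner
    ["you_add_eggs", "you_bake"],             # Episode 2: partner now mixes
]
print(sorted(invariant_core(cooperative)))
# ['you_add_eggs', 'you_bake']
```

In the static case the intersection is stable across episodes; in the cooperative case a single peer-policy change deletes a step from the core, which is the drift the paper is about.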

The Problem: "Boundary Drift"

The paper calls this phenomenon "Boundary Drift."

In a single-player game, the boundary between "Me" and "The World" is a solid wall. The wall doesn't move.
In a multi-player game, the wall is made of jelly. As your partner learns new tricks, the wall shifts.

  • What used to be "My Job" (mixing) becomes "Their Job."
  • What used to be "The World's Behavior" (the oven heating up) changes because your partner is now controlling the oven.

Because the boundary keeps moving, the "winning recipe" (the invariant core) keeps changing or disappearing entirely. You can't just memorize a strategy and reuse it; you have to constantly re-learn what the world looks like right now.

The Paper's Contributions (In Plain English)

The authors did three main things to explain this:

  1. They Proved the "Recipe" Exists (When the World is Stable):
    They mathematically showed that in a fixed world, there is always a set of common steps shared by every successful attempt. If the only way to the treasure is through a locked door, then the "find the key" step will appear in every solution.

  2. They Showed the "Recipe" Can Vanish (When the World Moves):
    They proved that in a multi-agent game, a step that was essential in one game might be completely gone in the next. If your partner learns to open the door themselves, your "open door" step disappears from the list of necessary actions. The "core" of your strategy shrinks or disappears.

  3. They Measured the "Drift":
    They created a way to measure how much the "world" changed between games. They call this the Variation Budget.

    • Low Budget: The world barely changed. Your old recipe still mostly works.
    • High Budget: The world changed drastically. Your old recipe is garbage; you need a new one.
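A toy version of this measurement: sum how much the peer's policy changes from one episode to the next, using total-variation distance between action distributions. The dictionary policy representation and the function names are illustrative assumptions, not the paper's exact definition of the variation budget.

```python
def tv_distance(p, q):
    """Total-variation distance between two discrete action distributions."""
    actions = set(p) | set(q)
    return 0.5 * sum(abs(p.get(a, 0.0) - q.get(a, 0.0)) for a in actions)

def variation_budget(peer_policies):
    """Sum of episode-to-episode policy changes (a toy variation budget)."""
    return sum(tv_distance(peer_policies[t], peer_policies[t + 1])
               for t in range(len(peer_policies) - 1))

# Low budget: the peer barely changes between episodes.
stable_peer = [{"mix": 0.10, "wait": 0.90}, {"mix": 0.15, "wait": 0.85}]
print(round(variation_budget(stable_peer), 2))   # 0.05

# High budget: the peer flips from waiting to mixing.
shifting_peer = [{"mix": 0.10, "wait": 0.90}, {"mix": 0.90, "wait": 0.10}]
print(round(variation_budget(shifting_peer), 2)) # 0.8
```

The number directly tracks the intuition above: near zero means your old recipe mostly still works, and a large value means the world has effectively been replaced.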

Why Does This Matter?

This paper changes how we think about Artificial Intelligence in teams.

  • Old Way: We thought AI agents just needed to adapt to new tasks (like switching from chess to checkers).
  • New Way: We need to realize that even if the task (the game) stays the same, the environment changes because the other players are learning.

The Takeaway:
If you want to build AI that works well in teams, you can't just teach it a fixed set of rules. You have to teach it to:

  1. Detect the Drift: Notice when the "boundary" has moved (e.g., "Hey, my partner is doing my job now!").
  2. Preserve the Core: Find the tiny bits of the strategy that never change (like "I still need to reach the goal").
  3. Predict the Shift: Try to guess what the partner will do next so you don't waste time on steps that are no longer needed.
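The first of those skills, detecting the drift, can be hinted at in code. Here is a minimal, purely illustrative detector: it watches how often the partner performs a step that used to be "your job," and flags a boundary shift when that frequency changes a lot between an old and a recent window. The class name, window sizes, and threshold are assumptions for the sketch, not anything from the paper.

```python
from collections import deque

class DriftDetector:
    """Flag boundary drift when the partner's rate of doing a watched
    step differs a lot between an old and a recent window (toy sketch)."""

    def __init__(self, window=50, threshold=0.3):
        self.old = deque(maxlen=window)     # older observations
        self.recent = deque(maxlen=window)  # most recent observations
        self.threshold = threshold

    def observe(self, partner_did_step):
        # When the recent window is full, its oldest entry ages into "old".
        if len(self.recent) == self.recent.maxlen:
            self.old.append(self.recent[0])
        self.recent.append(1.0 if partner_did_step else 0.0)

    def drifted(self):
        if not self.old or not self.recent:
            return False  # not enough history to compare yet
        rate_old = sum(self.old) / len(self.old)
        rate_new = sum(self.recent) / len(self.recent)
        return abs(rate_new - rate_old) > self.threshold

detector = DriftDetector(window=20, threshold=0.3)
for _ in range(40):        # early episodes: partner never mixes
    detector.observe(False)
for _ in range(20):        # later episodes: partner always mixes
    detector.observe(True)
print(detector.drifted())  # True — "my partner is doing my job now!"
```

A real system would compare richer statistics (transition distributions, returns) rather than a single step frequency, but the shape is the same: monitor the boundary, and re-learn when it moves.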

In short: In a world of learning partners, the only constant is change. The "edge" of your world is always shifting, and your AI needs to learn how to surf those waves rather than trying to stand still.