Imagine you are teaching two robots, Pink and Green, to play a territory-grabbing game on a giant 10x10 grid, a bit like a supersized "Tic-Tac-Toe". The goal is simple: paint as many tiles as possible before time runs out. If you paint a tile, you own it. If you "lock" a tile, no one can ever take it from you. The robot that owns the most tiles at the end wins.
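(To make the rules concrete, here is a tiny sketch of the game state. All names are hypothetical; the paper's actual environment code is not reproduced here and will differ in its details.)

```python
import numpy as np

# Hypothetical sketch of the game described above.
GRID = 10
EMPTY, PINK, GREEN = 0, 1, 2

board = np.zeros((GRID, GRID), dtype=int)    # who owns each tile
locked = np.zeros((GRID, GRID), dtype=bool)  # locked tiles never change hands

def paint(player, row, col):
    """A robot claims a tile, unless another robot has locked it."""
    if not locked[row, col]:
        board[row, col] = player

def winner():
    """When time runs out, whoever owns more tiles wins."""
    pink, green = (board == PINK).sum(), (board == GREEN).sum()
    return "Pink" if pink > green else "Green" if green > pink else "Draw"
```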
The author of this paper, a high school student named Diyansha, wanted to see if a standard AI training method called PPO (Proximal Policy Optimization) could teach these robots to win.
Here is the story of what went wrong, why it was so confusing, and how a tiny fix saved the day.
1. The Disaster: Robots That Can't Win
Diyansha set up the game and let the robots train against each other for a long time (84,000 rounds). The expectation was that they would get smarter and smarter.
The Reality? They got worse.
After all that training, the robots could only beat a random opponent (one that just moves aimlessly) 26% of the time. That's worse than flipping a coin!
Why? It turns out the "teacher" (the code) was giving the robots terrible instructions. It was like trying to teach a child to drive while the speedometer was broken and the steering wheel was disconnected.
2. The Five "Bugs" (The Broken Teacher)
Diyansha found five specific mistakes in the code that made the robots stupid. Think of these as five broken parts of a car engine (a code sketch of the repairs follows below):
- The "Explosive" Score: The reward system was broken. If a robot locked a tile, the computer gave it a reward so huge (like +10,000 points) that it drowned out everything else. It was like a teacher screaming "GOOD JOB!" for tying their shoes, but ignoring when the student actually solved a math problem. The robots stopped learning strategy and just tried to get that one huge number.
- The "Amnesia" Problem: The robots were bad at remembering what happened at the start of the game. In a 250-step game, a move made at step 1 is worth almost nothing by step 250 if you don't use a special math trick (called GAE). Without this, the robots forgot their early mistakes and couldn't learn from them.
- The "Blind" Signal: The robots didn't know when the game was over or who won. They were playing in the dark, never getting a "You Win!" or "You Lose!" signal at the end.
- The "Confused" Eyes: The robots were fed data in a messy way. Some numbers were small (like position 1-9), and some were huge (like steps remaining 0-250). It's like trying to read a map where the distance to the next town is written in millimeters, but the name of the town is written in giant letters. The robots got confused and focused on the wrong things.
- The "Wrong Winner": The computer was declaring the winner based on the broken score (see #1) instead of actually counting the tiles.
The Fix: Once these five bugs were squashed, the robots suddenly became very good. They beat the random opponent 73.5% of the time. Success!
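The paper's actual code isn't reproduced here, but repairs like these are standard PPO plumbing. A minimal sketch, where every constant and function name is hypothetical (the reward scale, gamma, and lambda below are illustrative defaults, not the paper's values):

```python
import numpy as np

# --- Fix 1: tame the "explosive" score (hypothetical scale and clip) ------
def shape_reward(raw_reward, scale=0.01, clip=1.0):
    """Rescale and clip so one event (like locking a tile) can't drown
    out every other learning signal."""
    return float(np.clip(raw_reward * scale, -clip, clip))

# --- Fix 2: credit assignment over long games via GAE ---------------------
def compute_gae(rewards, values, dones, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation: blends short- and long-horizon
    returns so a step-250 outcome still teaches a step-1 move."""
    advantages = np.zeros(len(rewards))
    gae = 0.0
    next_value = 0.0  # assume no bootstrap past the end of the rollout
    for t in reversed(range(len(rewards))):
        mask = 1.0 - dones[t]  # fix 3: the robot "feels" where games end
        delta = rewards[t] + gamma * next_value * mask - values[t]
        gae = delta + gamma * lam * mask * gae
        advantages[t] = gae
        next_value = values[t]
    return advantages

# --- Fix 4: put all observation features on a comparable scale ------------
def normalize_obs(position, steps_remaining, max_steps=250, grid=10):
    """Scale every feature to roughly [0, 1] so no single input shouts."""
    return np.array([position / grid, steps_remaining / max_steps])

# --- Fix 5: declare the winner by counting tiles, not reward totals -------
def true_winner(pink_tiles, green_tiles):
    return "Pink" if pink_tiles > green_tiles else \
           "Green" if green_tiles > pink_tiles else "Draw"
```

Note how the `dones` mask inside `compute_gae` also delivers fix #3: the end-of-game signal finally reaches the learner instead of being silently dropped.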
3. The Hidden Trap: "Competitive Overfitting"
But then, something weird happened. The author kept training them for a few thousand more rounds.
The robots got worse again.
Their win rate against the random opponent crashed from 73.5% down to 21.6%.
Why? This is the paper's biggest discovery, called Competitive Overfitting.
Imagine Pink and Green are two boxers training in a gym.
- Phase 1: They spar with each other. Pink learns a specific punch to counter Green's specific dodge. Green learns a specific dodge to counter Pink's punch.
- Phase 2: They get so good at countering each other that they forget how to fight anyone else. They become "specialists" in fighting each other.
- The Trap: If you put them in a ring with a random stranger (the random opponent), they lose because they are trying to use their "special counter-punch" against someone who isn't even doing the move they expect.
The Scary Part: The author checked the robots' performance against each other (Self-Play), and it looked perfect! They were still splitting games roughly 50/50. The "Self-Play Score" was a lie. It hid the fact that they had become useless against the real world.
4. The Magic Fix: "Opponent Mixing"
How do you stop two boxers from becoming too specialized? You make them spar with someone else.
The author introduced a tiny change: 20% of the time, Green stops trying to be smart and just moves randomly.
- Pink still has to fight the smart Green sometimes.
- But 20% of the time, Pink has to fight a "drunk" Green who moves randomly.
This forced Pink to learn a general strategy that works against everyone, not just the specific tricks of its partner.
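In code, the mixing rule is tiny. The 20% figure comes from the paper; everything else in this sketch (names, signatures) is a hypothetical illustration:

```python
import random

RANDOM_OPPONENT_PROB = 0.20  # the paper's 20% mixing rate

def opponent_action(green_policy, observation, legal_actions):
    """Opponent mixing: mostly play the learned Green, but sometimes
    play the "drunk" Green so Pink can't overfit to one partner."""
    if random.random() < RANDOM_OPPONENT_PROB:
        return random.choice(legal_actions)   # random move
    return green_policy(observation)          # learned move
```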
The Result:
- Before the fix: 21.6% win rate (Terrible).
- After the fix: 77.1% win rate (Excellent).
- And the best single robot? It hit 93.9%.
5. The Big Lessons
This paper teaches us three main things about training AI:
- Don't trust the mirror: If you train two AIs against each other, they might just get really good at beating each other while getting terrible at everything else. You must test them against a "dummy" opponent to see if they are actually smart (see the evaluation sketch after this list).
- Small details matter: A tiny math error in how you count points or normalize numbers can make a super-intelligent AI act like a toddler.
- Variety is key: To make an AI robust, it needs to face different types of opponents, not just one perfect partner.
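In practice, "don't trust the mirror" means always tracking a second scoreboard. A minimal sketch, assuming a hypothetical `play_episode(policy, opponent, rng)` helper that plays one game and returns the winner's name:

```python
import random

def win_rate_vs_random(play_episode, policy, episodes=100, seed=0):
    """Win rate against a frozen random baseline. Track this alongside
    self-play scores: self-play can look perfect (50/50) while this
    number quietly collapses."""
    rng = random.Random(seed)
    random_opponent = lambda obs, legal: rng.choice(legal)
    wins = sum(play_episode(policy, random_opponent, rng) == "policy"
               for _ in range(episodes))
    return wins / episodes
```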
In a nutshell: The author built a game, accidentally crippled the AI with five subtle bugs, fixed them, discovered a new way the AI could "cheat" by over-specializing, and fixed that with a simple rule: "Sometimes, make the sparring partner play randomly."
The code is now open for everyone to use, so other researchers can learn from these mistakes and build better, more reliable AI.