Guided Policy Optimization under Partial Observability

This paper introduces Guided Policy Optimization (GPO), a framework that co-trains a guider with privileged information alongside a learner trained via imitation. By keeping the two in sync, GPO tackles the challenges of reinforcement learning in partially observable environments, achieving theoretical optimality and superior empirical performance compared to existing methods.

Yueheng Li, Guangming Xie, Zongqing Lu

Published 2026-03-16

The Big Problem: Learning in the Fog

Imagine you are trying to learn how to drive a car.

  • The Ideal Scenario: You have a perfect instructor sitting next to you who can see the entire road, the weather, the traffic lights, and the engine temperature. They tell you exactly what to do.
  • The Real Problem: In the real world (and in many AI tasks), you are driving in a thick fog. You can only see a few feet ahead. You can't see the traffic light turning red until you are right on top of it. You can't feel the engine overheating.

In the world of Artificial Intelligence, this is called Partial Observability. The AI (the "Student") has to make decisions without seeing the whole picture. This makes learning incredibly hard, slow, and prone to mistakes.

The Old Way: The "Super-Teacher" Trap

Traditionally, to help the Student learn, researchers would train a "Teacher" AI that could see everything (no fog). Then, they would try to teach the Student to copy the Teacher.

The Analogy: Imagine a Grandmaster Chess player (the Teacher) trying to teach a beginner (the Student).

  • The Grandmaster sees 10 moves ahead.
  • The beginner can only see 1 move ahead.
  • If the Grandmaster says, "Move the Knight here," the beginner has no idea why. To the beginner, it looks like a random, crazy move.
  • The Result: The beginner gets confused, tries to copy the move, fails, and learns nothing. This is called the "Imitation Gap." The Teacher is too good, and the Student can't understand the logic behind the moves.

The New Solution: Guided Policy Optimization (GPO)

The authors of this paper propose a new method called Guided Policy Optimization (GPO). Instead of a distant, perfect Grandmaster, they create a Coach and a Player who train together in the same room.

Here is how it works, step-by-step:

1. The "Coach" (The Guider)

The Coach has a special pair of glasses that let them see through the fog (they have "privileged information"). They know where the tiger is hiding behind the door, or they know the engine is overheating.

  • The Twist: The Coach isn't allowed to just be a Grandmaster. They are forced to stay within the Student's "comfort zone."

2. The "Player" (The Learner)

The Player is the one actually driving the car or playing the game. They are still in the fog. They can only see what is right in front of them.

3. The "Backtracking" Rule (The Secret Sauce)

This is the most important part. In old methods, the Teacher would just show the Student the perfect move, even if the Student couldn't do it.
In GPO, there is a strict rule: The Coach must never get so far ahead that the Player can't follow.

  • The Analogy: Imagine the Coach is walking a dog on a leash.
    • If the Coach runs too fast, the leash goes taut, and the dog trips.
    • If the Coach stops completely, the dog gets bored.
    • GPO's Strategy: The Coach walks just a little bit faster than the dog. If the dog starts to lag behind, the Coach immediately slows down (this is called Backtracking) to match the dog's pace.

4. The Learning Loop

  1. The Coach uses their super-vision to figure out the best path forward.
  2. The Player tries to copy the Coach's moves, but only using the limited view they have.
  3. The Check: If the Player struggles to copy the Coach, the system realizes the Coach is moving too fast.
  4. The Correction: The Coach "backtracks" and adjusts their behavior to be something the Player can actually learn.
  5. Repeat: They do this over and over. The Coach gets slightly better, and the Player gets slightly better, always staying in sync.
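The loop above can be sketched in a few lines of heavily simplified Python. Everything here is an illustrative stand-in, not the paper's actual algorithm: the linear policies, the `KL_LIMIT` threshold, the `tanh` "improvement signal," and the weight-averaging step used for backtracking are all assumptions made for this toy sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

STATE_DIM, ACTION_DIM = 4, 2
KL_LIMIT = 0.1          # how far the guider may drift from the learner
LR = 0.05               # step size for both policies

# Linear policies: the guider sees the full state, the learner a masked view.
guider_w = np.zeros((ACTION_DIM, STATE_DIM))
learner_w = np.zeros((ACTION_DIM, STATE_DIM))
mask = np.array([1.0, 1.0, 0.0, 0.0])   # the learner's "fog": last two dims hidden

def mean_action(w, x):
    return w @ x

def kl_gaussian(mu_p, mu_q):
    # KL between unit-variance Gaussians reduces to half the squared distance.
    return 0.5 * float(np.sum((mu_p - mu_q) ** 2))

for step in range(200):
    s = rng.normal(size=STATE_DIM)      # full (privileged) state
    o = s * mask                        # learner's partial observation

    # 1) The Coach improves toward a target action using the full state.
    target = np.tanh(s[:ACTION_DIM])    # stand-in for an RL improvement signal
    guider_w += LR * np.outer(target - mean_action(guider_w, s), s)

    # 2) The Player imitates the Coach, but only through its partial view.
    g_act = mean_action(guider_w, s)
    learner_w += LR * np.outer(g_act - mean_action(learner_w, o), o)

    # 3) Backtracking: if the Player can't keep up, pull the Coach back
    #    toward the Player's behavior instead of letting the gap grow.
    gap = kl_gaussian(g_act, mean_action(learner_w, o))
    if gap > KL_LIMIT:
        guider_w = 0.5 * (guider_w + learner_w)
```

The key design choice is step 3: the imitation gap is measured every iteration, and whenever it exceeds the threshold, the Coach is dragged back toward the Player rather than the Player being asked to do the impossible.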

Why is this better?

  1. No "Impossible" Teachers: Because the Coach is forced to stay close to the Player's ability, the Player never gets confused by moves they can't understand.
  2. Efficient Learning: The Coach uses their super-vision to find the right direction, so the Player doesn't waste time wandering in the fog making random mistakes.
  3. Robustness: Even if the Player is noisy or makes mistakes (like a real robot with shaky sensors), the system adapts. The Coach pulls them back on track without breaking the connection.

Real-World Examples from the Paper

The authors tested this on three types of challenges:

  • The "Tiger Door" Game: Imagine two doors. A tiger is behind one. You can listen to find out where it is, or just guess.

    • Old Way: A perfect Teacher knows where the tiger is and just opens the right door. The Student, who can't hear the tiger, just guesses randomly and fails.
    • GPO Way: The Coach realizes the Student can't hear the tiger, so the Coach also chooses to listen first. They teach the Student the strategy of "Listen, then Open," which the Student can actually learn.
  • Robotics (The Foggy Gym): They trained robots to walk (like a Humanoid or a Cheetah) while adding "noise" to their sensors (simulating fog).

    • The GPO robots learned to walk much faster and more stably than robots trained with old methods. They learned to balance even when their "eyes" were blurry.
  • Memory Games: Some tasks require remembering things from the past (like a card game).

    • GPO helped the AI remember the right cards to play, even when it couldn't see the whole board, by guiding it through the "fog" of memory.
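The Tiger Door intuition can be made concrete with a tiny simulation. The rewards below (-100 for the tiger, +10 for the safe door, -1 to listen) are invented for this sketch, not taken from the paper, and listening is assumed to be perfectly reliable:

```python
import random

random.seed(0)

def guess_blindly():
    """A student copying a privileged teacher: opens a door with no info."""
    tiger = random.choice([0, 1])
    door = random.choice([0, 1])        # can't hear the tiger, so it's a coin flip
    return 10 if door != tiger else -100

def listen_then_open():
    """The strategy a GPO-style coach would teach: pay to listen, then open."""
    tiger = random.choice([0, 1])
    heard = tiger                       # listening (cost -1) reveals the tiger
    door = 1 - heard                    # open the other door
    return -1 + (10 if door != tiger else -100)

N = 10_000
blind = sum(guess_blindly() for _ in range(N)) / N
informed = sum(listen_then_open() for _ in range(N)) / N
# Expected values: blind ≈ (10 - 100) / 2 = -45, informed = 9 exactly.
```

The blind strategy is what imitating a see-everything teacher produces; the listen-first strategy sacrifices a point of reward for information the student can actually act on, which is exactly the kind of policy a constrained Coach is pushed to demonstrate.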

The Takeaway

Guided Policy Optimization is like having a running partner who is just slightly faster than you. They know the route better than you do, but they slow down to match your pace so you don't get left behind. By constantly adjusting their speed to ensure you can follow, they help you get to the finish line faster and more safely than if you were running alone or trying to keep up with a professional Olympian.

It solves the problem of "How do I teach a beginner using an expert's knowledge without confusing them?" by ensuring the expert becomes a beginner-friendly guide.
