AMPED: Adaptive Multi-objective Projection for balancing Exploration and skill Diversification

Imagine you are teaching a robot to navigate a giant, dark maze. You want the robot to learn two very different things at the same time:

Exploration: It needs to wander everywhere to make sure it doesn't miss any hidden corners (like a curious child).
Skill Diversity: It needs to learn distinct "moves" or "skills" (like walking, jumping, or turning) that are clearly different from one another, so it can pick the right move for a specific puzzle later.

The problem is that these two goals often fight each other. If the robot focuses too much on being curious and wandering randomly, it might never learn a specific, useful skill. If it focuses too much on mastering distinct skills, it might get stuck in one corner and never explore the rest of the maze.

This paper introduces a new method called AMPED (Adaptive Multi-objective Projection for balancing Exploration and skill Diversification) to solve this tug-of-war. Here is how it works, using simple analogies:

1. The "Gradient Surgery" (The Traffic Cop)

In machine learning, the robot learns by adjusting its brain based on "gradients" (signals that tell it which way to move to get better).

The Problem: The signal for "go explore" (wander randomly) and the signal for "learn diverse skills" (stay distinct) often point in opposite directions. It's like a driver getting two GPS instructions at once: "Turn Left!" and "Turn Right!" If you try to follow both, you just spin in circles or crash.
The AMPED Solution: The authors use a technique called Gradient Surgery (specifically PCGrad). Imagine a traffic cop standing at the intersection. When the two signals conflict, the cop doesn't let them cancel each other out. Instead, the cop projects one signal onto a path that doesn't block the other.
- Analogy: Think of two people pushing a heavy box. One pushes North and the other pushes Southwest, so part of each person's effort cancels the other's and the box barely moves. The traffic cop tells a pusher to drop only the part of their push that works against the other person and to keep the rest. What remains no longer fights, so the combined force actually moves the box. This allows the robot to learn both exploration and diversity simultaneously without confusion.

2. The "Double-Engine" Exploration (Entropy + RND)

To make sure the robot explores well, AMPED uses two different "engines" for curiosity:

Engine A (Entropy): This measures how evenly the robot's visits are spread out, by checking how far each visited spot is from its nearest previously visited neighbors. It wants the robot to cover every area equally, like a mailman who wants to deliver to every house in a neighborhood.
Engine B (RND - Random Network Distillation): This is a "novelty detector." It has a random, frozen "target" brain and a "predictor" brain. If the predictor guesses the target's output wrong, it means the robot is in a new, strange place. The robot gets a reward for being surprised.
Why both? Engine A is great at first but gets slow and messy as the map gets huge. Engine B is fast and great at finding new things but can get noisy early on. AMPED combines them so the robot is curious and efficient at all stages.
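
To make Engine B concrete, here is a toy sketch in Python. It is only an illustration under simplifying assumptions: the "target" is a tiny fixed random network, the "predictor" is a plain linear model, and every name (make_random_target, LinearPredictor, novelty) is invented for this example rather than taken from the paper.

```python
import math
import random


def make_random_target(rng, in_dim=2, out_dim=4):
    """Build the frozen 'target' brain: a fixed random network that is never trained."""
    weights = [[rng.uniform(-1, 1) for _ in range(in_dim)] for _ in range(out_dim)]
    biases = [rng.uniform(-1, 1) for _ in range(out_dim)]

    def target(state):
        return [math.tanh(sum(w * x for w, x in zip(row, state)) + b)
                for row, b in zip(weights, biases)]

    return target


class LinearPredictor:
    """The 'predictor' brain: a linear model trained to imitate the target."""

    def __init__(self, in_dim=2, out_dim=4):
        self.weights = [[0.0] * in_dim for _ in range(out_dim)]
        self.biases = [0.0] * out_dim

    def predict(self, state):
        return [sum(w * x for w, x in zip(row, state)) + b
                for row, b in zip(self.weights, self.biases)]

    def update(self, state, target_out, lr=0.3):
        # One gradient step on the squared prediction error for this state.
        for k, (p, t) in enumerate(zip(self.predict(state), target_out)):
            err = t - p
            self.biases[k] += lr * err
            for i, x in enumerate(state):
                self.weights[k][i] += lr * err * x


def novelty(predictor, target, state):
    """Novelty bonus: squared error between predictor and frozen target."""
    return sum((t - p) ** 2 for t, p in zip(target(state), predictor.predict(state)))


rng = random.Random(0)
target = make_random_target(rng)
predictor = LinearPredictor()

# The robot keeps visiting the same three places, so the predictor learns them.
visited = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
for _ in range(1000):
    for s in visited:
        predictor.update(s, target(s))

familiar_bonus = novelty(predictor, target, (1.0, 0.0))  # a place seen many times
novel_bonus = novelty(predictor, target, (3.0, 3.0))     # a place never seen
```

In this toy setup the bonus for the familiar place shrinks toward zero, while the never-visited place still produces a noticeable prediction error, which is what nudges the robot toward it.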

3. The "Skill Selector" (The Smart Conductor)

Once the robot has pre-trained and learned a library of diverse skills (like a musician learning scales, chords, and arpeggios), it needs to apply them to a real task (like playing a song).

The Old Way: Previous methods would just pick a skill at random, like a conductor randomly shouting "Play the violin!" or "Play the drums!" without listening to the music.
The AMPED Way: They introduce a Skill Selector. This is like a smart conductor who listens to the current situation (the state of the maze) and picks the perfect skill for the moment.
- Analogy: If the robot sees a high wall, the selector picks the "Jump" skill. If it sees a narrow hallway, it picks the "Crawl" skill. This makes the robot much faster at solving new problems because it doesn't have to relearn everything from scratch; it just picks the right tool from its toolbox.

4. The Result: A Super-Adaptable Robot

The paper reports that by using this "traffic cop" to balance the conflicting goals, and by using a smart "conductor" to pick skills later, the robot:

Learns a much wider variety of skills than before.
Explores the environment more thoroughly.
Adapts to new tasks much faster (using fewer examples).

In a nutshell:
AMPED is like a training program for a robot that stops it from getting confused by conflicting instructions. It uses a "traffic cop" to let the robot be both a curious explorer and a disciplined skill-learner at the same time. Then, when it's time to work, it uses a "smart conductor" to pick the right skill for the job. The result is a robot that adapts to new tasks faster.

1. Problem Statement

Skill-Based Reinforcement Learning (SBRL) aims to pretrain a skill-conditioned policy to enable rapid adaptation to downstream tasks with sparse rewards. Effective SBRL requires balancing two often conflicting objectives during the pretraining phase:

Exploration: Maximizing state entropy to ensure the agent visits a wide range of states (preventing premature specialization).
Skill Diversification: Maximizing the mutual information (MI) between skills and state trajectories to ensure distinct, distinguishable behaviors.

The Core Challenge: Existing methods often fail to optimize both simultaneously.

MI-driven methods (e.g., DIAYN, BeCL) often induce premature specialization, curtailing exploration.
Entropy-driven methods (e.g., CIC, APT) encourage broad exploration but often sacrifice skill distinguishability, limiting downstream utility.
Gradient Conflict: Optimizing these two objectives jointly using a single network leads to gradient conflicts, where updates beneficial for one objective negatively impact the other, resulting in inefficient learning.

2. Methodology: AMPED

The authors propose AMPED (Adaptive Multi-objective Projection for balancing Exploration and skill Diversification), a framework that explicitly resolves the tension between exploration and diversity through three main components:

A. Dual Intrinsic Reward Formulation

AMPED combines two distinct intrinsic reward signals:

Exploration Reward ($r_{exploration}$): A linear combination of:
- Particle-based Entropy: Encourages visiting diverse states by maximizing state occupancy entropy.
- Random Network Distillation (RND): Provides a prediction-error novelty signal (the error of a trained predictor network against a fixed, randomly initialized target network) to encourage exploration in high-dimensional spaces where entropy estimation is computationally expensive.
Diversity Reward ($r_{diversity}$):
- Uses AnInfoNCE (Anisotropic InfoNCE), a contrastive learning objective. Unlike standard InfoNCE, AnInfoNCE uses a learnable diagonal matrix to capture asymmetries in latent factors.
- This objective pulls together states generated by the same skill while pushing apart states generated by different skills. It serves as a contrastive estimate of the mutual information between skills and states and ensures strong skill separation.
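
As a rough illustration of the diversity objective, the sketch below computes an InfoNCE-style loss in which similarity is a negative squared distance weighted by a diagonal matrix. This follows only the general description above; the exact form used in AMPED, the way a reward is derived from it, and all names here (weighted_sq_dist, aninfonce_style_loss, diag) are assumptions made for the example. In training the diagonal weights would be learned, whereas here they are fixed numbers.

```python
import math


def weighted_sq_dist(u, v, diag):
    """Squared distance with one non-negative weight per latent dimension."""
    return sum(d * (a - b) ** 2 for d, a, b in zip(diag, u, v))


def aninfonce_style_loss(anchor, positive, negatives, diag):
    """Negative log-probability of the positive pair under a softmax over all pairs.

    Similarity is the negative diagonal-weighted squared distance, so unequal
    entries in `diag` make some latent dimensions count more than others.
    """
    pos_logit = -weighted_sq_dist(anchor, positive, diag)
    logits = [pos_logit] + [-weighted_sq_dist(anchor, n, diag) for n in negatives]
    top = max(logits)  # subtract the max for a numerically stable log-sum-exp
    log_norm = top + math.log(sum(math.exp(l - top) for l in logits))
    return log_norm - pos_logit


# Hypothetical 2-D state embeddings: the anchor and positive come from the same
# skill, the negatives come from other skills.
anchor = [0.0, 0.0]
same_skill = [0.1, 0.0]
other_skills = [[2.0, 0.0], [0.0, 2.0]]

loss_isotropic = aninfonce_style_loss(anchor, same_skill, other_skills, [1.0, 1.0])
# Zeroing the second weight hides the only difference between the anchor and
# the negative at [0.0, 2.0], so that negative becomes hard to tell apart.
loss_blind_dim = aninfonce_style_loss(anchor, same_skill, other_skills, [1.0, 0.0])
```

The loss is low when same-skill states sit close together and other-skill states sit far away under the weighted distance, and it rises when the weights ignore a dimension that separates the skills.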

B. Gradient Surgery (PCGrad)

To address the gradient conflicts between the exploration and diversity objectives, AMPED employs Gradient Surgery (specifically the PCGrad algorithm by Yu et al., 2020).

Mechanism: Before updating the policy, the gradients of the exploration loss ($\nabla L_{exploration}$) and diversity loss ($\nabla L_{diversity}$) are computed.
Projection: If the gradients conflict (i.e., their dot product is negative), a gradient $g_i$ is projected onto the orthogonal complement of the other gradient $g_j$: $g_i \leftarrow g_i - \frac{g_i \cdot g_j}{\|g_j\|^2} g_j$. This removes the component of $g_i$ that interferes with the other objective.
Update: The adjusted gradients are summed and applied to the network parameters, so that a step which improves one objective is less likely to degrade the other.
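
The three steps above can be sketched for two objectives as follows. This is a minimal illustration on plain Python lists, following the symmetric form of PCGrad in Yu et al. (2020), where each gradient is projected against the original copy of the other. Whether AMPED projects one gradient or both is not specified here, and the function names are invented for the example.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def project_if_conflicting(g, other):
    """Return g with its component along `other` removed when the two conflict."""
    d = dot(g, other)
    if d >= 0.0:
        return list(g)  # no conflict: leave the gradient unchanged
    scale = d / dot(other, other)
    return [gi - scale * oi for gi, oi in zip(g, other)]


def pcgrad_two_tasks(g_explore, g_diverse):
    """Project each gradient against the ORIGINAL other gradient, then sum."""
    p_explore = project_if_conflicting(g_explore, g_diverse)
    p_diverse = project_if_conflicting(g_diverse, g_explore)
    return [a + b for a, b in zip(p_explore, p_diverse)]


# Conflicting pair: the dot product is -1, so both gradients are projected.
combined = pcgrad_two_tasks([1.0, 0.0], [-1.0, 1.0])  # -> [0.5, 1.5]

# Orthogonal pair: the dot product is 0, so the gradients are simply summed.
untouched = pcgrad_two_tasks([1.0, 0.0], [0.0, 1.0])  # -> [1.0, 1.0]
```

In the conflicting case, a naive sum would give [0.0, 1.0] and erase the exploration direction entirely, whereas the projected update keeps a positive component along both original gradients.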

C. Adaptive Skill Selection

During the downstream fine-tuning phase, AMPED introduces a Soft Actor-Critic (SAC) based skill selector.

Instead of uniformly sampling skills (common in prior work), the selector learns a policy $p(z|s)$ to dynamically choose the most appropriate pre-trained skill $z$ for the current state $s$ based on the downstream task reward.
This adaptive selection maximizes the utility of the learned diverse skill repertoire.
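
To illustrate what a state-dependent skill selector does, the sketch below uses a deliberately simplified stand-in: a tabular, one-step selector with a softmax (temperature-controlled) policy over skills. It is not SAC and not the paper's implementation. The toy task, the reward function, and all names (SoftSkillSelector, toy_task_reward) are hypothetical.

```python
import math
import random


class SoftSkillSelector:
    """Tabular softmax selector over skills (a simplified stand-in for SAC)."""

    def __init__(self, n_states, n_skills, temperature=0.1, lr=0.2):
        self.q = [[0.0] * n_skills for _ in range(n_states)]
        self.temperature = temperature
        self.lr = lr

    def probs(self, state):
        """p(z | s): softmax of the estimated skill values for this state."""
        logits = [q / self.temperature for q in self.q[state]]
        top = max(logits)
        exps = [math.exp(l - top) for l in logits]
        total = sum(exps)
        return [e / total for e in exps]

    def sample(self, state, rng):
        skills = range(len(self.q[state]))
        return rng.choices(skills, weights=self.probs(state))[0]

    def update(self, state, skill, reward):
        # Move the value estimate toward the observed downstream reward.
        self.q[state][skill] += self.lr * (reward - self.q[state][skill])


def toy_task_reward(state, skill):
    """Hypothetical task: skill 0 suits state 0, skill 1 suits state 1."""
    return 1.0 if skill == state else 0.0


rng = random.Random(0)
selector = SoftSkillSelector(n_states=2, n_skills=3)
for _ in range(500):
    state = rng.randrange(2)
    skill = selector.sample(state, rng)
    selector.update(state, skill, toy_task_reward(state, skill))

probs_state0 = selector.probs(0)
probs_state1 = selector.probs(1)
```

After training, the selector concentrates its probability on a different skill in each state, in contrast to uniform sampling, which would keep choosing the unhelpful third skill a third of the time.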

3. Key Contributions & Theoretical Insights

Theoretical Analysis of Sample Complexity: The authors provide a theoretical proof (Theorem 1) demonstrating that greater skill diversity reduces the sample complexity required for the skill selector to identify the optimal skill for a downstream task. Specifically, if skills are sufficiently diverse (large separation $\delta$), the number of samples $n$ required to select the correct skill with high confidence decreases exponentially.
Resolution of Gradient Conflicts: The paper empirically validates that exploration and diversity gradients are highly conflicting (often >99% conflict ratio in certain domains) and demonstrates that explicit projection is necessary for stable optimization.
Novel Objective Integration: The first application of AnInfoNCE for skill diversification, showing it provides a more effective MI estimate than standard InfoNCE in this setting.
Hybrid Exploration: The combination of particle-based entropy (reliable in small buffers) and RND (scalable in large buffers) creates a robust exploration mechanism.
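
For readers unfamiliar with particle-based entropy, the sketch below shows one common k-nearest-neighbor form of the reward and a linear combination with a novelty bonus. The log(1 + mean distance) formula follows the style of earlier particle-based estimators and is an assumption here, as are the weights and the function names. The paper's exact formula and coefficients are not given in this summary.

```python
import math


def knn_entropy_reward(state, buffer, k=3):
    """Reward states whose k nearest buffer neighbors are far away.

    If `state` itself is stored in `buffer`, its zero self-distance is counted.
    """
    if not buffer:
        return 0.0
    nearest = sorted(math.dist(state, other) for other in buffer)[:k]
    return math.log(1.0 + sum(nearest) / len(nearest))


def exploration_reward(entropy_bonus, novelty_bonus, w_entropy=1.0, w_novelty=1.0):
    """Linear combination of the two signals; the weights are hypothetical."""
    return w_entropy * entropy_bonus + w_novelty * novelty_bonus


# Hypothetical replay buffer: four states crowded near the origin, one far away.
buffer = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (5.0, 5.0)]

crowded = knn_entropy_reward((0.05, 0.05), buffer)  # in the middle of the cluster
isolated = knn_entropy_reward((5.0, 4.0), buffer)   # in a sparsely visited region
```

The isolated state earns the larger bonus. Each query sorts distances to the whole buffer, so the cost grows with buffer size, which is consistent with pairing this signal with RND for large buffers.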

4. Experimental Results

The method was evaluated on the Unsupervised Reinforcement Learning Benchmark (URLB) across three domains: Walker, Quadruped, and Jaco (robotic arm), as well as visualized in Maze environments.

Performance: AMPED achieved state-of-the-art results, significantly outperforming strong baselines including DIAYN, BeCL, CIC, RND, CeSD, ComSD, and APT.
- On URLB, AMPED surpassed the previous state-of-the-art (APT) and diversity-exploration hybrids (CeSD, ComSD) by significant margins (e.g., +20.91% over CeSD, +35.01% over ComSD in aggregate IQM scores).
Ablation Studies:
- Removing any single component (RND, AnInfoNCE, Gradient Surgery, or Skill Selector) led to degraded overall performance, confirming that each element is non-redundant.
- Gradient Surgery: Disabling it caused a performance drop of 4.3% to 26.5% depending on the domain, highlighting the critical nature of conflict resolution.
- Skill Count: The study found that simply increasing the number of skills does not guarantee better performance; an optimal skill dimension (16 in their experiments) is required to balance coverage and separation.
Visualization: In Maze environments, AMPED successfully learned skills that were both well-separated (high diversity) and provided full state coverage (high exploration), whereas other methods typically excelled at only one or the other.
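
The aggregate scores above are reported as IQM (interquartile mean), which averages only the middle 50% of runs. The sketch below shows the computation on made-up normalized scores. The numbers are not results from the paper, and the truncation rule (dropping the integer part of 25% of the runs from each end) is one common convention assumed here.

```python
def interquartile_mean(scores):
    """Mean of the middle 50% of a non-empty list of scores."""
    ordered = sorted(scores)
    cut = int(0.25 * len(ordered))  # number of runs dropped from each end
    middle = ordered[cut:len(ordered) - cut]
    return sum(middle) / len(middle)


# Hypothetical normalized scores from eight runs, with one outlier run.
scores = [0.2, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 3.0]

iqm = interquartile_mean(scores)         # averages 0.5, 0.6, 0.7, 0.8
plain_mean = sum(scores) / len(scores)   # pulled upward by the 3.0 outlier
```

Here the IQM is about 0.65 while the plain mean is about 0.89, which shows why IQM is less sensitive to a single unusually good or bad run.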

5. Significance and Impact

Harmonizing Conflicting Goals: AMPED provides a principled, theoretically grounded approach to balancing exploration and diversity, moving beyond ad-hoc heuristics or simple weighted sums of objectives.
Efficiency: By resolving gradient conflicts, the method enables more stable and efficient learning, leading to better downstream adaptation with fewer samples.
Generalizability: The framework is applicable to various continuous control tasks and offers a blueprint for multi-objective RL where objectives naturally conflict.
Future Directions: The paper suggests that the core insight of using gradient projection to balance competing learning signals can be extended to other settings with multiple learning objectives, potentially improving robustness in complex RL agents.

In summary, AMPED demonstrates that explicitly managing the gradient conflicts between exploration and skill diversity, combined with adaptive skill selection, leads to superior skill learning and downstream task performance in reinforcement learning.