ProgAgent: A Continual RL Agent with Progress-Aware Rewards

Imagine teaching a robot to do chores. In the old days, if you wanted the robot to learn how to open a door, you had to write a very specific, complex set of rules (a "reward function") telling it exactly what a "good" move looks like. If you then wanted it to learn how to press a button, you'd have to rewrite all those rules from scratch. Worse, as soon as it started learning the new task, it would completely forget how to open the door. This is called catastrophic forgetting.

ProgAgent is a new kind of robot brain designed to solve this. Think of it as a robot that never forgets, learns incredibly fast, and doesn't need a human to write a manual for every single new task.

Here is how it works, broken down into three simple concepts:

1. The "Progress Bar" Teacher (No Manuals Needed)

Usually, robots need a human to say, "Good job!" or "Bad job!" after every move. ProgAgent is different. It watches unlabeled videos of humans doing tasks (like a video of someone opening a door).

Instead of trying to guess what the human is doing, ProgAgent just asks: "How far along are we?"

The Analogy: Imagine you are watching a movie. You don't need to know the plot to know if the movie is at the beginning, the middle, or the end. ProgAgent looks at the start of the video, the current moment, and the end goal, and it calculates a "Progress Score."
The Magic: It turns this score into a reward. If the robot is getting closer to the goal, it gets a "high score." If it's wandering around, the score stays low. This gives the robot a constant, dense stream of feedback (like a progress bar filling up) without needing a human to click a button every time it moves.

2. The "Skeptic" Guardian (Don't Trust the Unknown)

Here is the tricky part: When the robot starts exploring on its own, it might do weird, crazy things that look nothing like the human videos. A normal AI might get confused and think, "Hey, this weird spinning move looks like progress!" and get a high score by mistake. This is called distribution shift.

ProgAgent has a built-in "Skeptic" (an adversarial refinement mechanism).

The Analogy: Imagine a strict teacher. When a student (the robot) tries a new, weird trick that isn't in the textbook, the teacher doesn't immediately give them an A. The teacher says, "I don't recognize this move. Let's assume it's a mistake until you prove otherwise."
The Result: This keeps the robot from getting "high scores" for doing nonsense. It forces the robot to stick to paths that actually look like progress, keeping it safe and stable while it learns.

3. The "Super-Fast" Library (The JAX Engine)

Learning new skills while remembering old ones is computationally heavy. It's like trying to read a new book while simultaneously memorizing every book you've ever read, all in real-time. Most computers get too slow to do this.

ProgAgent uses a special technology called JAX (think of it as a super-charged engine for math).

The Analogy: Instead of a librarian who reads one book at a time, ProgAgent is a librarian who can read 10,000 books simultaneously in a split second.
The Result: It can run thousands of simulations at once. That speed makes it practical to train for "learning new things" and "remembering old things" together, and in the paper's experiments the agent even outperformed a robot with access to a "perfect memory" of all past data.

The Big Picture: Why This Matters

The paper tested ProgAgent on a series of tasks (like pressing buttons, opening doors, closing windows).

Old Robots: Learned the new task but forgot the old ones, or learned very slowly because the rewards were sparse.
ProgAgent: Learned the new task quickly, retained the old ones better than the baselines, and did it all by just watching videos of humans.

In summary: ProgAgent is a robot that learns by watching videos and asking, "Am I getting closer?" It has a built-in skeptic to stop it from getting confused by its own mistakes, and it runs on a super-fast engine that lets it practice thousands of times in parallel. It's a major step toward robots that can truly live and learn with us in the real world.

1. Problem Statement

The paper addresses three critical bottlenecks in lifelong robotic learning (continual reinforcement learning, CRL):

Catastrophic Forgetting: As agents learn new tasks, they tend to overwrite knowledge from previous tasks, undermining long-term autonomy.
Reward Specification: Designing dense, well-shaped reward functions for complex manipulation tasks is labor-intensive and often impractical. Relying on sparse rewards or manual design limits scalability.
The Algorithm-System Divide: Existing CRL algorithms often ignore system-level optimizations (like parallelization), while high-throughput systems rarely integrate advanced continual learning mechanisms. Furthermore, visual reward learning models often fail under distribution shifts when an agent explores states not seen in expert data.

2. Methodology

ProgAgent proposes a unified framework that integrates progress-aware reward learning with a JAX-native, high-throughput architecture.

A. Progress-Aware Reward as a Learned Potential Function

Instead of manual reward engineering, ProgAgent derives dense rewards from unlabeled expert videos using a perceptual model ( $E_\phi$ ).

Mechanism: The model takes an observation triplet (initial state $o_i$ , current state $o_j$ , goal state $o_g$ ) and predicts a Gaussian distribution over the progress ratio $\delta = |j-i|/|g-i|$ .
Theoretical Basis: The mean prediction is interpreted as a state-potential function $\Phi_\phi(o_t)$ . The reward is the discounted difference in potential between consecutive observations: $r_t = \gamma\Phi_\phi(o_t) - \Phi_\phi(o_{t-1})$ .
Benefit: This provides dense, monotonic guidance aligned with expert behavior without requiring action labels. Because the reward has the potential-based shaping form, it theoretically guarantees policy invariance (the optimal policy is preserved).
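The progress label and the shaped reward can be illustrated with a minimal sketch. The function names and the 5-frame clip are hypothetical, and the true progress labels stand in for the learned potential $\Phi_\phi$; this is not the paper's implementation. The sketch also checks a standard property of potential-based rewards: the discounted return telescopes, so it depends only on the potentials at the two ends of the trajectory.

```python
def progress_ratio(i, j, g):
    """Progress label delta = |j - i| / |g - i| for frame indices
    i (initial), j (current), and g (goal)."""
    if g == i:
        raise ValueError("goal frame must differ from the initial frame")
    return abs(j - i) / abs(g - i)


def shaped_rewards(potentials, gamma):
    """r_t = gamma * Phi(o_t) - Phi(o_{t-1}) for t = 1..T,
    given the potentials Phi(o_0), ..., Phi(o_T)."""
    return [gamma * potentials[t] - potentials[t - 1]
            for t in range(1, len(potentials))]


# Hypothetical 5-frame expert clip: frame 0 is the start, frame 4 is the goal.
# Assumption: the true progress labels stand in for the learned potential.
potentials = [progress_ratio(0, j, 4) for j in range(5)]
gamma = 0.99
rewards = shaped_rewards(potentials, gamma)

# Telescoping check: sum_t gamma^(t-1) * r_t == gamma^T * Phi(o_T) - Phi(o_0).
discounted_return = sum(gamma ** k * r for k, r in enumerate(rewards))
endpoint_form = gamma ** len(rewards) * potentials[-1] - potentials[0]
```

On this steadily progressing clip every reward is positive, which is the "progress bar filling up" behavior described in the lay summary.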

B. Adversarial Push-Back Refinement

To address the issue of out-of-distribution (OOD) states encountered during online exploration:

Problem: Pure progress models may become overconfident on novel, non-expert states, generating false positive rewards that lead to "reward hacking."
Solution: An adversarial push-back loss ( $L_{push}$ ) is introduced. It regularizes the reward model by pushing predictions on non-expert trajectories toward a low-confidence prior (zero-mean, high-variance).
Result: This ensures the reward model remains conservative and robust during exploration, preventing misleading gradients.
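One way to picture the push-back idea is sketched below. The summary does not give the exact form of $L_{push}$, so the KL-divergence form, the `prior_std` value, and the function names here are assumptions made for illustration only. Expert frames are fit with a Gaussian negative log-likelihood on the progress label, while predictions on non-expert frames are penalized for deviating from a zero-mean, high-variance prior.

```python
import math


def gaussian_nll(mean, std, target):
    """Negative log-likelihood of `target` under N(mean, std^2).
    Used here to fit progress labels on expert frames."""
    var = std ** 2
    return 0.5 * math.log(2 * math.pi * var) + (target - mean) ** 2 / (2 * var)


def push_back_loss(mean, std, prior_std=1.0):
    """KL( N(mean, std^2) || N(0, prior_std^2) ).
    Assumption: the KL form and prior_std are illustrative choices,
    not the paper's definition of L_push."""
    return (math.log(prior_std / std)
            + (std ** 2 + mean ** 2) / (2 * prior_std ** 2)
            - 0.5)


# A confident "90% done" prediction on a non-expert frame is penalized...
overconfident = push_back_loss(mean=0.9, std=0.05)
# ...while a prediction that already matches the prior incurs no penalty.
at_prior = push_back_loss(mean=0.0, std=1.0)
```

Under this choice the penalty grows both when the predicted mean moves away from zero and when the predicted variance shrinks, which matches the stated goal of discouraging confident progress estimates on unfamiliar states.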

C. JAX-Native High-Throughput Architecture

ProgAgent leverages JAX to create a fully differentiable, Just-In-Time (JIT) compiled training loop.

Parallelization: Uses jax.vmap to run thousands of environment rollouts in parallel on GPUs.
Unified Loop: The entire process (data collection, reward model updates, and policy optimization) is compiled into a single kernel, eliminating host-device communication bottlenecks.
Scalability: This allows for massive batch sizes and rapid data generation, making complex continual learning objectives computationally feasible.
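The parallelization pattern can be sketched as follows, assuming JAX is installed. The 1-D toy environment, the random policy, and the batch size of 4096 are hypothetical stand-ins, not the paper's setup. The sketch writes a single rollout with `jax.lax.scan`, vectorizes it over a batch of random keys with `jax.vmap`, and compiles the result with `jax.jit`.

```python
import jax
import jax.numpy as jnp


def rollout(key, num_steps=50):
    """One rollout of a toy 1-D environment (hypothetical stand-in for a
    robot task): a point starts at 0 and takes random steps, clipped to [0, 1]."""
    def step(pos, step_key):
        # Random policy: assumption made purely for illustration.
        action = jax.random.uniform(step_key, minval=-0.1, maxval=0.2)
        new_pos = jnp.clip(pos + action, 0.0, 1.0)
        return new_pos, new_pos  # (next carry, per-step output)

    step_keys = jax.random.split(key, num_steps)
    final_pos, trajectory = jax.lax.scan(step, jnp.float32(0.0), step_keys)
    return final_pos, trajectory


# vmap maps the single-rollout function over a batch of keys;
# jit compiles the whole batched rollout into one program.
batched_rollout = jax.jit(jax.vmap(rollout))

keys = jax.random.split(jax.random.PRNGKey(0), 4096)
final_positions, trajectories = batched_rollout(keys)
```

Because the environment step is itself a pure JAX function, the whole batch runs on the accelerator without returning to Python between steps, which is the property the unified loop relies on.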

D. Unified Continual Learning Objective

The policy is optimized using a hybrid objective that balances plasticity (learning new tasks) and stability (retaining old knowledge):
$\mathcal{L}_{total}(\theta) = \mathcal{L}_{PPO}(\theta; \phi) + \lambda_1 \mathcal{L}_{replay}(\theta) + \lambda_2 \mathcal{L}_{SI}(\theta)$

$\mathcal{L}_{PPO}$ : Standard PPO loss using the shaped rewards.
$\mathcal{L}_{replay}$ (Coreset Replay): Replays advantage-weighted samples from a compact buffer of past tasks.
$\mathcal{L}_{SI}$ (Synaptic Intelligence): Penalizes changes to parameters identified as important for previous tasks, acting as a dynamic regularizer.
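How the three terms combine can be shown with a small numeric sketch. The Synaptic Intelligence penalty below uses the standard quadratic form $\sum_k \Omega_k (\theta_k - \theta_k^*)^2$; the parameter values, importance weights, loss values, and $\lambda$ settings are hypothetical numbers chosen for illustration, and the PPO and replay losses are passed in as plain scalars rather than computed.

```python
def si_penalty(params, anchor_params, importance):
    """Synaptic Intelligence penalty: sum_k Omega_k * (theta_k - theta*_k)^2,
    where theta* are the parameters saved after the previous task."""
    return sum(w * (p - a) ** 2
               for p, a, w in zip(params, anchor_params, importance))


def total_loss(ppo_loss, replay_loss, si_loss, lambda_1, lambda_2):
    """L_total = L_PPO + lambda_1 * L_replay + lambda_2 * L_SI."""
    return ppo_loss + lambda_1 * replay_loss + lambda_2 * si_loss


# Hypothetical values for a 3-parameter policy.
theta = [0.5, -1.0, 2.0]        # current parameters
theta_star = [0.5, -0.5, 0.0]   # parameters after the previous task
omega = [10.0, 4.0, 0.0]        # per-parameter importance weights

# Only the second parameter contributes: it moved and it is important.
penalty = si_penalty(theta, theta_star, omega)
loss = total_loss(ppo_loss=0.8, replay_loss=0.3, si_loss=penalty,
                  lambda_1=0.5, lambda_2=0.1)
```

The third parameter moves the most but has zero importance, so it is free to change; this is how the regularizer leaves room for plasticity while protecting parameters that matter for earlier tasks.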

3. Key Contributions

Progress-Aware Reward Model: A novel method to extract dense, shaped rewards from unlabeled videos, theoretically grounded as a state-potential function that aligns exploration with expert trajectories.
Adversarial Refinement: A mechanism to stabilize reward learning against distribution shifts by regularizing predictions on non-expert trajectories, preventing overconfidence and reward hacking.
Unified JAX-Native System: A high-throughput architecture that JIT-compiles the entire training loop, enabling the practical integration of advanced continual learning techniques (SI + Coreset) with perceptual reward learning at scale.

4. Experimental Results

The authors evaluated ProgAgent on ContinualBench (button-press, door-open, window-close) and Meta-World benchmarks.

Performance: ProgAgent significantly outperformed state-of-the-art baselines, including:
- Visual Reward Methods: Rank2Reward, TCN, GAIL.
- Continual Learning Methods: SI, Coreset, Online Agent (OA).
- Idealized Baseline: It even surpassed the "Perfect Memory" agent (which has access to all historical data), demonstrating that architectural efficiency and high-quality reward shaping can outperform unbounded memory.
Metrics:
- Success Rate: Achieved ~98.8% on button-press (vs. 75.3% for Perfect Memory).
- Average Performance (AP): Highest across all tasks, indicating superior knowledge retention.
- Regret: Lowest among the compared methods, indicating faster learning.
Qualitative Analysis:
- The learned potential function showed a smooth, monotonically increasing curve for successful trajectories, while failure trajectories remained flat or low, confirming the model correctly penalizes non-progressive behavior.
Ablation Studies:
- Removing the adversarial push-back led to severe performance drops under distribution shift.
- Removing the continual learning regularizers caused catastrophic forgetting, supporting the need for the unified objective.

5. Significance

Bridging the Gap: ProgAgent successfully bridges the gap between algorithmic innovation (continual learning, reward learning) and system-level engineering (JAX, parallelization), solving the scalability issues that previously hindered unified CRL agents.
Robustness: The adversarial push-back mechanism provides a robust solution to the "distribution shift" problem common in online reinforcement learning, making the agent safer and more reliable in real-world scenarios.
Practicality: By learning from noisy, few-shot human demonstrations without action labels, ProgAgent offers a scalable path toward autonomous robots that can continuously acquire complex manipulation skills in dynamic environments.
Paradigm Shift: The results suggest that computational efficiency and high-quality reward shaping can be more impactful than simply increasing memory capacity, challenging the traditional view of the stability-plasticity dilemma.