Imagine you are trying to teach a robot dog to run a marathon while also carrying a tray of full coffee cups without spilling a drop.
If you tell the robot, "Run fast, but don't spill the coffee, and use as little energy as possible," all at once from the very first second, the robot will likely get confused. It might decide that the safest way to avoid spilling coffee is to stand perfectly still. Or it might run so fast that it trips and drops everything. It gets stuck in a "local trap" (what researchers call a local optimum), where it thinks it's doing a good job, but it's actually failing the main task.
This is the problem the paper solves. The authors propose a two-stage training plan (a "curriculum") that separates the hard work from the polishing.
Here is the breakdown using simple analogies:
1. The Problem: The "Overwhelmed Student"
In traditional Reinforcement Learning (RL), we give the robot a single "scorecard" (reward function) that mixes everything together:
- Task: Get to the finish line.
- Behavior: Don't spill coffee, save energy, move smoothly.
When these goals conflict (e.g., moving smoothly takes more time, but you need to be fast), the robot gets stuck. It tries to game the system (called "reward hacking") by finding a cheap trick to get a high score without actually learning the skill.
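The conflict can be made concrete with a toy example. This is a minimal sketch, not the paper's actual reward function; the names and penalty weights are illustrative assumptions:

```python
# A single "scorecard" that blends the task with the behavior rules.
# The weights here are made up for illustration.
def mixed_reward(forward_velocity, energy_used, coffee_spilled):
    task = forward_velocity                               # get to the finish line
    behavior = -2.0 * energy_used - 10.0 * coffee_spilled  # behavior penalties
    return task + behavior

# A robot sprinting toward the goal: fast, but costly and messy.
sprinting = mixed_reward(forward_velocity=3.0, energy_used=1.5, coffee_spilled=0.2)

# A robot standing perfectly still: zero progress, but zero penalties.
standing_still = mixed_reward(forward_velocity=0.0, energy_used=0.0, coffee_spilled=0.0)

# Standing still scores higher -- the "local trap" in action.
assert standing_still > sprinting
```

With penalties this strict, doing nothing beats making progress, which is exactly the "sit perfectly still" failure the analogy describes.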
2. The Solution: The "Two-Stage Curriculum"
The authors suggest teaching the robot in two distinct phases, like a music teacher teaching a student a new song.
Phase 1: The "Rough Draft" (Task Only)
The Analogy: Imagine a writer trying to write a novel. In the first draft, they don't worry about grammar, spelling, or perfect sentence structure. They just focus on getting the story down on paper.
- What the robot does: The robot is told to ignore the "coffee cup" and "energy" rules. It only cares about getting to the finish line.
- Why: This allows the robot to explore wildly, make mistakes, and figure out how to move its legs to run. It learns the core mechanics without being paralyzed by the fear of spilling coffee.
Phase 2: The "Polishing" (Adding Behavior)
The Analogy: Once the story is written, the editor comes in. Now, the writer goes back and fixes the grammar, smooths out the sentences, and makes sure the pacing is perfect.
- What the robot does: The robot is now told, "Okay, you know how to run. Now, let's add the rules: move smoothly, save energy, and don't spill."
- The Secret Sauce: The authors don't just flip a switch. They slowly "turn up the volume" on the behavior rules. It's like a dimmer switch, not a light switch. This prevents the robot from panicking and forgetting how to run.
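The "dimmer switch" idea can be sketched in a few lines. The linear ramp and the step counts below are assumptions for illustration, not the paper's exact schedule:

```python
# A scheduled weight on the behavior terms: zero during phase 1,
# then ramped linearly up to full strength during phase 2.
def behavior_weight(step, phase1_steps=100_000, ramp_steps=50_000):
    if step < phase1_steps:
        return 0.0                     # phase 1: task reward only
    progress = (step - phase1_steps) / ramp_steps
    return min(1.0, progress)          # phase 2: dimmer, not a light switch

def total_reward(step, task_reward, behavior_reward):
    return task_reward + behavior_weight(step) * behavior_reward

# Early on, only the task matters; later, the behavior rules fade in.
assert behavior_weight(0) == 0.0
assert behavior_weight(125_000) == 0.5
assert behavior_weight(1_000_000) == 1.0
```

Because the weight changes a little at each step, the robot's current policy is never suddenly penalized into paralysis; it adapts while the rules tighten.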
3. The "Time Travel" Trick (Reusing Experience)
One of the smartest parts of this paper is how they handle the robot's memory.
The Analogy: Imagine you are learning to drive.
- Old Way: You practice on an empty track (Phase 1). Then, suddenly, you are told to drive in heavy rain (Phase 2). You throw away all your practice logs and start over because the conditions changed.
- This Paper's Way: You practice on the empty track. When you move to the rainy track, you keep your practice logs. You look back at your dry-weather drives and say, "Okay, I turned the wheel here to stay on the road. In the rain, I should do the same, but turn a bit more gently."
The paper uses a "replay buffer" (a memory bank) that stores the robot's past moves. When the rules change, the robot re-evaluates those old moves with the new rules. This makes training much faster and more stable.
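Here is a minimal sketch of that relabeling idea, assuming a simple buffer of stored transitions; the data layout and reward functions are illustrative, not the paper's implementation:

```python
from dataclasses import dataclass

@dataclass
class Transition:
    state: tuple       # where the robot was
    action: tuple      # what it did (e.g., motor torques)
    next_state: tuple  # where it ended up

def task_reward(t: Transition) -> float:
    return t.next_state[0] - t.state[0]            # forward progress

def behavior_penalty(t: Transition) -> float:
    return -0.1 * sum(a * a for a in t.action)     # rough energy cost

def relabel(buffer, weight):
    """Re-score old moves under the CURRENT behavior weight."""
    return [(t, task_reward(t) + weight * behavior_penalty(t)) for t in buffer]

buffer = [
    Transition(state=(0.0,), action=(1.0, 1.0), next_state=(0.5,)),
    Transition(state=(0.5,), action=(2.0, 0.0), next_state=(1.2,)),
]

# Phase 1: behavior rules off. Phase 2: the SAME memories, re-scored.
phase1_batch = relabel(buffer, weight=0.0)
phase2_batch = relabel(buffer, weight=1.0)
```

The states and actions in the buffer never change; only the reward attached to each move is recomputed, so none of the hard-won exploration from phase 1 is thrown away.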
4. The Results: Why It Works
The authors tested this on various robots (walking robots, mobile robots, robotic arms).
- The Baseline (Old Way): When they tried to teach the robot everything at once, it often failed, especially if the "behavior" rules (like saving energy) were too strict. The robot would just sit still to save energy.
- The New Way (Two-Stage): The robot learned the task first, then learned to be polite and efficient. It succeeded much more often and was less sensitive to how strict the rules were.
Summary
Think of this method as learning to ride a bike with training wheels, but the training wheels are removed gradually, not all at once.
- First: Learn to balance and pedal (the Task).
- Then: Learn to ride smoothly and efficiently (the Behavior).
- Finally: You have a robot that can run a marathon and carry coffee without spilling, because it learned the basics before worrying about the details.
This approach solves the "conflicting goals" problem by ensuring the robot masters the goal before it tries to master the style.