Imagine you are trying to teach a robot dog how to run a marathon. You have two main ways to do this:
- The "Trial and Error" Method: You let the robot dog start from scratch. It trips, falls, and runs in circles for thousands of miles just to figure out how to move its legs. It learns eventually, but it takes forever and wears out the robot's joints.
- The "Apprentice" Method: You hire a master trainer (an expert) to show the robot dog how to run. The robot watches and copies the master. Then, you let the robot run a few more miles on its own to perfect its style.
This paper is about making the Apprentice Method even better.
The Problem: The "Teacher" vs. The "Grader"
In the world of AI (specifically Reinforcement Learning), there are two key parts to the learning brain:
- The Actor (The Doer): This is the part that decides what action to take (e.g., "lift left leg").
- The Critic (The Grader): This is the part that watches the Actor and says, "Good job!" or "That was a bad move." It estimates how good a situation is.
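The Actor/Critic split above can be sketched in a few lines of toy Python. This is a hypothetical illustration (the class names and the tiny "move toward the goal" policy are mine, not the paper's code), just to show that the two parts answer different questions: "what do I do?" versus "how good is this situation?"

```python
class Actor:
    """The Doer: maps a state to an action."""
    def act(self, state):
        # A toy policy: step toward the goal at position 0.
        return -1 if state > 0 else 1

class Critic:
    """The Grader: estimates how good a state is (its 'value')."""
    def __init__(self):
        self.values = {}  # state -> estimated value

    def value(self, state):
        # An untrained Critic has no opinion: every state scores 0.
        return self.values.get(state, 0.0)

actor, critic = Actor(), Critic()
state = 3
action = actor.act(state)   # the Actor decides: step left
score = critic.value(state) # the Critic grades the situation
```

Note that a freshly built Critic grades everything as 0: that blankness is exactly the problem the next section describes.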
The Old Way:
Most researchers use the Apprentice Method, but they only train the Actor by copying the expert. They leave the Critic completely random and untrained.
- Analogy: Imagine a student (Actor) who has memorized a textbook perfectly. But the teacher grading them (Critic) is a random person who doesn't know the subject and is guessing the grades. The student gets confused because the feedback is inconsistent, slowing down their learning.
The New Idea (Actor-Critic Pretraining):
The authors of this paper say, "Why not train the Critic too?"
They propose a two-step pretraining process before the robot even starts its real training:
- Train the Actor: Copy the expert's moves (Behavioral Cloning).
- Train the Critic: Let the newly trained Actor run around a bit (simulated "rollouts"). Watch what happens, calculate the rewards, and teach the Critic to predict those rewards accurately.
Analogy: Now, the student (Actor) knows the textbook, and the teacher (Critic) has also studied the textbook and watched the student practice. When the student starts the real marathon, the teacher gives perfect, consistent feedback immediately.
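The two-step recipe can be sketched as toy code. Everything here is illustrative and simplified (a lookup-table "actor," made-up dynamics and rewards), not the paper's implementation, but the shape matches the idea: clone the expert first, then roll out the cloned actor and fit the critic to the returns those rollouts actually produce.

```python
# Step 1: Behavioral Cloning. The actor memorizes expert (state, action) pairs.
expert_demos = [(0, 1), (1, 1), (2, 0)]
actor = dict(expert_demos)  # a lookup-table policy cloned from the expert

# Step 2: roll out the cloned actor, record rewards, compute returns.
def rollout(actor, start_state, steps=3):
    state, rewards = start_state, []
    for _ in range(steps):
        action = actor.get(state, 0)
        rewards.append(1.0 if action == 1 else 0.0)  # toy reward
        state = (state + 1) % 3                      # toy dynamics
    return rewards

def discounted_return(rewards, gamma=0.9):
    g = 0.0
    for r in reversed(rewards):  # work backward from the last reward
        g = r + gamma * g
    return g

# "Train" the critic: it learns to predict the return seen from each state.
critic = {s: discounted_return(rollout(actor, s)) for s in range(3)}
```

A real implementation would regress a neural network onto these return targets, but the supervision signal is the same: returns generated by the already-cloned actor, so the grades match the student it will actually be grading.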
The Secret Sauce: Two Extra Tricks
The paper also introduces two clever tweaks to make this work even better:
1. The "Extended Run" (Extended Step Limit)
Sometimes, the simulation stops the robot after a set time, even if it hasn't finished the task. This is like stopping a race just because the clock hit 10 minutes, even if the runner is still on the track. This tricks the AI into thinking the race is over.
- The Fix: The authors tell the AI, "Don't stop yet! Run a little longer than usual so we can see the full picture of the reward." This prevents the AI from getting confused by "fake" endings.
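Here is a small sketch of why the extension helps, with made-up numbers (the function name and reward stream are mine). If the robot keeps earning reward past the step limit, a return computed only up to the limit badly underestimates the value of the last few states inside the window; computing returns over a few extra steps fixes that, while still training only on the original window.

```python
def returns_with_extension(rewards, step_limit, extra, gamma=0.99):
    """Discounted returns for the first `step_limit` steps, computed
    from rewards collected up to step_limit + extra."""
    horizon = step_limit + extra
    g = 0.0
    returns = [0.0] * horizon
    for t in reversed(range(horizon)):
        g = rewards[t] + gamma * g
        returns[t] = g
    return returns[:step_limit]  # only train on the original window

rewards = [1.0] * 10  # the robot keeps earning reward the whole time
naive = returns_with_extension(rewards, step_limit=5, extra=0)
extended = returns_with_extension(rewards, step_limit=5, extra=5)
# naive[4] treats step 5 as the end of the world;
# extended[4] sees that the rewards kept coming.
```

The state right before the cutoff goes from looking nearly worthless (`naive[4]`) to correctly valuable (`extended[4]`), which is exactly the "fake ending" confusion the authors are removing.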
2. The "Residual Brain" (Residual Architecture)
This is a specific way of building the robot's brain.
- Analogy: Imagine the robot has a "muscle memory" part (the backbone) that learned from the expert and a "thinking" part (the head) that learns new tricks.
- Usually, when you fine-tune the robot, you might accidentally overwrite the muscle memory.
- The Fix: The authors connect the "thinking" part directly to the original sensory input (the eyes/ears) via a "residual connection." This ensures that even if the robot learns new things, it never forgets the basic instincts it learned from the expert. It's like having a safety net that keeps the expert's wisdom alive while allowing for new learning.
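The residual idea can be shown with a scalar toy model (the "networks" below are single multiplications, purely for illustration). The pretrained backbone is kept as-is, a new head sees the raw observation through the skip connection, and its output is added on top. If the head starts at zero, fine-tuning begins exactly at the expert-cloned behavior and can only adjust it gradually.

```python
def backbone(obs):
    # Stands in for the expert-pretrained network (kept intact).
    return 2.0 * obs

head_weight = 0.0  # the new head starts at zero: no change at first

def policy(obs):
    # Residual connection: the head's correction is ADDED to the
    # backbone's output, never overwriting it.
    return backbone(obs) + head_weight * obs

before = policy(3.0)   # identical to the expert-cloned backbone
head_weight = 0.5      # pretend fine-tuning nudged the head
after = policy(3.0)    # expert behavior plus a learned correction
```

This is the "safety net": even after the head learns, the expert's backbone output is still a term in every action the robot takes.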
The Results: Speeding Up the Race
The researchers tested this on 15 different robotic tasks (like walking, reaching for objects, and balancing).
- No Pretraining: The robot needed the full amount of practice (the 100% baseline).
- Old Way (Actor only): About 30% less practice than the baseline.
- New Way (Actor + Critic): 86% less practice than the baseline!
In simple terms: If it usually took a robot 100 hours to learn a task, this new method got it to the same level in just 14 hours.
The Catch
It's not magic for every situation.
- You still need an expert to show the robot what to do first. If you don't have an expert, you can't use this method.
- For some very complex tasks (like a humanoid robot with many moving parts), the extra training for the Critic didn't help much.
- We still don't know exactly how much expert data or how many "practice runs" are needed for every single robot.
The Bottom Line
This paper is like upgrading a driving school. Instead of just teaching the student driver how to steer (the Actor), they also teach the instructor (the Critic) how to give better feedback based on the student's actual practice. The result? Students learn to drive safely and efficiently in a fraction of the time.