Imagine you are teaching a toddler how to drive a race car.
The Old Way (Standard Residual Policy Learning, or RPL):
You strap the toddler into the driver's seat, but you also have a professional racing instructor sitting right next to them, holding a second steering wheel. The toddler can turn the wheel, but the instructor's wheel is slightly stronger. If the toddler tries to make a crazy move, the instructor's wheel overpowers them and keeps the car safe.
This works great for the first few laps. The toddler learns the basics without crashing. But here's the problem: You can never take the instructor out. Even after the toddler becomes a pro, the instructor is still there, fighting against the toddler's new, faster ideas. The car is slower because it's constantly being held back by the "safety net." Also, the instructor needs a map and GPS to do their job, so the car needs expensive, heavy equipment just to run.
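In reinforcement-learning terms, the "two steering wheels" are just added actions: the deployed command is always the base controller's output plus a learned correction. Here is a minimal sketch of that composition (all names and numbers are illustrative, not the paper's code):

```python
import numpy as np

# Illustrative stand-ins: a real base controller would be a map-based
# planner that needs localization (the "GPS and map"), and the residual
# would be a trained neural network reading raw sensors.
def base_policy(state_estimate):
    return np.array([0.1, 0.5])      # [steering, throttle] from the planner

def residual_policy(observation):
    return np.array([0.05, 0.0])     # small learned correction

def rpl_action(observation, state_estimate):
    # Standard RPL: every deployed action is base + residual, so the
    # base controller (and the sensors it needs) can never be removed.
    a = base_policy(state_estimate) + residual_policy(observation)
    return np.clip(a, -1.0, 1.0)

print(rpl_action(None, None))        # -> [0.15 0.5]
```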
The New Way (This Paper's "α-RPO"):
The authors of this paper came up with a smarter training method called Attenuated Residual Policy Optimization (α-RPO). Think of it as a "fading mentor" approach.
The Fading Mentor: You start the same way, with the toddler (the AI) guided by the instructor (the base policy). But as training goes on, you slowly turn down the volume on the instructor (a code sketch follows this list):
- Early training: The instructor is loud and clear, guiding the car safely.
- Mid training: The instructor starts whispering suggestions.
- End training: The instructor is completely silent. The toddler is now driving alone, but they learned how to drive because of the instructor's early help.
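In code, the "volume knob" is a coefficient α that scales the base policy's action and is annealed toward zero over training. A minimal sketch, assuming a linear schedule (the paper's actual schedule and these function names are assumptions):

```python
import numpy as np

def alpha(step, total_steps):
    # Instructor "volume": 1.0 early in training, 0.0 at the end.
    # A linear schedule is an assumption; the paper's may differ.
    return max(0.0, 1.0 - step / total_steps)

def attenuated_action(a_base, a_residual, step, total_steps):
    # alpha-RPO-style composition: the base policy's share fades out,
    # so the final policy no longer depends on the base controller.
    a = alpha(step, total_steps) * a_base + a_residual
    return np.clip(a, -1.0, 1.0)

a_base = np.array([0.1, 0.5])
a_res = np.array([0.05, 0.0])
print(attenuated_action(a_base, a_res, step=0, total_steps=100))    # instructor loud
print(attenuated_action(a_base, a_res, step=100, total_steps=100))  # instructor silent
```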
The "Ghost" Advantage: Because the instructor is eventually removed, the final driver doesn't need the instructor's expensive tools (like GPS or complex maps). They can drive using only what they can see right in front of them (like a camera or laser scanner). This makes the car lighter, faster, and cheaper to build.
The Secret Sauce (Synchronization): There was a risk that if you turned the instructor's volume down too fast, the student would get confused because the rules of the game kept changing. The authors invented a "synchronization trick." It's like the teacher whispering the old rules while the student practices, but grading the student based on the new rules. This keeps the learning stable and prevents the student from panicking.
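This summary doesn't spell out the exact mechanism, but one plausible reading of the analogy is that experience collected under the old α is re-expressed under the new α before each update, so the student is always graded against the current rules. A hypothetical sketch (the function and values are illustrative assumptions, not the paper's method):

```python
import numpy as np

def resynchronize(a_executed, a_base, alpha_old, alpha_new):
    # The transition was collected while the instructor spoke at
    # alpha_old ("the old rules"), so first recover what the student
    # actually contributed...
    a_residual = a_executed - alpha_old * a_base
    # ...then re-express the same behavior under the quieter instructor
    # ("grading on the new rules"), keeping learning targets consistent
    # even as alpha shrinks between collection and update.
    return alpha_new * a_base + a_residual

a_exec = np.array([0.15, 0.5])
a_base = np.array([0.1, 0.5])
print(resynchronize(a_exec, a_base, alpha_old=1.0, alpha_new=0.9))
```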
Why This Matters for Real Life
The team tested this on tiny 1:10-scale race cars (a platform called Roboracer) that race around miniature tracks.
- In the Simulation: The "fading mentor" cars were faster and crashed less than the cars that kept the instructor forever. They learned to take corners more aggressively and drive closer to the wall, which is how real race cars win.
- In the Real World: This is the magic part. They trained the cars in a computer simulation and then put them on a real track without any extra tuning. The car, which had never seen the real track before, drove perfectly. It didn't need a map or GPS; it just reacted to the walls in front of it.
The Bottom Line
This paper solves a big problem in robotics: How do you teach a robot to be safe while learning, without making it dependent on that safety net forever?
By using α-RPO, they created a system where the robot learns from a "crutch" but eventually kicks it away. The result is a robot that is:
- Smarter: It learns faster and drives better.
- Simpler: It doesn't need complex, heavy equipment to run.
- Ready for Reality: It can jump from a computer simulation to the real world instantly (zero-shot transfer) and handle obstacles like a pro.
It's like teaching a child to ride a bike with training wheels, but instead of taking the wheels off and letting them fall, you slowly shrink the wheels until they disappear, leaving the child perfectly balanced and ready to race.