Sim-to-reality adaptation for Deep Reinforcement Learning applied to an underwater docking application

This paper presents a systematic sim-to-reality adaptation framework that combines a high-fidelity Stonefish digital twin with PPO-based Deep Reinforcement Learning, achieving over 90% successful autonomous docking for the Girona AUV, validated by physical experiments that demonstrate emergent control behaviors.

Alaaeddine Chaarani, Narcis Palomeras, Pere Ridao

Published 2026-03-13
📖 4 min read · ☕ Coffee break read

Imagine you are teaching a toddler how to park a car in a very tight, underwater garage. But there's a catch: the ocean is dark, the water pushes the car around unpredictably, and if you bump the wall too hard, the car breaks.

Teaching a robot to do this using old-school programming is like giving the toddler a rigid set of instructions: "Turn left 30 degrees, then stop." If the water pushes the car even a tiny bit, the instructions fail, and the robot crashes.

This paper is about a smarter way to teach the robot: Deep Reinforcement Learning (DRL). Think of this as letting the robot learn by trial and error, just like a human learns to ride a bike.
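The trial-and-error loop at the heart of DRL can be sketched with a toy example. This is not the paper's method (the authors use PPO in a full 3-D simulator); it is a minimal tabular Q-learning "dock on a 1-D line" toy, written here purely to show how a policy emerges from rewards alone, with no hand-written instructions:

```python
import random

# Toy "learning to dock" on a 1-D line: the dock sits at position 0,
# the robot starts somewhere in 1..10 and picks actions -1 (toward the
# dock) or +1 (away). No one tells it which action is "correct" -- it
# only ever sees a reward signal, just like the AUV in the paper.
ACTIONS = [-1, +1]
Q = {}  # Q[(position, action)] -> estimated long-term score

def q(s, a):
    return Q.get((s, a), 0.0)

random.seed(0)
alpha, gamma, eps = 0.5, 0.9, 0.2  # learning rate, discount, exploration

for episode in range(500):
    pos = random.randint(1, 10)
    for _ in range(30):
        # Explore sometimes; otherwise pick the best-known action.
        a = random.choice(ACTIONS) if random.random() < eps \
            else max(ACTIONS, key=lambda x: q(pos, x))
        nxt = min(max(pos + a, 0), 10)
        r = 1.0 if nxt == 0 else -0.05  # docked: reward; else a small time cost
        best_next = max(q(nxt, x) for x in ACTIONS)
        Q[(pos, a)] = q(pos, a) + alpha * (r + gamma * best_next - q(pos, a))
        pos = nxt
        if pos == 0:
            break

# After enough trials, the greedy policy points toward the dock
# from every starting position -- learned, not programmed.
policy = {s: max(ACTIONS, key=lambda a: q(s, a)) for s in range(1, 11)}
print(policy)
```

The same idea scales up in the paper: replace the 1-D line with the Stonefish simulator and the lookup table with a PPO-trained neural network.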

Here is the story of how the researchers taught the Girona AUV (an underwater robot) to dock, broken down into simple concepts:

1. The "Video Game" Problem (Sim-to-Real)

You can't let a real robot crash 10,000 times in a real ocean to learn how to park; it would be too expensive and dangerous. So, the researchers built a super-realistic video game (a "digital twin") using a simulator called Stonefish.

  • The Analogy: Imagine a flight simulator for pilots. It looks and feels like flying a real plane, but if you crash, you just hit "reset" and try again.
  • The Challenge: Usually, what a robot learns in a video game doesn't work in the real world because the game physics are too perfect. The real ocean has messy currents and noisy sensors.
  • The Fix: The researchers made their "game" messy on purpose. They added fake sensor noise and random water currents so the robot learned to be tough, not just perfect in a sterile environment.
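The "messy on purpose" trick is known as domain randomization. A minimal sketch of the idea follows; the function names, noise levels, and current ranges are illustrative assumptions, not the paper's actual implementation:

```python
import random

# Domain randomization sketch: each training episode draws random
# physics (here, a constant water current), and every sensor reading
# the policy sees is corrupted with noise. The policy therefore cannot
# overfit to a perfectly clean, sterile simulator.

def randomize_episode(rng):
    """Draw per-episode physics: a constant current in m/s (illustrative ranges)."""
    return {
        "current_x": rng.uniform(-0.3, 0.3),
        "current_y": rng.uniform(-0.3, 0.3),
    }

def noisy_observation(true_state, rng, sigma=0.02):
    """Corrupt each ground-truth sensor channel with Gaussian noise."""
    return {k: v + rng.gauss(0.0, sigma) for k, v in true_state.items()}

rng = random.Random(42)
physics = randomize_episode(rng)  # the world is different every episode
obs = noisy_observation({"distance_to_dock": 5.0, "heading_error": 0.1}, rng)
print(physics, obs)
```

Because the current and the noise change every episode, the only strategies that survive training are the ones that work across all of them, which is exactly the robustness the real ocean demands.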

2. The Speed Hack (Multiprocessing)

Learning takes a long time. If the robot tried one move every second, it would take years to learn.

  • The Analogy: Imagine trying to learn a new language by reading one word a day. Now imagine you have 20 friends reading 20 words a day and teaching you simultaneously.
  • The Fix: They ran 20 copies of the simulation at the same time on their computer. In the same 3 hours of wall-clock time, the robot racked up the equivalent of 60 hours of practice. This is like fast-forwarding the robot's life to get it "mature" quickly.
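The "20 friends" trick is usually called a vectorized environment. Here is a toy sketch of the mechanism: a batch of independent simulator copies stepped together, so one tick yields 20 transitions. The class names and the toy dynamics are invented for illustration; in practice each copy runs in its own OS process (e.g. a subprocess-based vectorized environment in an RL library), which this single-process sketch does not show:

```python
import random

class ToyDockingEnv:
    """Stand-in for one simulator instance (illustrative, not Stonefish)."""
    def __init__(self, seed):
        self.rng = random.Random(seed)
        self.reset()

    def reset(self):
        self.distance = self.rng.uniform(5.0, 10.0)
        return self.distance

    def step(self, action):
        # Move toward the dock, with a little random disturbance.
        self.distance = max(self.distance - action + self.rng.gauss(0, 0.05), 0.0)
        reward = -self.distance
        done = self.distance == 0.0
        return self.distance, reward, done

class VecEnv:
    """Step N env copies together; return batched observations and rewards."""
    def __init__(self, n):
        self.envs = [ToyDockingEnv(seed=i) for i in range(n)]

    def reset(self):
        return [e.reset() for e in self.envs]

    def step(self, actions):
        results = [e.step(a) for e, a in zip(self.envs, actions)]
        obs, rewards, dones = map(list, zip(*results))
        # Auto-reset finished copies so training never stalls.
        for i, d in enumerate(dones):
            if d:
                obs[i] = self.envs[i].reset()
        return obs, rewards, dones

vec = VecEnv(20)
obs = vec.reset()
obs, rewards, dones = vec.step([0.5] * 20)  # one tick = 20 transitions
print(len(obs), len(rewards))
```

The learner consumes these batched transitions, so experience accumulates 20 times faster than with a single simulator.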

3. The Teacher's Scorecard (Reward Function)

How does the robot know if it's doing a good job? It needs a scorecard. The researchers designed a complex scoring system:

  • Distance Points: You get points for getting closer to the dock.
  • Angle Points: You get points for lining up straight.
  • Smoothness Points: If you jerk the controls around wildly, you lose points. The robot learns to move gently, like a cat, rather than lurching like a drunk driver.
  • The "Ouch" Penalty: If the robot hits the dock too hard, it gets a big negative score. This teaches it to "brake" before impact.
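The scorecard above can be sketched as a single function. The weights, term shapes, and thresholds below are illustrative guesses to show how the four incentives combine, not the paper's exact reward function:

```python
import math

def docking_reward(distance, prev_distance, angle_error,
                   action, prev_action, collided, impact_speed):
    """Combine the four scorecard terms into one number per time step."""
    r = 0.0
    # Distance points: reward progress toward the dock.
    r += 2.0 * (prev_distance - distance)
    # Angle points: the straighter the line-up, the more points.
    r += 0.5 * (math.pi - abs(angle_error))
    # Smoothness points: penalize jerking the controls between steps.
    r -= 0.1 * sum((a - b) ** 2 for a, b in zip(action, prev_action))
    # The "ouch" penalty: a big negative score for a hard impact.
    if collided and impact_speed > 0.2:
        r -= 50.0
    return r

# Gentle, well-aligned progress scores far higher than a jerky hard hit.
good = docking_reward(4.0, 4.5, 0.05, [0.2, 0.0], [0.2, 0.0], False, 0.0)
bad = docking_reward(4.4, 4.5, 1.0, [1.0, -1.0], [-1.0, 1.0], True, 0.5)
print(good > bad)
```

The key design point is that all four terms are summed every step, so the robot is constantly trading them off: rushing in scores distance points but risks the collision penalty, which is how "braking before impact" becomes the highest-scoring strategy.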

4. The "Aha!" Moments (Emergent Behaviors)

The most exciting part is what the robot figured out on its own. The researchers didn't program the robot to do these specific tricks; the robot invented them to get a better score.

  • The "Pitch Brake": When approaching the dock, the robot learned to tilt its nose up (pitch) to use water resistance as a brake, slowing itself down smoothly.
  • The "Wiggle Dance": As it got very close, the robot started shaking its tail (yaw oscillation) slightly. This helped it slide perfectly into the narrow docking funnel, correcting tiny misalignments that a standard computer program would miss.

5. The Real-World Test

Finally, they took the robot out of the "video game" and into a real, 19-meter-long water tank.

  • The Result: The robot, which had never seen the real tank before, successfully docked 8 out of 10 times.
  • Why it matters: It proved that the "video game" training was so good that the robot didn't get confused when the real water pushed it or when the sensors were a bit fuzzy. It transferred its "muscle memory" from the computer to the real world.

The Big Picture

This paper shows that we don't need to write complex code to tell a robot exactly how to move. Instead, we can build a realistic, slightly chaotic "training camp" in a computer, let the robot play thousands of games to figure out the best moves, and then send it out to do the real job.

It's like training a dog not by commanding "Sit, Stay, Roll Over," but by playing fetch in a park until the dog figures out the best way to catch the ball, even if the wind blows it off course. The result? A robot that is adaptable, robust, and surprisingly clever.