Deep deterministic policy gradient with symmetric data augmentation for lateral attitude tracking control of a fixed-wing aircraft

This paper proposes a sample-efficient off-policy reinforcement learning approach for fixed-wing aircraft lateral attitude tracking that leverages system symmetry to augment training data and employs a dual-critic Deep Deterministic Policy Gradient (DDPG) structure to accelerate policy convergence.

Original authors: Yifei Li, Erik-Jan van Kampen

Published 2026-04-14

This is an AI-generated explanation of the paper below. It is not written or endorsed by the authors. For technical accuracy, refer to the original paper.

Imagine you are trying to teach a robot pilot how to fly a plane. The plane is tricky; if you turn the stick to the left, it banks left. If you turn it to the right, it banks right. The physics are perfectly symmetric: a left bank is the exact mirror image of a right bank.

Usually, to teach a robot this, you let it crash and try again thousands of times (a process called Reinforcement Learning). But flying a real plane is expensive, and crashing is dangerous. So, we use a simulator. Even in a simulator, the robot has to "explore" the sky to learn. It might spend hours practicing only left turns, never realizing that a right turn is just the mirror image of what it already knows. This is inefficient.

This paper proposes a clever trick to make the robot learn twice as fast without flying twice as much. Here is how it works, broken down into simple concepts:

1. The "Mirror World" Trick (Symmetric Data Augmentation)

Imagine you are learning to juggle. You practice juggling three balls with your right hand. You get pretty good at it.
Now, imagine you have a magical mirror. If you look in the mirror, you see yourself juggling with your left hand. You didn't actually do the left-hand juggling, but your brain knows exactly how it would feel because the physics are symmetrical.

In this paper, the researchers do the same thing with flight data.

  • The Problem: The robot flies a simulation, collects experience tuples (state, action, reward, next state), and stores them.
  • The Trick: The computer instantly creates a "mirror image" of each stored tuple. If the robot flew at a bank angle of +10 degrees with an aileron input of +5 degrees, the computer creates a synthetic data point: bank angle -10 degrees with aileron input -5 degrees.
  • The Result: The robot thinks it has flown both scenarios. It learns the "left" side of the sky by studying the "right" side. This doubles the amount of training data without the robot ever having to fly the extra miles (a code sketch of this step follows the list).
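Here is a minimal sketch of what that mirroring step could look like in code. It assumes every component of the lateral state and action simply flips sign under the symmetry, and that the reward depends only on the size of the tracking error; the variable names and numbers are illustrative, not taken from the paper.

```python
import numpy as np

def mirror(state, action, reward, next_state):
    """Create the symmetric twin of one recorded transition.

    Assumption: every lateral state component (e.g. roll angle,
    roll rate) and every control input flips sign under the
    left/right symmetry, while the reward is unchanged because
    it penalizes only the magnitude of the tracking error.
    """
    return -state, -action, reward, -next_state

# One real transition yields two training samples:
s  = np.array([10.0, 2.0])    # e.g. bank angle +10 deg, roll rate +2 deg/s
a  = np.array([5.0])          # e.g. aileron deflection +5 deg
r  = -1.2                     # tracking-error penalty (sign-independent)
s2 = np.array([9.0, 1.5])     # state after the action

real_sample     = (s, a, r, s2)
mirrored_sample = mirror(s, a, r, s2)   # (-s, -a, r, -s2), "for free"
```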

2. The "Two-Coach" System (Dual-Critic Structure)

In Reinforcement Learning, there are usually two main characters:

  • The Actor: The pilot (the brain that decides what to do).
  • The Critic: The coach (the brain that says, "Good job!" or "That was a bad move").

The authors realized that if you mix the "real" flight data with the "mirror" data in one big bucket, the coach gets confused. The real and mirrored samples are perfectly correlated, so blending them is like a coach grading a student's homework and a mirrored photocopy of it as if they were two independent assignments. The signal gets muddy.

Their Solution: They built two separate coaches.

  • Coach 1 only looks at the real flight data the robot actually experienced.
  • Coach 2 only looks at the mirror (augmented) data.

The robot (the Actor) listens to both coaches. Coach 1 says, "You did well here in the real world." Coach 2 says, "And based on the mirror world, you would have done well there too." By separating the lessons, the robot learns more efficiently because it isn't getting mixed signals.
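To make this concrete, here is a rough PyTorch sketch of a dual-critic setup, assuming small fully connected networks; the paper's actual architectures, hyperparameters, and target-network machinery are not reproduced here.

```python
import torch
import torch.nn as nn

def mlp(in_dim, out_dim):
    return nn.Sequential(nn.Linear(in_dim, 64), nn.ReLU(),
                         nn.Linear(64, out_dim))

state_dim, action_dim = 4, 2                      # illustrative sizes
actor         = mlp(state_dim, action_dim)        # the pilot
critic_real   = mlp(state_dim + action_dim, 1)    # coach 1: real data only
critic_mirror = mlp(state_dim + action_dim, 1)    # coach 2: mirror data only

def critic_loss(critic, batch, gamma=0.99):
    """Standard DDPG-style temporal-difference loss for one coach.
    (Target networks, used in full DDPG for stability, are omitted.)"""
    s, a, r, s2 = batch
    with torch.no_grad():
        target = r + gamma * critic(torch.cat([s2, actor(s2)], dim=-1))
    return nn.functional.mse_loss(critic(torch.cat([s, a], dim=-1)), target)

def actor_loss(s_real, s_mirror):
    """The pilot listens to both coaches: each critic grades the actor's
    proposed actions on its own slice of the data."""
    q_real   = critic_real(torch.cat([s_real, actor(s_real)], dim=-1))
    q_mirror = critic_mirror(torch.cat([s_mirror, actor(s_mirror)], dim=-1))
    return -(q_real.mean() + q_mirror.mean())
```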

3. The "Two-Step" Dance (Two-Step Approximate Value Iteration)

Instead of performing a single learning update per round of experience, the algorithm performs a two-step dance:

  1. Step 1: The robot practices on the real data. It updates its pilot skills based on what actually happened.
  2. Step 2: Immediately after, the robot practices on the mirror data. It updates its pilot skills again, this time using the "what if" scenarios.

This allows the robot to refine its strategy twice for every batch of real experience it collects, leading to much faster learning.
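Continuing the sketch from the previous section (reusing actor, critic_real, critic_mirror, and the two loss functions defined there), one such two-step training update could look like this; the optimizers are assumed to be ordinary torch.optim.Adam instances:

```python
def train_step(real_batch, opt_actor, opt_c_real, opt_c_mirror):
    """One 'two-step dance' on a single batch of real experience."""
    s, a, r, s2 = real_batch
    mirror_batch = (-s, -a, r, -s2)        # the sign-flip from Section 1

    # Step 1: coach 1 learns from what actually happened.
    opt_c_real.zero_grad()
    critic_loss(critic_real, real_batch).backward()
    opt_c_real.step()

    # Step 2: coach 2 learns from the "what if" mirror scenarios.
    opt_c_mirror.zero_grad()
    critic_loss(critic_mirror, mirror_batch).backward()
    opt_c_mirror.step()

    # Finally, the pilot is refined using both coaches' feedback.
    opt_actor.zero_grad()
    actor_loss(s, -s).backward()
    opt_actor.step()
```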

4. The Smoothness Rule (CAPS)

There's one more problem: robots can be jerky. Nothing in the basic training objective stops the controller from snapping between extreme commands at every time step.
The authors added a regularization technique called Conditioning for Action Policy Smoothness (CAPS). Think of it as a "smooth driving" penalty: a sudden, jerky control move earns a "frown" (a penalty), while a smooth, gradual turn gets a "thumbs up." This ensures the final flight control isn't just accurate, but also smooth and safe, just like a human pilot.
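In the original CAPS formulation (Mysore et al.), the smoothness rule is implemented as two extra penalty terms on the actor: a temporal term that keeps consecutive actions close, and a spatial term that keeps actions similar for nearby states. A sketch, with the weights and noise scale as tuning choices rather than values from the paper:

```python
import torch

def caps_penalty(actor, s, s_next, sigma=0.05, w_temporal=1.0, w_spatial=1.0):
    """CAPS-style smoothness regularization (sketch).

    Added to the actor's loss so that jerky policies score worse.
    """
    a_now = actor(s)
    # Temporal smoothness: the action now should be close to the
    # action the policy would take at the next state.
    l_temporal = torch.norm(a_now - actor(s_next), dim=-1).mean()
    # Spatial smoothness: a slightly perturbed state should produce
    # nearly the same action.
    s_noisy = s + sigma * torch.randn_like(s)
    l_spatial = torch.norm(a_now - actor(s_noisy), dim=-1).mean()
    return w_temporal * l_temporal + w_spatial * l_spatial
```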

The Big Picture: Why Does This Matter?

The researchers tested this on a fixed-wing aircraft model.

  • Without the trick: The robot had to fly a lot to learn how to handle turns in both directions. It struggled when asked to fly in a direction it hadn't explored much.
  • With the trick: The robot learned much faster. Because it "imagined" the mirror world, it could handle turns in the negative direction (which it never actually flew) just as well as the positive direction.

In summary: This paper teaches us that we don't always need to experience everything to learn it. By understanding the symmetry of the world (that left is just the mirror of right), we can create a "virtual training ground" that doubles our learning speed, saves computational power, and makes AI pilots smarter and safer.
