Learning Acrobatic Flight from Preferences

Imagine you are trying to teach a tiny, high-speed drone to perform incredible acrobatic stunts, like a loop-the-loop or a figure-eight in the air.

In the old days of robotics, teaching a drone this was like trying to teach a dog a new trick by writing a very long, complicated rulebook. You'd have to tell the drone: "If you tilt 15 degrees, get 1 point. If you spin too fast, lose 2 points. If you finish the loop in 2 seconds, get 10 points."

The problem? Writing that rulebook is a nightmare.

It takes forever.
It's impossible to capture the "feeling" of a good stunt. A human might say, "That loop looked smooth and cool," but a robot rulebook can't easily measure "cool."
In this paper, the authors found that their hand-written rulebooks only agreed with human judges about 60% of the time. The robots were following the rules, but the humans thought the stunts looked jerky or ugly.

The New Idea: "The Taste Test"

Instead of writing a rulebook, the authors decided to use Preference-Based Learning.

Think of it like a cooking competition. Instead of giving the chef a recipe with exact measurements, you just show them two dishes and ask, "Which one tastes better?"

Scenario A: The drone flies a loop.
Scenario B: The drone flies a slightly different loop.
The Judge (Human or Computer): "I like Scenario A better."

The drone learns from these "A vs. B" choices. Over time, it figures out what looks good without ever being told the specific rules of physics or geometry.

The Problem with the "Taste Test"

There's a catch. Sometimes, two loops look almost the same. The judge might be unsure. "Hmm, maybe A is better, but B isn't bad either."
If you treat the judge's answer as a hard fact ("A is definitely better!"), the robot gets confused and starts guessing wildly. It's like a student who memorizes the answer key but doesn't understand the math, so they fail when the test changes slightly.

The Solution: REC (The "Confident Committee")

The authors created a new method called REC (Reward Ensemble under Confidence). Here is how it works, using a simple analogy:

Imagine you are hiring a team of 10 expert judges instead of just one.

The Committee: When the drone flies two loops, all 10 judges vote on which is better.
The Disagreement: Sometimes, 9 judges say "Loop A," but 1 judge says "Loop B." Or maybe they are all split 50/50.
The Magic: In the old method, the robot ignored this disagreement. In REC, the robot pays attention to the disagreement.
- If the judges all agree, the robot says, "Okay, I know what to do."
- If the judges are confused (high disagreement), the robot says, "I'm not sure yet! I need to try more weird things to figure out what the judges actually like."

This "confusion" actually helps the robot explore new, cool moves instead of getting stuck doing the same boring thing.

The Results: From Simulation to Real Life

The team tested this on a tiny drone (about the weight of a large apple).

The Old Way (Standard Preference Learning): The drone managed to do the stunt, but it was shaky and reached only about 55% of the score achieved by a drone trained with a carefully hand-written rulebook.
The New Way (REC): The drone became much more stable and impressive, reaching about 88% of that hand-written-rulebook score.

The coolest part? They trained the drone in a video game (simulation) using these "A vs. B" votes, and then plugged it straight into the real world without any extra tuning. The drone successfully flew complex loops and even learned a new "Figure-8" stunt just by being told what looked good.

Why This Matters

This paper suggests that we don't need to be math geniuses writing complex code to teach robots cool skills. We just need to be good judges who can say, "That one looks better."

By building a system that understands when we (the judges) are unsure, the robot learns faster, makes fewer mistakes, and can even learn new tricks that no human ever explicitly taught it to do. It's the difference between a robot that follows a rigid script and a robot that has a sense of style.

1. Problem Statement

The paper addresses the challenge of training autonomous drones to perform complex acrobatic maneuvers (e.g., powerloops, figure-8s) using Reinforcement Learning (RL).

The Reward Design Bottleneck: Traditional RL relies on manually engineered reward functions. For acrobatic flight, defining these rewards is difficult because desirable behaviors depend on subjective qualities like smoothness, timing, and visual style rather than simple geometric metrics.
Limitations of Hand-Crafted Rewards: The authors demonstrate that even carefully designed reward functions align with human judgment only 60.7% of the time. This misalignment leads to policies that may optimize for the wrong metrics (e.g., jerky movements that satisfy a mathematical formula but look poor to a human).
Uncertainty in Preferences: Existing Preference-based RL (PbRL) methods often treat preference labels as deterministic. However, when two trajectories are of similar quality, human (or synthetic) preferences are inherently noisy and uncertain. Ignoring this uncertainty leads to unstable training and brittle policies.
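For reference, the deterministic treatment described above can be sketched with the standard Bradley-Terry preference model and a hard-label cross-entropy loss. This is a minimal illustration, not the authors' code: the function names and the example returns are hypothetical.

```python
import math

def bradley_terry_prob(return_1: float, return_2: float) -> float:
    """P(trajectory 1 is preferred) as a logistic function of the return gap."""
    return 1.0 / (1.0 + math.exp(-(return_1 - return_2)))

def preference_loss(prob_1_preferred: float, label: int) -> float:
    """Binary cross-entropy against a hard label (1 if trajectory 1 was chosen, else 0)."""
    eps = 1e-12  # clamp to keep log() finite
    p = min(max(prob_1_preferred, eps), 1.0 - eps)
    return -(label * math.log(p) + (1 - label) * math.log(1.0 - p))

# Two nearly identical trajectories (illustrative returns, not from the paper).
p = bradley_terry_prob(10.1, 10.0)
loss_if_a_chosen = preference_loss(p, 1)
loss_if_b_chosen = preference_loss(p, 0)
```

With near-equal returns the predicted probability sits close to 0.5, yet a hard label treats whichever trajectory the judge happened to pick as certainly better, and a noisy label on such a pair pushes the reward model in an arbitrary direction.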

2. Methodology: Reward Ensemble under Confidence (REC)

The authors propose REC, a probabilistic framework for PbRL that explicitly models reward uncertainty to improve stability and exploration.

Core Components:

Distributional Reward Models (Ensemble):
- Instead of a single deterministic reward predictor, REC uses an ensemble of $N$ Multi-Layer Perceptrons (MLPs).
- Each member predicts a scalar reward. The ensemble statistics (mean $\mu$ and standard deviation $\sigma$) are used to model the reward at each timestep as a Gaussian distribution: $r \sim \mathcal{N}(\mu, \sigma^2)$.
- This allows the system to quantify uncertainty: high $\sigma$ indicates the model is unsure about the reward for a specific state-action pair.
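The ensemble statistics described above can be sketched as follows. This is a toy illustration in which each ensemble member is represented only by its scalar prediction for one timestep; the helper name is hypothetical, and the use of the population standard deviation is an assumption (the summary does not say which estimator is used).

```python
import statistics

def ensemble_reward_stats(member_predictions):
    """Return (mu, sigma) for one timestep from the per-member scalar rewards."""
    mu = statistics.fmean(member_predictions)
    # Assumption: population standard deviation across ensemble members.
    sigma = statistics.pstdev(member_predictions)
    return mu, sigma

# All members agree: zero spread, so the model is confident.
agree = ensemble_reward_stats([1.0, 1.0, 1.0, 1.0])
# Members disagree: same mean, but a large spread signals uncertainty.
disagree = ensemble_reward_stats([0.0, 2.0, 0.0, 2.0])
```

Both cases have the same mean reward, so a single deterministic predictor could not tell them apart; only the spread distinguishes the confident case from the uncertain one.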
Probabilistic Preference Loss:
- The standard Bradley-Terry model (softmax) is replaced with a Gaussian Cumulative Distribution Function (CDF).
- The probability that trajectory $\tau_1$ is preferred over $\tau_2$ is calculated based on the difference in their mean rewards and the sum of their variances:
  $P(\tau_1 > \tau_2) = \Phi\left(\frac{\mu(\tau_1) - \mu(\tau_2)}{\sqrt{\sigma(\tau_1)^2 + \sigma(\tau_2)^2}}\right)$
- This formulation naturally incorporates uncertainty; if the model is uncertain (high variance), the preference probability moves closer to 0.5, preventing overconfident updates on ambiguous data.
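The Gaussian CDF preference probability above can be evaluated with the standard library alone. The sketch below takes the trajectory-level means and standard deviations as given inputs; the small `eps` in the denominator is an added assumption to avoid division by zero, and the numeric inputs are illustrative rather than taken from the paper.

```python
import math

def gaussian_cdf(z: float) -> float:
    """Standard normal CDF, Phi(z), written in terms of the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def preference_prob(mu_1, sigma_1, mu_2, sigma_2, eps=1e-8):
    """P(tau_1 preferred) = Phi((mu_1 - mu_2) / sqrt(sigma_1^2 + sigma_2^2))."""
    # eps is an assumption added here to keep the ratio finite when both sigmas are 0.
    denom = math.sqrt(sigma_1 ** 2 + sigma_2 ** 2) + eps
    return gaussian_cdf((mu_1 - mu_2) / denom)

# Same gap in mean reward, different uncertainty (illustrative numbers).
confident = preference_prob(12.0, 0.5, 10.0, 0.5)
uncertain = preference_prob(12.0, 5.0, 10.0, 5.0)
tie = preference_prob(10.0, 1.0, 10.0, 1.0)
```

With the same gap in mean reward, the low-variance case gives a probability near 1, while the high-variance case stays much closer to 0.5, which is the softening effect the formulation is meant to provide.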
Uncertainty-Aware Reward Aggregation:
- To drive exploration, the reward signal for policy optimization is augmented with a "noise bonus" derived from ensemble disagreement.
- The aggregated reward includes an absolute value of a random variable drawn from a distribution centered on the ensemble's variance. This encourages the agent to visit states where the reward model is uncertain (high disagreement), facilitating better exploration in complex dynamics.
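One way to picture the noise bonus is sketched below. The summary does not give the exact formula, so this is only an assumed reading: the bonus is the absolute value of a zero-mean Gaussian sample whose spread equals the ensemble standard deviation, added to the ensemble mean. The function name and the choice of distribution are assumptions and may differ from the paper.

```python
import random

def aggregate_reward(mu: float, sigma: float, rng: random.Random) -> float:
    """Ensemble mean plus a non-negative bonus that grows with disagreement."""
    # Assumption: noise ~ N(0, sigma^2); the absolute value keeps the bonus >= 0.
    noise = rng.gauss(0.0, sigma)
    return mu + abs(noise)

rng = random.Random(0)
# Same mean reward, low versus high ensemble disagreement (illustrative values).
low = [aggregate_reward(1.0, 0.1, rng) for _ in range(1000)]
high = [aggregate_reward(1.0, 2.0, rng) for _ in range(1000)]
mean_low = sum(low) / len(low)
mean_high = sum(high) / len(high)
```

Under this reading, states where the ensemble disagrees receive a larger reward on average, so the policy is nudged toward regions the reward model has not yet pinned down.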
Ensemble Resetting Mechanism:
- To prevent ensemble members from collapsing into identical predictions (which would eliminate uncertainty estimates), the worst-performing members of the ensemble are re-initialized before each retraining phase. This maintains diversity and ensures robust uncertainty estimation.
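The resetting step can be sketched as a simple selection over per-member scores. In this toy version each member is a short list of weights, "worst-performing" is assumed to mean highest loss on the preference data, and the number of members to reset is a free parameter; all names and numbers are hypothetical.

```python
import random

def reset_worst_members(members, losses, num_reset, init_member):
    """Re-initialize the num_reset members with the highest loss.

    Returns the refreshed ensemble and the sorted indices that were reset.
    """
    order = sorted(range(len(members)), key=lambda i: losses[i], reverse=True)
    worst = order[:num_reset]
    refreshed = list(members)
    for i in worst:
        refreshed[i] = init_member()  # fresh random weights restore diversity
    return refreshed, sorted(worst)

rng = random.Random(0)

def init_member():
    # Stand-in for initializing an MLP: three random weights.
    return [rng.uniform(-1.0, 1.0) for _ in range(3)]

members = [init_member() for _ in range(4)]
losses = [0.2, 0.9, 0.3, 0.7]  # illustrative per-member losses
new_members, reset_idx = reset_worst_members(members, losses, 2, init_member)
```

Here members 1 and 3 have the highest losses and are replaced, while members 0 and 2 are carried over unchanged into the next retraining phase.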

3. Key Contributions

REC Framework: A novel probabilistic PbRL framework that models per-timestep reward uncertainty via an ensemble and propagates it through a Gaussian CDF preference model.
Performance Improvement: REC achieves 88.4% of the performance of a hand-crafted reward baseline on acrobatic quadrotor control, significantly outperforming standard Preference PPO (55.2%).
Sim-to-Real Transfer: The authors successfully transfer policies trained purely on preference feedback (both synthetic and human) to a real 220g quadrotor without any fine-tuning (zero-shot).
Discovery of Novel Skills: Using only human preference feedback, the system learned a vertical Figure-8 (double powerloop), a maneuver not explicitly defined in the reward function.
Empirical Validation of Reward Limitations: The study quantifies the gap between manual reward engineering and human judgment (60.7% agreement), providing strong evidence for the necessity of preference-based approaches in subjective tasks.
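An agreement figure of this kind can be computed by checking, for each labeled pair, whether the hand-crafted reward ranks the two trajectories the same way the human did. The sketch below assumes that definition; the data layout, names, and toy values are hypothetical and do not come from the paper.

```python
def agreement_rate(reward_returns, human_labels):
    """Fraction of pairs where the reward's ranking matches the human's choice.

    reward_returns: list of (return_a, return_b) under the hand-crafted reward.
    human_labels:   list of "a" or "b", the trajectory the human preferred.
    """
    matches = 0
    for (return_a, return_b), label in zip(reward_returns, human_labels):
        reward_choice = "a" if return_a > return_b else "b"
        matches += reward_choice == label
    return matches / len(human_labels)

# Toy data: the reward and the human agree on 3 of 5 pairs.
toy_returns = [(1.0, 0.0), (0.0, 1.0), (2.0, 1.0), (1.0, 3.0), (5.0, 4.0)]
toy_labels = ["a", "a", "a", "b", "b"]
rate = agreement_rate(toy_returns, toy_labels)
```

A rate of 0.5 would correspond to chance level for binary choices, which gives a reference point for reading the reported 60.7%.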

4. Experimental Results

Simulation (Quadrotor):
- Task: Continuous powerloop.
- Metric: REC achieved a mean evaluation reward of 382.4, compared to 238.9 for standard Preference PPO.
- Stability: REC showed significantly lower variance across training seeds, indicating more reliable convergence.
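As a consistency check, the mean rewards above can be related to the percentages reported among the key contributions (88.4% for REC, 55.2% for Preference PPO). The hand-crafted baseline's reward is not stated in this summary, so the value below is derived arithmetic, not a reported result.

```python
rec_reward = 382.4     # REC mean evaluation reward (reported)
ppo_reward = 238.9     # Preference PPO mean evaluation reward (reported)
rec_fraction = 0.884   # REC as a fraction of the hand-crafted baseline (reported)
ppo_fraction = 0.552   # Preference PPO as a fraction of the baseline (reported)

# Derived, not reported: the baseline reward implied by each pair of numbers.
implied_baseline_from_rec = rec_reward / rec_fraction
implied_baseline_from_ppo = ppo_reward / ppo_fraction

# Relative improvement of REC over Preference PPO in raw reward.
rec_over_ppo = rec_reward / ppo_reward
```

Both pairs imply a baseline reward of roughly 433, so the two sets of figures are mutually consistent, and REC's raw reward is about 1.6 times that of standard Preference PPO.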
Simulation (Continuous Control Benchmark):
- Tested on the Walker-Walk task from DM Control Suite.
- REC components (probabilistic loss and reward noise) consistently improved performance over the baseline, with ensemble resetting reducing variance.
Real-World Deployment:
- Policies were deployed on a physical drone.
- Human Preference: A policy trained on 1,000 human-labeled trajectory pairs successfully executed continuous powerloops on the real drone.
- Novel Skill: A policy trained on human preferences for a "vertical Figure-8" successfully executed the maneuver on the real hardware, demonstrating the framework's ability to learn complex, non-intuitive skills without explicit reward engineering.

5. Significance and Conclusion

This work represents a significant step forward in autonomous aerial robotics and preference-based learning.

Bridging the Gap: It demonstrates that PbRL is not just a theoretical alternative but a practical solution for high-dynamic, real-world tasks where defining reward functions is intractable.
Handling Subjectivity: By modeling uncertainty, REC handles the inherent noise in human judgment, making it robust for tasks where "good" is subjective (e.g., aesthetics of flight).
Generalizability: The success on both aerial robotics and a standard continuous control benchmark suggests that REC may generalize to other domains that require learning complex, subjectively judged behaviors.
Future Impact: The ability to learn complex acrobatic skills purely from comparative feedback (even from non-expert humans) lowers the barrier to entry for developing advanced autonomous flight behaviors, reducing the need for expensive and time-consuming reward engineering.
