Hypernetwork-Conditioned Reinforcement Learning for Robust Control of Fixed-Wing Aircraft under Actuator Failures

This paper proposes a reinforcement learning-based path-following controller for fixed-wing small uncrewed aircraft systems that utilizes hypernetwork-conditioned adaptation (via FiLM and LoRA) to achieve robustness against actuator failures and generalize effectively to time-varying fault modes not seen during training.

Dennis Marquis, Mazen Farhood

Published 2026-04-07

Imagine you are teaching a drone to fly a specific path, like a pilot following a race track. Usually, you train the drone in a simulator where everything works perfectly: the control surfaces deflect just right, the rudder turns smoothly, and the wind is predictable.

But in the real world, things go wrong. An aileron might get stuck halfway, a rudder might jam, or the wind might suddenly change direction. A standard "smart" drone (trained with standard Reinforcement Learning) is like a student who memorized the answers to a specific math test. If you ask them a slightly different question, they panic and fail. They try to force the same old solution onto a broken machine, which often leads to a crash.

This paper introduces a new way to train drones so they don't just memorize, but adapt on the fly. Here is how they did it, using some simple analogies.

1. The Problem: The "One-Size-Fits-All" Brain

A standard AI controller is like a single, rigid brain: one set of rules for everything.

  • The Issue: If the drone's right aileron gets stuck, the physics of flight change completely. The "rigid brain" tries to use its old rules, which are now wrong. It's like trying to drive a car with a flat tire using the same steering technique you use on smooth pavement. You end up spinning out of control.
  • The Old Solution: You could build a different brain for every possible failure (a brain for a stuck left aileron, a brain for a stuck rudder, etc.). But there are too many ways a drone can break, so you'd need thousands of brains, which is too heavy and slow for a small drone.

2. The Solution: The "Swiss Army Knife" Brain (Hypernetworks)

The authors created a drone controller that acts like a Swiss Army Knife or a chameleon.

Instead of having one fixed brain, they built a Main Brain (the pilot) and a Smart Adapter (the hypernetwork).

  • The Main Brain: This is the part that actually flies the plane. It's good at flying, but it needs instructions on how to fly given the current situation.
  • The Smart Adapter: This is a tiny, fast computer that looks at the problem (e.g., "Oh, the rudder is stuck at 50%") and instantly tweaks the Main Brain's settings.

Think of it like a guitarist.

  • The Main Brain is the guitarist's hands and muscle memory.
  • The Smart Adapter is the guitarist looking at the sheet music and saying, "Okay, today we are playing in the key of C, so I need to shift my fingers slightly."
  • If the music changes to the key of G (a different failure), the adapter instantly tells the hands to shift again. The hands don't need to learn a new song; they just adjust their position.
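The guitarist analogy can be sketched in code. The following is a toy numpy sketch of the hypernetwork idea, not the paper's actual architecture: every dimension, name, and the shape of the fault descriptor here is made up for illustration. A tiny adapter network maps a fault descriptor to a tweak on the frozen main policy's weights.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: a 3-dim fault descriptor (e.g. which actuator,
# stuck position, severity) and a 4-input / 2-output control layer.
FAULT_DIM, OBS_DIM, ACT_DIM = 3, 4, 2

# The Smart Adapter (hypernetwork): maps the fault descriptor to a
# correction for the Main Brain's output layer.
H = rng.normal(0, 0.1, (OBS_DIM * ACT_DIM, FAULT_DIM))

# The Main Brain: weights learned for nominal (healthy) flight, then frozen.
base_W = rng.normal(0, 0.5, (ACT_DIM, OBS_DIM))

def conditioned_policy(obs, fault):
    delta_W = (H @ fault).reshape(ACT_DIM, OBS_DIM)  # adapter's tweak
    return np.tanh((base_W + delta_W) @ obs)         # Main Brain, adjusted

obs = rng.normal(size=OBS_DIM)
nominal = conditioned_policy(obs, np.zeros(FAULT_DIM))        # no fault: zero tweak
jammed = conditioned_policy(obs, np.array([1.0, 0.5, 1.0]))   # e.g. "rudder stuck"
```

With a zero fault descriptor the tweak vanishes and the policy behaves exactly as it did in healthy training; a nonzero descriptor shifts the same frozen weights, which is the "hands adjusting position without learning a new song" idea.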

3. The Two Tricks: FiLM and LoRA

The paper tests two specific ways to make this "adapter" work efficiently. They call them FiLM and LoRA.

  • FiLM (Feature-wise Linear Modulation): Imagine the Main Brain is a painting. FiLM is like a filter you slide over the painting. It doesn't repaint the whole thing; it just brightens the colors or shifts the contrast in specific areas to match the broken wing. It's a quick, lightweight adjustment.
  • LoRA (Low-Rank Adaptation): Imagine the Main Brain is a complex machine with millions of gears. LoRA is like adding a small, detachable gear to the machine. Instead of rebuilding the whole engine, you just snap on a tiny extra gear that changes how the engine handles the broken wing. It's very efficient and uses very little space.
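The two adapters above can be written in a few lines each. This is a minimal numpy sketch of the generic FiLM and LoRA mechanisms, not the paper's implementation; the layer sizes, rank, and "fault" parameter values are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
D_IN, D_OUT, RANK = 6, 6, 2            # hypothetical sizes; LoRA rank 2

W = rng.normal(0, 0.4, (D_OUT, D_IN))  # one frozen "Main Brain" layer
x = rng.normal(size=D_IN)              # a hidden activation passing through it

def film(x, gamma, beta):
    # FiLM: per-feature scale (gamma) and shift (beta) slide over the
    # layer's output like a filter; W itself is never repainted.
    return gamma * (W @ x) + beta

def lora(x, A, B):
    # LoRA: B @ A is a rank-RANK "extra gear" snapped onto the frozen W.
    return (W + B @ A) @ x

# A hypernetwork would emit these from the fault context; here they are
# just illustrative values for a hypothetical jammed-rudder fault.
gamma = 1.0 + rng.normal(0, 0.1, D_OUT)
beta = rng.normal(0, 0.1, D_OUT)
A = rng.normal(0, 0.1, (RANK, D_IN))
B = rng.normal(0, 0.1, (D_OUT, RANK))

film_out = film(x, gamma, beta)   # 2 * D_OUT adapter parameters per layer
lora_out = lora(x, A, B)          # RANK * (D_IN + D_OUT) adapter parameters
```

The parameter counts in the comments show why both are "lightweight": for this layer, FiLM adds 12 numbers and LoRA adds 24, versus the 36 weights in W itself, and the gap widens rapidly for larger layers.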

4. The Training: Learning to Handle "Flutter"

The researchers didn't just train the drone on stuck control surfaces; they trained it on chaos.

  • Static Failures: A control surface gets stuck and stays stuck. (Easy to predict.)
  • Flutter (The Real Test): The surface starts shaking, jamming, and un-jamming rapidly, like a butterfly flapping its wings. This is a nightmare for standard AI.
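A flutter-style fault can be modeled as a simple time-varying jam. This is a toy sketch of that idea; the jam period, duty cycle, and stuck value below are made-up parameters, not the paper's fault model.

```python
import numpy as np

def apply_fault(command, t, stuck_value=0.5, jam_period=0.4, duty=0.5):
    """Hypothetical flutter fault on one actuator: the surface alternates
    between obeying the commanded deflection and jamming at a fixed value."""
    jammed = (t % jam_period) < duty * jam_period
    return stuck_value if jammed else command

# What the controller asks for vs. what the fluttering surface actually does.
times = np.arange(0.0, 2.0, 0.1)
commanded = np.sin(times)                              # smooth commanded deflection
actual = [apply_fault(c, t) for c, t in zip(commanded, times)]
```

From the controller's point of view, the mapping from command to deflection keeps switching every fraction of a second, which is exactly why a policy with one fixed set of weights struggles while a context-conditioned one can keep re-tuning itself.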

The Results:

  • The Standard Drone (MLP): When the rudder started "fluttering," the standard drone panicked. It tried to over-correct, spun out, and crashed (or flew 160 meters off course). It was like a driver trying to steer a car with a shaking steering wheel by gripping it tighter and tighter until they broke the wheel.
  • The Hypernetwork Drone: When the rudder started shaking, the "Smart Adapter" instantly noticed the change. It tweaked the Main Brain's settings to compensate for the shaking. The drone wobbled a bit but stayed on the path, never losing control. It was like a driver who feels the wheel shaking and instinctively loosens their grip, making small, smooth corrections to stay steady.

5. Why This Matters

This paper proves that by giving AI a "Smart Adapter," we can make drones (and potentially other robots) much safer.

  • Efficiency: The adapter is tiny. It doesn't make the drone heavy or slow.
  • Generalization: The drone didn't just learn to handle one broken actuator; it learned how to handle any actuator failure, even ones it had never seen before, and even ones that fluttered uncontrollably.

In a nutshell:
Instead of teaching a robot to memorize every possible disaster, the authors taught it how to adapt. They gave it a "chameleon brain" that can instantly reconfigure itself when things go wrong, keeping the aircraft safe even when the hardware is failing.
