Agile in the Face of Delay: Asynchronous End-to-End Learning for Real-World Aerial Navigation

Imagine you are driving a race car through a dense, twisting forest. You have two main tools:

Your eyes (Perception): They see the trees, but they are slow. Maybe you only get a clear, high-definition picture of the path 10 times a second.
Your reflexes (Control): Your hands on the steering wheel need to make tiny, rapid adjustments 100 times a second to keep the car from crashing.

The Problem:
In most autonomous drones today, your reflexes are forced to wait for your eyes. If your eyes only update 10 times a second, your steering wheel can only move 10 times a second. This is like trying to drive a Formula 1 car while only being allowed to turn the wheel a tenth as often as you need to. You would likely crash because you can't react fast enough to sudden obstacles.

The Solution: "The Asynchronous Pilot"
This paper introduces a new way to fly drones that breaks this bottleneck. Instead of making the fast reflexes wait for the slow eyes, they are decoupled.

Here is how it works, using a simple analogy:

1. The "Stale Map" Problem

Imagine you are driving blindfolded, but every 0.1 seconds, a friend shouts out, "There's a tree 5 meters ahead!"

The Old Way: You wait for the friend to shout before you turn the wheel. If the friend is slow, you drive slowly.
The New Way: You turn the wheel 100 times a second based on your best guess. But here's the catch: The last time your friend shouted may have been up to 0.1 seconds ago. The tree might have moved, and you are certainly closer to it now. Your information is "stale."

2. The Secret Weapon: The "Time-Traveler's Hat" (Temporal Encoding Module)

This is the paper's biggest innovation. The drone doesn't just ignore the fact that its map is old. Instead, it wears a special "Time-Traveler's Hat" (called the Temporal Encoding Module).

How it works: The drone knows exactly how old its last picture is. If the last picture was 0.2 seconds old, the "Hat" tells the brain: "Hey, we are moving. If we were 5 meters from that tree 0.2 seconds ago, and we are flying at 2 meters per second, we have covered about 0.4 meters since then, so the tree is now roughly 4.6 meters away."
The Result: The drone learns to "predict" where the obstacles are right now, even though its latest picture is out of date. It compensates for the delay, allowing it to fly fast and with agility without crashing.

3. The Training: "Practice with a Stopwatch"

You can't just teach a drone to fly this way instantly. It's like teaching a human to drive a race car while wearing a blindfold that opens and closes randomly.

Stage 1 (The Ideal World): First, the drone trains in a simulation where the "eyes" are perfect and instant. It learns the basics of not crashing.
Stage 2 (The Real World): Then, the simulation introduces the "delay." The eyes become slow. The drone has to learn to use its "Time-Traveler's Hat" to guess where things are. Because it already knows how to fly (from Stage 1), it quickly learns how to handle the lag.

The Real-World Test

The team tested this on a real drone (a quadcopter) flying through a cluttered forest and an indoor obstacle course.

The Hardware: The drone had a small computer on board (not a supercomputer).
The Sensors: It used a LiDAR sensor that updates slowly (10 times a second).
The Result: Despite the slow sensors, the drone's "brain" made steering decisions 100 times a second. It flew through dense trees, dodged obstacles, and didn't crash. It did this without any human tweaking after the simulation training (a "zero-shot" transfer).

Why This Matters

Previously, if you wanted a drone to fly fast and safely, you needed expensive, heavy computers and perfect sensors. This paper shows that you can use cheap, slow sensors and small computers if you teach the drone to understand time.

In a nutshell:
They taught a drone to fly fast by giving it a "time machine" in its brain. This allows it to guess where obstacles are right now, even though its "eyes" are looking at the past. This can make drones faster, safer, and cheaper to build.

1. Problem Statement

The core challenge addressed is the temporal mismatch between high-frequency control requirements and low-frequency perception in Autonomous Aerial Vehicles (AAVs).

The Conflict: Agile flight requires control loops running at high frequencies (e.g., 100 Hz) to react instantly to disturbances. However, perception sensors (LiDAR, cameras) have low native update rates (e.g., 10 Hz) and heavy computational costs for processing.
The Limitation of Current Methods: Conventional end-to-end reinforcement learning (RL) models typically operate synchronously, meaning the control policy updates only when new perception data arrives. This forces the control frequency down to the sensor rate, compromising agility and safety.
The Consequence: If the system decouples perception and control so that control can run at high frequency, the policy must act on "stale" perception data (high Age of Information, or AoI). This creates partial observability, as the environment and the vehicle's own pose may have changed since the last sensor update, leading to decision-making errors.
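To make the staleness concrete, the sketch below computes the AoI seen at each control tick when a 100 Hz control loop reads from a 10 Hz sensor. It is an illustration only: it assumes scans arrive exactly on control ticks with zero processing latency, and the function name age_of_information and its parameters are hypothetical, not taken from the paper.

```python
def age_of_information(control_hz=100, sensor_hz=10, duration_s=0.3):
    """Return (time, AoI) pairs, one per control tick.

    Assumes control_hz is an integer multiple of sensor_hz and that each
    scan is available the instant it is captured (no processing delay).
    """
    ticks_per_scan = control_hz // sensor_hz
    samples = []
    for k in range(int(round(duration_s * control_hz))):
        # Index of the control tick at which the most recent scan arrived.
        last_scan_tick = (k // ticks_per_scan) * ticks_per_scan
        t = k / control_hz
        aoi = (k - last_scan_tick) / control_hz
        samples.append((t, aoi))
    return samples


samples = age_of_information()
max_aoi = max(aoi for _, aoi in samples)
```

Under these assumptions the AoI follows a sawtooth: it is zero on the tick a scan arrives, grows by one control period per tick, and peaks one tick before the next scan. Any real processing or transport latency would shift the whole sawtooth upward.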

2. Methodology

The authors propose an Asynchronous End-to-End Learning Framework that decouples the perception and control loops while explicitly modeling the resulting data staleness.

A. System Architecture

Low-Frequency Perception Pipeline: Raw LiDAR point clouds are converted into a structured 2D Pseudo-Image using spherical coordinate projection. A Convolutional Neural Network (CNN) processes this image to extract spatial features. This module runs at the sensor's native rate (e.g., 10 Hz).
High-Frequency Control Loop: The control policy runs at a high frequency (e.g., 100 Hz). It does not wait for new LiDAR data. Instead, it uses the latest available perception features combined with high-frequency IMU state estimates (position, velocity, orientation).
Temporal Encoding Module (TEM): To handle the "stale" perception data, the framework introduces a theoretically grounded TEM.
- It calculates the Age of Information (AoI) ( $\Delta t_{lidar}$ ), defined as the time elapsed since the last perception measurement.
- The AoI is encoded (using a sinusoidal encoder) and concatenated with the state vector as an explicit input to the policy.
- Theoretical Basis: By conditioning the policy on the explicit delay, the network learns to predict how the environment has likely changed, reducing the conditional entropy of the state estimation and compensating for partial observability.
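The encoding step can be sketched as follows. The summary only states that a sinusoidal encoder is used and that the result is concatenated with the state vector; the number of frequencies, the base period, the state layout, and the names encode_aoi and build_policy_input are all assumptions made for illustration.

```python
import math


def encode_aoi(delta_t, num_freqs=4, max_period_s=0.2):
    """Map a scalar delay (seconds) to sin/cos features.

    num_freqs and max_period_s are hypothetical choices; the source does
    not specify the encoder's frequencies.
    """
    features = []
    for i in range(num_freqs):
        period = max_period_s / (2 ** i)  # halve the period at each level
        angle = 2.0 * math.pi * delta_t / period
        features.extend([math.sin(angle), math.cos(angle)])
    return features


def build_policy_input(state, delta_t):
    """Concatenate the state vector with the encoded AoI."""
    return list(state) + encode_aoi(delta_t)


# Hypothetical 6-element state: position (x, y, z) and velocity (vx, vy, vz).
state = [0.0, 0.0, 1.5, 2.0, 0.0, 0.0]
policy_input = build_policy_input(state, delta_t=0.05)
```

With these placeholder settings the policy input has 14 elements: 6 state values plus 8 encoding features. Using several frequencies gives the network both a coarse and a fine view of the delay, which is the usual motivation for sinusoidal encodings of a scalar.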

B. Training Strategy (Two-Stage Curriculum)

To ensure stable training despite the asynchronous nature, a two-stage curriculum is employed:

Synchronous Training Stage: The agent is trained in a simulator with ideal, high-frequency perception (AoI = 0). This establishes a robust baseline navigation capability.
Asynchronous Training Stage: The training shifts to a realistic setting where perception is low-frequency (AoI > 0 and time-varying). The policy learns to utilize the TEM to adapt to varying delays, leveraging the "warm start" from the first stage.
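A minimal sketch of how such a curriculum might be organized is shown below. The iteration counts, the StageConfig type, and the update_fn callback are hypothetical; the only details taken from the text are the ordering (synchronous first, then asynchronous) and the reuse of the stage-one policy as a warm start.

```python
from dataclasses import dataclass


@dataclass
class StageConfig:
    name: str
    perception_hz: float
    iterations: int  # hypothetical value, not reported in the summary


CONTROL_HZ = 100.0
CURRICULUM = [
    StageConfig("synchronous", perception_hz=100.0, iterations=1000),
    StageConfig("asynchronous", perception_hz=10.0, iterations=1000),
]


def max_aoi(stage):
    """Largest AoI at a control tick, ignoring processing latency."""
    ticks_per_scan = round(CONTROL_HZ / stage.perception_hz)
    return (ticks_per_scan - 1) / CONTROL_HZ


def train(curriculum, update_fn, params):
    """Run the stages in order, carrying params across stages (warm start)."""
    for stage in curriculum:
        for _ in range(stage.iterations):
            params = update_fn(params, stage)
    return params


# Stand-in update that only counts calls; a real update_fn would run one
# RL optimization step in a simulator configured for the given stage.
total_updates = train(CURRICULUM, lambda params, stage: params + 1, 0)
```

The key design point is that params is never reset between stages, so the asynchronous stage starts from a policy that can already navigate and only has to learn to cope with delay.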

C. Reward Function

The RL objective maximizes cumulative rewards composed of:

Static Safety: Penalizes proximity to obstacles based on LiDAR data.
Velocity: Encourages moving toward the goal at a desired speed.
Constraints: Penalties for altitude deviation and excessive pitch/roll angles.
Terminal Rewards: Positive reward for reaching the goal; negative penalties for collisions or boundary violations.
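The four terms could be combined along the following lines. This is a sketch of the general structure only: the summary gives no weights, thresholds, or functional forms, so every constant and formula below (safe_dist, desired_speed, max_tilt, terminal_bonus, and the shape of each term) is a placeholder.

```python
def step_reward(min_obstacle_dist, speed_toward_goal, altitude_error,
                roll, pitch, reached_goal=False, collided=False,
                out_of_bounds=False):
    """Sum of safety, velocity, constraint, and terminal terms."""
    # Hypothetical constants; the source does not report these values.
    safe_dist = 1.0        # m, distance below which proximity is penalized
    desired_speed = 2.0    # m/s
    max_tilt = 0.5         # rad, allowed roll/pitch before a penalty applies
    terminal_bonus = 10.0

    # Static safety: penalize being closer than safe_dist to any obstacle.
    safety = -max(0.0, safe_dist - min_obstacle_dist)
    # Velocity: highest when the goal-directed speed matches the target.
    velocity = 1.0 - abs(desired_speed - speed_toward_goal) / desired_speed
    # Constraints: altitude deviation and excessive roll/pitch.
    constraints = -(abs(altitude_error)
                    + max(0.0, abs(roll) - max_tilt)
                    + max(0.0, abs(pitch) - max_tilt))
    # Terminal: bonus for reaching the goal, penalty for failure.
    terminal = 0.0
    if reached_goal:
        terminal += terminal_bonus
    if collided or out_of_bounds:
        terminal -= terminal_bonus
    return safety + velocity + constraints + terminal


cruise = step_reward(5.0, 2.0, 0.0, 0.0, 0.0)
near_miss = step_reward(0.4, 2.0, 0.0, 0.0, 0.0)
crash = step_reward(0.0, 2.0, 0.0, 0.0, 0.0, collided=True)
```

With these placeholder values, flying at the desired speed in open space scores higher than passing close to an obstacle, which in turn scores higher than a collision.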

3. Key Contributions

Novel Asynchronous Architecture: A decoupled end-to-end network that enables high-frequency control (100 Hz) using low-frequency sensors, breaking the traditional bottleneck where control rate is limited by sensor rate.
Temporal Encoding Module (TEM): A principled mechanism that explicitly encodes the "Age of Information" (AoI) into the policy input. This allows the agent to reason about data staleness and compensate for partial observability without relying solely on implicit memory.
Two-Stage Curriculum Learning: A training strategy that transitions from synchronous to asynchronous learning, ensuring stable convergence and enabling successful zero-shot sim-to-real transfer.
Efficient Perception Processing: A lightweight LiDAR processing module (Pseudo-Image generation) that minimizes latency, making real-time onboard inference feasible on resource-constrained hardware.
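The pseudo-image generation mentioned above can be illustrated with a small spherical projection. This is a generic range-image sketch, not the authors' implementation: the image size, elevation limits, maximum range, and the choice to keep the nearest return per pixel are assumptions.

```python
import math


def spherical_pseudo_image(points, height=16, width=64,
                           elev_min=math.radians(-45.0),
                           elev_max=math.radians(45.0),
                           max_range=30.0):
    """Project (x, y, z) points into a height x width range image.

    Rows index elevation (top row = elev_max), columns index azimuth.
    Empty pixels hold max_range. All defaults are placeholder values.
    """
    image = [[max_range] * width for _ in range(height)]
    for x, y, z in points:
        r = math.sqrt(x * x + y * y + z * z)
        if r == 0.0 or r > max_range:
            continue
        azimuth = math.atan2(y, x)       # in [-pi, pi]
        elevation = math.asin(z / r)     # in [-pi/2, pi/2]
        if not (elev_min <= elevation <= elev_max):
            continue
        col = int((azimuth + math.pi) / (2.0 * math.pi) * width)
        row = int((elev_max - elevation) / (elev_max - elev_min) * height)
        col = min(col, width - 1)        # clamp the azimuth == pi edge case
        row = min(row, height - 1)       # clamp the elevation == elev_min case
        # Keep the nearest return when several points share a pixel.
        image[row][col] = min(image[row][col], r)
    return image


# Two points straight ahead at 5 m and 3 m, plus a degenerate origin point.
image = spherical_pseudo_image([(5.0, 0.0, 0.0), (3.0, 0.0, 0.0), (0.0, 0.0, 0.0)])
```

The fixed-size grid is what lets a standard CNN consume the scan: an unordered point cloud of varying length becomes a dense 2D array with a stable layout.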

4. Results and Validation

Simulation Benchmarks

Robustness to Sensor Frequency: In a dense forest environment (0.2 obstacles/m²), the proposed method achieved a 91.08% success rate at 10 Hz perception, compared to 93.67% at 100 Hz. This is a drop of about 2.6 percentage points.
Comparison: Competing synchronous methods (e.g., NavRL) suffered an 11.6% performance drop when moving from 100 Hz to 10 Hz.
Ablation Studies: Removing the TEM caused a significant performance drop (8.4–9.7%), indicating that it plays a critical role in handling delays. The asynchronous method outperformed synchronous baselines by over 14 percentage points in high-speed, high-density scenarios.

Real-World Deployment (Zero-Shot Sim-to-Real)

Platform: Deployed on a custom quadrotor with an Intel NUC 13 and NVIDIA Jetson Orin NX, using a Livox Mid-360 LiDAR (10 Hz).
Performance: The drone successfully navigated cluttered indoor spaces and dense outdoor forests at an average speed of 1.3 m/s (max 2.0 m/s) without any fine-tuning.
Control Rate: The system maintained a stable 100 Hz control loop on the onboard computer, despite the 10 Hz sensor input.
Latency: Onboard processing latency was approximately 1.15 ms for perception and 0.27 ms for the control policy, well within the 10 ms budget for 100 Hz operation.
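The headroom implied by these figures can be checked with a few lines of arithmetic. The sketch assumes the worst case, in which perception and the control policy run back to back within a single control period; the variable names are illustrative.

```python
control_hz = 100
budget_ms = 1000.0 / control_hz          # time available per control tick

perception_ms = 1.15                     # reported perception latency
policy_ms = 0.27                         # reported control-policy latency

# Worst case: a new scan is processed in the same tick as a policy call.
worst_case_ms = perception_ms + policy_ms
utilization = worst_case_ms / budget_ms

print(f"budget {budget_ms:.2f} ms, worst case {worst_case_ms:.2f} ms, "
      f"utilization {utilization:.1%}")
```

Even in that worst case the two stages together take about 1.42 ms, roughly 14% of the 10 ms period. On the nine ticks out of ten where no new scan arrives, only the policy runs, so the typical load is lower still.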

5. Significance

This work addresses a fundamental structural limitation in autonomous robotics: the trade-off between computational cost/sensor rates and control agility.

Practical Impact: It enables the deployment of agile, end-to-end navigation on small, resource-constrained AAVs without requiring expensive, high-bandwidth sensors or massive onboard compute.
Theoretical Advancement: It demonstrates that explicitly modeling the "Age of Information" allows agents to overcome the partial observability caused by asynchronous data streams, a problem previously often ignored in end-to-end learning.
Real-World Viability: The successful zero-shot transfer to physical hardware in unstructured environments (forests, indoor clutter) validates the framework's robustness and readiness for real-world applications.

Limitations: The current policy is purely reactive and lacks explicit trajectory prediction for high-speed dynamic agents. Additionally, sim-to-real gaps at very high speeds remain a challenge requiring better system identification.