OnFly: Onboard Zero-Shot Aerial Vision-Language Navigation toward Safety and Efficiency

OnFly is a fully onboard, real-time framework for zero-shot aerial vision-language navigation that employs a shared-perception dual-agent architecture, hybrid memory, and semantic-geometric verification to overcome existing limitations in decision stability and safety-efficiency trade-offs, achieving a task success rate of 67.8% in simulation.

Guiyong Zheng, Yueting Ban, Mingjie Zhang, Juepeng Zheng, Boyu Zhou

Published Thu, 12 Ma

Imagine you are teaching a drone to fly through a complex, unfamiliar city based only on a spoken sentence like, "Fly to the red mailbox, then circle the fountain, and stop near the blue bench."

This is the challenge of Aerial Vision-Language Navigation (AVLN). The drone needs to "see" the world, "understand" your words, and "fly" safely without crashing.

The paper introduces OnFly, a new system that acts like a super-smart, self-contained pilot for drones. Here is how it works, broken down into simple concepts and analogies.

The Problem: The "Overworked Driver"

Previous attempts to make drones do this had three main flaws, like a driver who is trying to do too many things at once:

  1. The "Multitasking Mess": Old systems tried to do two very different jobs at the same time: steering (high-speed, reflexive decisions) and checking the GPS map (slow, big-picture planning). Because these jobs happen at different speeds, the system got confused, stuttered, and made bad decisions.
  2. The "Short Memory": To know if it's making progress, the drone needed to remember the whole trip. But old systems used a "sliding window" memory—like a tape recorder that constantly erases the beginning of the song to make room for new notes. Eventually, the drone forgot where it started, got lost, and couldn't tell when to stop.
  3. The "Safety vs. Speed" Trap: To be safe, drones would hover and wait (very slow). To be fast, they would fly blindly toward a target (very dangerous). They couldn't be both safe and efficient.

The Solution: OnFly

OnFly solves these problems with three clever tricks, turning the drone into a highly organized team rather than a confused individual.

1. The "Dual-Agent" Team (The Driver and the Navigator)

Instead of one brain trying to do everything, OnFly splits the work into two specialized agents that share the same eyes but have different jobs:

  • The Driver (High-Frequency Agent): This agent is the reflex. It looks at the camera and instantly decides, "Turn left now," or "Go forward." It doesn't worry about the big picture; it just keeps the drone moving smoothly.
  • The Navigator (Low-Frequency Agent): This agent is the strategist. It checks the map every few seconds to ask, "Are we close to the mailbox yet? Did we get lost?"
  • The Magic: Because the two agents run at their own rates instead of competing for the same compute, the drone never stutters. The Driver keeps the drone steady, while the Navigator ensures they are going the right way.
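The split above is essentially a two-rate control loop: a fast reactive policy runs every tick, while a slower strategist replans only every few ticks. Here is a minimal Python sketch of that idea; the class names, the `replan_every` parameter, and the placeholder policies are invented for illustration and are not from the paper.

```python
class Driver:
    """High-frequency agent: reactive steering from the latest camera frame."""
    def step(self, frame):
        # Placeholder policy: in the real system a lightweight model would
        # pick an immediate action ("left", "right", "forward") from the frame.
        return "forward"

class Navigator:
    """Low-frequency agent: periodic progress checks against the instruction."""
    def replan(self, memory, instruction):
        # Placeholder: in the real system a vision-language model would reason
        # over the shared memory and return an updated sub-goal.
        return "head toward the mailbox"

def run(frames, instruction, replan_every=5):
    """Run both agents over a stream of frames at their own rates."""
    driver, navigator = Driver(), Navigator()
    memory, subgoal, actions = [], None, []
    for t, frame in enumerate(frames):
        memory.append(frame)                 # both agents share the same perception
        if t % replan_every == 0:            # Navigator runs at a lower frequency
            subgoal = navigator.replan(memory, instruction)
        actions.append(driver.step(frame))   # Driver runs every single tick
    return actions, subgoal
```

The key design point is that the Driver never blocks on the Navigator: even if replanning were slow, the reactive loop keeps producing an action every tick.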

2. The "Hybrid Memory" (The Photo Album vs. The Scrapbook)

To remember the whole journey without getting confused, OnFly uses a special memory system:

  • The Old Way: A "Sliding Window" is like a scrapbook where you keep tearing out old pages to add new ones. You lose the context of the start of the trip.

  • The OnFly Way: It uses a Hybrid Memory. Imagine a photo album that always keeps:

    1. The First Photo (where you started).
    2. Key Snapshots (important landmarks you passed).
    3. The Current View (what you see right now).

    This way, the drone never forgets where it began, but it also doesn't get bogged down by remembering every single second of the flight. Because the start of the memory (the "prefix") never changes, the onboard model can reuse earlier computation instead of re-processing the whole history every time, which makes it think much faster.
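The photo-album analogy can be sketched as a small data structure that keeps the first frame forever, a bounded set of keyframes, and the live view. This is an illustrative sketch only; the class name, the `max_keyframes` bound, and the method names are assumptions, not the paper's API.

```python
class HybridMemory:
    """Keeps (first frame, bounded keyframes, current frame).

    Unlike a sliding window, the start of the trip is never evicted,
    so the drone can always compare "now" against "where I began".
    """
    def __init__(self, max_keyframes=4):
        self.first = None        # the very first observation, kept forever
        self.keyframes = []      # important landmarks, bounded in size
        self.current = None      # the live view

    def observe(self, frame):
        if self.first is None:
            self.first = frame
        self.current = frame

    def mark_keyframe(self):
        """Promote the current view to a landmark snapshot."""
        self.keyframes.append(self.current)
        if len(self.keyframes) > 4:   # evict the oldest *keyframe* only,
            self.keyframes.pop(0)     # never the first frame

    def context(self):
        # Stable prefix (first + keyframes) followed by the live view.
        return [self.first] + self.keyframes + [self.current]
```

A sliding window, by contrast, would eventually call `pop(0)` on everything, including the first frame, losing the start of the trip.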

3. The "Safety Double-Check" (The Spotter and the Pilot)

When the AI suggests a target (e.g., "Fly to that tree"), it might be wrong. Maybe the tree is behind a wall, or the depth is wrong. OnFly adds a safety layer:

  • The Semantic-Geometric Verifier: Before the drone flies, this module acts like a spotter. It checks: "Does that tree actually look like the one in the instructions? Is there a wall in the way?" If the AI is hallucinating, this step corrects the target.
  • The Receding-Horizon Planner: Once the target is safe, a planner calculates the smoothest, crash-free path to get there, avoiding obstacles like a skilled pilot weaving through traffic.
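The verification step amounts to two independent gates on every proposed target: a semantic check (does it match the instruction well enough?) and a geometric check (is it physically reachable?). The sketch below is a hedged illustration of that gating logic; the `Candidate` fields, thresholds, and function names are invented for this example, not taken from the paper.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    label: str           # what the model thinks the object is
    match_score: float   # semantic similarity to the instruction phrase, 0..1
    depth_m: float       # estimated distance from the depth sensor, in meters
    line_of_sight: bool  # is the straight-line path to it free of obstacles?

def verify(candidate, min_score=0.6, max_depth_m=50.0):
    """Accept a proposed target only if it passes BOTH checks.

    - Semantic: the object must actually resemble what the instruction asked for.
    - Geometric: the depth must be plausible and the path unobstructed,
      catching cases where the model "hallucinates" a target behind a wall.
    """
    semantic_ok = candidate.match_score >= min_score
    geometric_ok = candidate.line_of_sight and 0.0 < candidate.depth_m <= max_depth_m
    return semantic_ok and geometric_ok
```

Only targets that pass `verify` would then be handed to the planner, which computes a smooth, collision-free path toward them.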

The Results: From "Clumsy" to "Pro"

The researchers tested OnFly in a virtual world and then on a real drone flying outdoors.

  • Success Rate: Old methods succeeded only about 26% of the time. OnFly succeeded 68% of the time.
  • Safety: It crashed significantly less often.
  • Speed: It flew more efficiently, not stopping and starting unnecessarily.

The Bottom Line

OnFly is like giving a drone a dedicated driver, a dedicated navigator, a perfect photo album memory, and a safety spotter, all running on a small computer attached to the drone itself. It allows drones to follow complex human instructions in the real world safely and efficiently, without needing a giant computer in the cloud to tell them what to do.