OpenFly: A comprehensive platform for aerial vision-language navigation

To address the lack of benchmarks for outdoor aerial Vision-Language Navigation, this paper introduces OpenFly, a comprehensive platform featuring a multi-engine simulation environment, an automated data collection toolchain, a large-scale 100k-trajectory dataset, and a specialized keyframe-aware agent model.

Yunpeng Gao, Chenhui Li, Zhongrui You, Junli Liu, Zhen Li, Pengan Chen, Qizhi Chen, Zhonghan Tang, Liansheng Wang, Penghui Yang, Yiwen Tang, Yuhang Tang, Shuai Liang, Songyi Zhu, Ziqin Xiong, Yifei Su, Xinyi Ye, Jianan Li, Yan Ding, Dong Wang, Xuelong Li, Zhigang Wang, Bin Zhao

Published 2026-03-03

Imagine you want to teach a drone to be a delivery driver or a search-and-rescue hero. You can't just tell it, "Go to the red building." You need to give it a full set of instructions like, "Fly up, turn left past the park, and stop when you see the big clock tower."

The problem? Teaching a drone this way is incredibly hard, expensive, and slow. Usually, humans have to manually fly the drone, record the path, and then write down the instructions for every single trip. It's like trying to teach a child to ride a bike by manually pushing them around the block for 100,000 different routes.

Enter "OpenFly." Think of OpenFly as a massive, automated "Drone School" simulator that solves all these problems at once.

Here is how it works, broken down into simple concepts:

1. The "World Builder" (The Rendering Engines)

To teach a drone, you need a place for it to fly. Most previous projects only had one or two digital worlds (like a single video game map).

  • The Analogy: Imagine trying to teach a driver to navigate only in a small, empty parking lot. They'd fail in the real city.
  • The OpenFly Solution: OpenFly is like a universal travel agency. It combines four different "worlds" into one platform:
    • Unreal Engine: High-tech, realistic city simulations.
    • GTA V: The famous video game world (Los Santos), which is surprisingly realistic.
    • Google Earth: Real satellite maps of cities like New York and Tokyo.
    • 3D Gaussian Splatting: A fancy new tech that takes real photos of actual university campuses and turns them into 3D digital twins.
  • The Result: The drone gets to practice in 18 different "cities," ranging from digital skyscrapers to real-world parks, ensuring it doesn't get confused when it sees something new.

2. The "Robot Teacher" (The Automatic Toolchain)

This is the magic part. Instead of humans manually flying the drone and writing instructions, OpenFly has a robotic assembly line.

  • Step 1: The Map Maker: The system scans the digital world and builds a 3D map of everything (buildings, trees, roads).
  • Step 2: The Path Finder: It automatically picks a starting point and a destination (like a "landmark"), then calculates a safe, collision-free flight path. It's like a GPS that instantly draws a line through a maze without hitting the walls.
  • Step 3: The Storyteller: This is the coolest part. The system takes the flight path and the images the drone "sees," then feeds them to an AI (like a super-smart chatbot). The AI looks at the path and writes the instructions: "Fly up, turn right toward the blue building, then go straight to the tower."
  • The Analogy: It's like having a robot director who films a movie, writes the script, and edits the scenes all by itself in seconds. In the past, humans had to do all three jobs.
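The "Map Maker" and "Path Finder" steps above can be sketched as a toy search over an occupancy grid. This is only an illustrative sketch, not the paper's actual toolchain (which plans in full 3D over scanned scenes): the grid, start point, and goal below are made up, and a simple breadth-first search stands in for the real planner.

```python
from collections import deque

def find_path(grid, start, goal):
    """Breadth-first search for a shortest collision-free path on a
    2D occupancy grid (0 = free space, 1 = obstacle). A toy stand-in
    for OpenFly's 3D collision-free path planning."""
    rows, cols = len(grid), len(grid[0])
    queue = deque([start])
    came_from = {start: None}  # predecessor of each visited cell
    while queue:
        cell = queue.popleft()
        if cell == goal:
            # Walk back through predecessors to recover the route.
            path = []
            while cell is not None:
                path.append(cell)
                cell = came_from[cell]
            return path[::-1]
        r, c = cell
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if (0 <= nr < rows and 0 <= nc < cols
                    and grid[nr][nc] == 0 and (nr, nc) not in came_from):
                came_from[(nr, nc)] = cell
                queue.append((nr, nc))
    return None  # no collision-free route exists

# A tiny made-up "city block": 1s are buildings the drone must avoid.
city = [
    [0, 0, 0, 1, 0],
    [1, 1, 0, 1, 0],
    [0, 0, 0, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
]
route = find_path(city, (0, 0), (4, 4))
print(route)  # a list of grid cells from start to goal, never touching a 1
```

Because BFS explores cells in order of distance, the first time it reaches the goal it has found a shortest route, which is the "instantly draws a line through a maze" behavior described above.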

3. The "Library" (The Dataset)

Because the "Robot Teacher" works so fast, OpenFly has built the largest library of drone flight data in the world.

  • The Scale: It contains 100,000 flight paths. Previous libraries only had about 10,000.
  • The Variety: It covers 15,600 different words and phrases, so the drone learns to understand "the tall glass building" just as well as "the skyscraper with the blue roof."

4. The "Star Student" (OpenFly-Agent)

Having a library is great, but you need a student to learn from it. The authors created a new AI model called OpenFly-Agent.

  • The Problem: When a drone flies, it sees thousands of images. If you feed a model every single frame of a video, it gets overwhelmed and loses track of the important stuff (like the landmark it's looking for).
  • The Solution: OpenFly-Agent is like a smart camera operator. Instead of recording every single second, it knows exactly when to hit "record."
    • It ignores the boring parts where the drone is just flying straight.
    • It focuses intensely on the Keyframes—the moments when the drone sees a landmark or needs to turn.
  • The Result: This makes the AI faster, smarter, and much better at following instructions without getting confused by visual clutter.
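A minimal sketch of the keyframe idea: keep only the frames where the drone's behavior changes (a turn, a climb, a stop) and collapse the long straight stretches. The action names and the example flight below are invented for illustration; the actual OpenFly-Agent performs keyframe-aware selection inside the model, not with a hand-written rule like this.

```python
def select_keyframes(actions):
    """Keep the indices of frames where the drone's action changes.

    `actions` has one entry per camera frame (e.g. "forward",
    "turn_left"). Long runs of identical actions -- flying straight --
    collapse to a single keyframe, which is the gist of keeping only
    the "interesting" moments.
    """
    keyframes = []
    previous = None
    for i, action in enumerate(actions):
        if action != previous:  # behavior changed: worth remembering
            keyframes.append(i)
        previous = action
    return keyframes

# A made-up flight: mostly straight, with a turn, a climb, and a stop.
flight = ["forward", "forward", "forward", "turn_left",
          "forward", "forward", "ascend", "forward", "stop"]
print(select_keyframes(flight))  # indices of the kept frames
```

In this toy version a nine-frame flight shrinks to six keyframes, and a flight that is pure straight-line cruising shrinks to one, which is exactly the "ignore the boring parts" behavior described above.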

Why Does This Matter?

Before OpenFly, teaching drones to navigate was like trying to learn a language by reading a dictionary one word at a time. It was slow and limited.

OpenFly is like a full immersion language camp.

  • It provides the environment (the world).
  • It provides the lessons (the data).
  • It provides the teacher (the AI model).

The result? Drones that can actually understand complex human instructions and navigate real-world cities safely. The authors tested their system in real life, and it worked significantly better than any previous method, proving that this "automated school" approach is the future of flying robots.

In short: OpenFly built a massive, automated factory that creates millions of flight lessons for drones, and a smart student that learns from them, so we can finally have drones that can fly themselves to deliver packages or find lost hikers without crashing.