3D UAV Trajectory Estimation and Classification from Internet Videos via Language Model

This paper presents a novel, annotation-free framework that leverages language models and vision-language reasoning to autonomously extract 3D UAV trajectories and classifications from Internet-scale videos. It demonstrates that zero-shot transfer performance on anti-UAV tasks improves consistently with increased data volume, without requiring any training on the target domain.

Haoxiang Lei, Daotong Wang, Shenghai Yuan, Jianbo Su

Published Wed, 11 Ma

Imagine you are trying to teach a robot how to catch a rogue drone flying in the sky. To do this, the robot needs to know exactly where the drone is in 3D space (up/down, left/right, forward/back) and what kind of drone it is.

Usually, to teach a robot this, you need a team of expensive engineers with high-tech laser scanners and a lot of time to manually label thousands of videos. It's like trying to teach a child to drive by having a professional instructor sit in the passenger seat for every single mile they drive. It's accurate, but it's incredibly expensive and slow.

This paper proposes a cheaper, faster, and smarter way: "Let the internet teach the robot."

Here is how their new system works, broken down into three simple steps using everyday analogies:

1. The "Smart Librarian" (Language-Driven Data Acquisition)

Imagine you have a massive library of videos from YouTube, TikTok, and other sites. Most of these videos are useless for your robot: some are shaky "selfie" videos, some are tutorials, and some don't even show a drone.

Instead of hiring humans to watch every video, the authors use a Smart Librarian (an AI language model).

  • The Search: The Librarian asks the internet, "Show me videos of drones flying."
  • The Filter: The Librarian then uses a "Vision-Language" assistant (an AI that can see and read) to look at the videos. It asks: "Is the drone clearly visible? Is the camera steady, or is the person holding the camera running around?"
  • The Result: It throws away the shaky, confusing videos and keeps only the clear, steady shots of drones. It's like a bouncer at a club who only lets in the people who fit the dress code.
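The filtering step can be sketched in a few lines. Note the hedging: `vlm_answer` below is a hypothetical stand-in for a real vision-language model call, and the yes/no questions are illustrative; this is not the authors' actual interface.

```python
def vlm_answer(frame, question):
    """Placeholder for a vision-language model call (hypothetical).
    A real system would send the frame and question to a VLM here."""
    raise NotImplementedError

def keep_video(frames, vlm=vlm_answer):
    """Keep a video only if the VLM says the drone is clearly visible
    and the camera is steady in a sampled frame (illustrative checks)."""
    checks = [
        "Is a drone clearly visible in this frame?",
        "Is the camera steady (not handheld or shaking)?",
    ]
    frame = frames[len(frames) // 2]  # sample the middle frame as a proxy
    return all(vlm(frame, q).strip().lower() == "yes" for q in checks)
```

In practice one would sample several frames per video and require a majority of "yes" answers, but the bouncer logic is the same: any failed check and the video is out.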

2. The "Detective Team" (Training-Free Cross-Modal Label Generation)

Now that we have good videos, we need to guess the drone's 3D path and type without ever having seen a labeled dataset before.

  • The Detective Squad: Instead of relying on one detective, they use a team of three different AI "experts" (detection models). They all look at the same video frame.
    • Expert A says, "I see a box here."
    • Expert B says, "I see a box there."
    • Expert C says, "I see a box right in the middle."
  • The Consensus: If at least two experts agree on where the drone is, the system trusts them. It averages their guesses to get a very accurate 2D position.
  • The Size Guess: The system then asks a powerful AI (like a super-smart chatbot), "Based on what this drone looks like, how big is it in real life?"
  • The 3D Leap: By knowing how big the drone should be and how big it looks on the screen, the system can mathematically guess how far away it is (depth). It's like judging how far away a car is by looking at how small its taillights appear.
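The consensus-and-depth idea above can be sketched as follows. The box format, the IoU agreement threshold, and the pinhole-style depth formula are illustrative assumptions, not the paper's exact method:

```python
def consensus_box(boxes, iou_thresh=0.5):
    """boxes: list of (x, y, w, h) pixel boxes from the three detectors.
    Average the boxes that agree with at least one other box (by IoU);
    return None if fewer than two detectors agree."""
    def iou(a, b):
        ax1, ay1, ax2, ay2 = a[0], a[1], a[0] + a[2], a[1] + a[3]
        bx1, by1, bx2, by2 = b[0], b[1], b[0] + b[2], b[1] + b[3]
        ix = max(0.0, min(ax2, bx2) - max(ax1, bx1))
        iy = max(0.0, min(ay2, by2) - max(ay1, by1))
        inter = ix * iy
        union = a[2] * a[3] + b[2] * b[3] - inter
        return inter / union if union else 0.0

    # Each box matches itself with IoU 1, so ">= 2" means
    # "agrees with at least one other detector".
    agreeing = [b for b in boxes
                if sum(iou(b, o) >= iou_thresh for o in boxes) >= 2]
    if len(agreeing) < 2:
        return None  # no consensus: discard this frame
    n = len(agreeing)
    return tuple(sum(b[i] for b in agreeing) / n for i in range(4))

def depth_from_size(real_width_m, pixel_width, focal_px):
    """Pinhole camera model: distance = focal_length * real_size / pixel_size."""
    return focal_px * real_width_m / pixel_width
```

So a drone the chatbot estimates at 0.3 m across, appearing 30 pixels wide through a lens with an 800-pixel focal length, would be placed about 8 m away: the taillight trick, in math.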

3. The "Physics Coach" (Physics-Informed Refinement)

The guesses from the "Detective Team" are good, but they might be a little jittery or wobbly, like a shaky hand drawing a line.

  • The Coach: The system brings in a Physics Coach. This coach knows the laws of physics: "Drones can't teleport. They can't turn 90 degrees instantly. They have momentum."
  • The Correction: The Coach smooths out the wobbly line. If the AI guessed the drone jumped 10 feet in a split second, the Coach says, "No, that's impossible. Let's adjust the path to make it look like a real, smooth flight."
  • The Result: A clean, realistic 3D flight path that respects the laws of motion.
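One simple way to play Physics Coach is to clamp physically impossible jumps and then smooth the rest. The speed limit and smoothing factor below are illustrative assumptions, not the paper's actual refinement scheme:

```python
def refine_trajectory(points, dt, v_max=20.0, alpha=0.3):
    """points: list of (x, y, z) estimates in metres; dt: seconds per frame.
    Step 1: cap any jump implying a speed above v_max m/s ("no teleporting").
    Step 2: exponential smoothing to remove residual jitter.
    Illustrative stand-in for the paper's physics-informed refinement."""
    out = [points[0]]
    for p in points[1:]:
        prev = out[-1]
        step = [p[i] - prev[i] for i in range(3)]
        dist = sum(s * s for s in step) ** 0.5
        max_step = v_max * dt
        if dist > max_step:  # impossible jump: shrink it to the speed limit
            scale = max_step / dist
            p = tuple(prev[i] + step[i] * scale for i in range(3))
        # Blend toward the (possibly clamped) measurement to smooth jitter.
        out.append(tuple((1 - alpha) * prev[i] + alpha * p[i] for i in range(3)))
    return out
```

A production system would more likely use a Kalman filter or a trajectory optimizer with dynamics constraints, but the principle is the same: measurements that violate momentum get pulled back toward a physically plausible path.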

The Big Surprise: "The More, The Merrier"

The most exciting part of this paper is what happened when they fed the system more internet videos.

Usually, in AI, if you don't train a model on the specific data you want to test it on, it fails. But here, they tested their system on a famous, high-quality dataset (MMAUD) that they had never seen before.

  • The Scaling Effect: As they added more and more internet videos to their training pool (from a few hours up to 200,000 seconds, roughly 55 hours, of video), the system got better and better at guessing the 3D paths on the test dataset.
  • The Analogy: It's like a student who has never taken a specific math test but has read 10,000 math books. When they finally take the test, they do almost as well as the student who memorized the specific test answers.

Why This Matters

This method is a game-changer because:

  1. It's Free: It uses videos already on the internet.
  2. It's Fast: No humans need to manually label thousands of hours of video.
  3. It Works: It performs almost as well as the most expensive, high-tech systems currently available, making it possible to build better anti-drone defense systems for the real world without breaking the bank.

In short, they built a system that learns to catch drones by watching millions of YouTube videos, using AI to filter the noise, a team of AI detectives to find the targets, and a physics coach to make sure the flight paths make sense.