Perception-to-Pursuit: Track-Centric Temporal Reasoning for Open-World Drone Detection and Autonomous Chasing

This paper introduces Perception-to-Pursuit (P2P), a track-centric temporal reasoning framework that pairs a causal transformer with a new Intercept Success Rate (ISR) metric to bridge the gap between drone detection and kinematically feasible autonomous chasing. P2P achieves significant improvements in both prediction accuracy and pursuit feasibility over existing tracking-only methods.

Venkatakrishna Reddy Oruganti

Published 2026-02-23

Imagine you are playing a high-stakes game of tag with a drone. You are the chaser, and the drone is the runner.

Most current computer systems are like bad referees. They can tell you exactly where the runner was a second ago, and they can extrapolate where the runner might go next along a simple straight line. But here's the problem: if the runner suddenly makes a sharp U-turn or speeds up, that straight-line guess is useless. Even worse, the system might predict a spot that is physically impossible for you to reach in time, even if the prediction itself was "mathematically correct."

This paper introduces a new system called Perception-to-Pursuit (P2P). Think of it as upgrading your brain from a "guessing machine" to a tactical coach.

Here is how it works, broken down into simple concepts:

1. The "Motion Token" (The Secret Language)

Instead of looking at the drone like a picture (which is heavy and full of unnecessary details like background trees), P2P translates the drone's movement into a compact 8-word sentence.

  • The Words: Where it is, how fast it's going, how fast it's speeding up (acceleration), how big it looks, and how smooth its path is.
  • The Analogy: Imagine trying to describe a dancer. A bad description says, "She is wearing a red dress and standing on a stage." A good description says, "She is spinning fast, accelerating to the left, and her movements are jerky." P2P speaks the language of movement, not just looks.
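
To make that "secret language" concrete, here is a minimal NumPy sketch of what an 8-dimensional motion token could look like. The feature layout, function name, and smoothness formula are our illustration of the idea, not the paper's actual implementation:

```python
import numpy as np

def motion_token(track):
    """Build a hypothetical 8-D motion token from a short track history.

    `track` is an array of (x, y, box_size) observations, one row per frame.
    Illustrative layout: position (2) + velocity (2) + acceleration (2)
    + apparent size (1) + path smoothness (1) = 8 numbers.
    """
    xy = track[:, :2]
    vel = np.diff(xy, axis=0)      # per-frame velocity
    acc = np.diff(vel, axis=0)     # per-frame acceleration
    # Smoothness: low average acceleration magnitude => smooth path (max 1.0).
    smoothness = 1.0 / (1.0 + np.linalg.norm(acc, axis=1).mean())
    return np.concatenate([
        xy[-1],          # where it is
        vel[-1],         # how fast it's going
        acc[-1],         # how fast it's speeding up
        [track[-1, 2]],  # how big it looks (bounding-box scale)
        [smoothness],    # how smooth its recent path is
    ])

# Example: a target drifting right at constant speed (a perfectly smooth path).
track = np.array([[0, 0, 10], [1, 0, 10], [2, 0, 10], [3, 0, 10]], float)
tok = motion_token(track)
print(tok.shape)  # (8,)
```

Note how little data this is compared with raw pixels: eight numbers per frame instead of an entire image, which is exactly why "speaking the language of movement" is cheap enough to run in real time.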

2. The "Time-Traveling Coach" (The Transformer)

The system uses a special AI brain (a Transformer) that looks at the last 12 frames of video (about half a second of history).

  • The Analogy: Think of a baseball pitcher. If you only look at where the ball is right now, you can't tell if it's a curveball or a fastball. But if you watch the pitcher's arm motion for the last split second, you can predict the curve.
  • P2P watches the drone's "arm motion" (its acceleration and turning patterns) to predict if it's about to dodge, hover, or speed up. It doesn't just guess a straight line; it guesses the intent.
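
The key mechanical ingredient here is *causal* attention: each frame may only look backward, never forward. Here is a toy NumPy sketch of that idea over a 12-frame window (the frame rate, token dimension, and single-head formulation are our assumptions, not details from the paper):

```python
import numpy as np

SEQ_LEN = 12  # ~half a second of track history (assuming roughly 24 fps)

def causal_mask(n):
    """Lower-triangular mask: frame t may attend only to frames 0..t."""
    return np.tril(np.ones((n, n), dtype=bool))

def causal_attention(tokens):
    """Single-head scaled dot-product self-attention with a causal mask.

    `tokens` is a (seq_len, dim) array of per-frame motion tokens.
    """
    n, d = tokens.shape
    scores = tokens @ tokens.T / np.sqrt(d)
    scores[~causal_mask(n)] = -np.inf          # never peek at the future
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ tokens                    # history-aware summaries

tokens = np.random.default_rng(0).normal(size=(SEQ_LEN, 8))
summary = causal_attention(tokens)             # shape (12, 8)
```

Because of the mask, the first frame can only attend to itself (its output equals its input), while the last frame blends the full 12-frame history, which is what lets the model read the "arm motion" rather than a single snapshot.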

3. The "Reality Check" (The New Scorecard)

This is the most important part. The authors realized that being "accurate" isn't enough. You need to be actionable.

  • The Old Way: "I predict the drone will be at the top of the mountain in 5 seconds." (Great prediction! But your interceptor drone can't fly that fast, so you can't catch it. The prediction is useless.)
  • The New Way (ISR Metric): The system introduces a new score called Intercept Success Rate (ISR). It asks: "Given my drone's top speed and turning limits, can I actually catch the target at this predicted spot?"
  • The Result: Old systems were wrong about catchability 99.9% of the time. P2P gets it right 60% of the time. That is a massive leap from "theoretically possible" to "actually doable."

4. The "Open-World" Superpower

Usually, AI needs to be trained on specific pictures of drones to recognize them. If it sees a new type of drone it's never seen before, it gets confused.

  • The Analogy: P2P is like a police officer who doesn't need to know the suspect's face. They just know that "only a drone moves like that." Because it focuses entirely on the motion pattern (how it hovers, turns, and accelerates), it can identify any drone, even ones it has never seen before, with 100% accuracy.

The Bottom Line

The paper solves a critical gap in autonomous defense.

  • Before: "I see the drone. I know where it will be. Good luck catching it." (Often impossible).
  • After (P2P): "I see the drone. I know it's about to dodge left. I know my drone can physically reach that spot in time. Let's go!"

It turns a passive observation system into an active, feasible pursuit system, ensuring that when the computer says "catch it," it actually means "you can catch it."
