Perception-to-Pursuit: Track-Centric Temporal Reasoning for Open-World Drone Detection and Autonomous Chasing

This paper introduces Perception-to-Pursuit (P2P), a track-centric temporal reasoning framework that pairs a causal transformer with a new Intercept Success Rate (ISR) metric to bridge the gap between drone detection and kinematically feasible autonomous chasing. P2P achieves significant improvements in both prediction accuracy and pursuit feasibility over existing tracking-only methods.

Venkatakrishna Reddy Oruganti

Published 2026-02-23

Imagine you are playing a high-stakes game of tag with a drone. You are the chaser, and the drone is the runner.

Most current computer systems are like bad referees. They can tell you exactly where the runner was a second ago, and they can extrapolate where the runner might go next along a simple straight line. But here's the problem: if the runner suddenly makes a sharp U-turn or speeds up, that straight-line guess is useless. Even worse, the system might predict a spot that is physically impossible for you to reach in time, even if the prediction itself was "mathematically correct."

This paper introduces a new system called Perception-to-Pursuit (P2P). Think of it as upgrading your brain from a "guessing machine" to a tactical coach.

Here is how it works, broken down into simple concepts:

1. The "Motion Token" (The Secret Language)

Instead of looking at the drone like a picture (which is heavy and full of unnecessary details like background trees), P2P translates the drone's movement into a compact 8-word sentence.

  • The Words: Where it is, how fast it's going, how fast it's speeding up (acceleration), how big it looks, and how smooth its path is.
  • The Analogy: Imagine trying to describe a dancer. A bad description says, "She is wearing a red dress and standing on a stage." A good description says, "She is spinning fast, accelerating to the left, and her movements are jerky." P2P speaks the language of movement, not just looks.
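
To make that "secret language" concrete, here is a minimal NumPy sketch of what an 8-dimensional motion token could look like. The feature layout, function name, and smoothness formula are our illustration of the idea, not the paper's actual implementation:

```python
import numpy as np

def motion_token(track):
    """Build a hypothetical 8-D motion token from a short track history.

    `track` is an array of (x, y, box_size) observations, one row per frame.
    Illustrative layout: position (2) + velocity (2) + acceleration (2)
    + apparent size (1) + path smoothness (1) = 8 numbers.
    """
    xy = track[:, :2]
    vel = np.diff(xy, axis=0)      # per-frame velocity
    acc = np.diff(vel, axis=0)     # per-frame acceleration
    # Smoothness: low average acceleration magnitude => smooth path (max 1.0).
    smoothness = 1.0 / (1.0 + np.linalg.norm(acc, axis=1).mean())
    return np.concatenate([
        xy[-1],          # where it is
        vel[-1],         # how fast it's going
        acc[-1],         # how fast it's speeding up
        [track[-1, 2]],  # how big it looks (bounding-box scale)
        [smoothness],    # how smooth its recent path is
    ])

# Example: a target drifting right at constant speed (a perfectly smooth path).
track = np.array([[0, 0, 10], [1, 0, 10], [2, 0, 10], [3, 0, 10]], float)
tok = motion_token(track)
print(tok.shape)  # (8,)
```

Note how little data this is compared with raw pixels: eight numbers per frame instead of an entire image, which is exactly why "speaking the language of movement" is cheap enough to run in real time.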

2. The "Time-Traveling Coach" (The Transformer)

The system uses a special AI brain (a Transformer) that looks at the last 12 frames of video (about half a second of history).

  • The Analogy: Think of a baseball pitcher. If you only look at where the ball is right now, you can't tell if it's a curveball or a fastball. But if you watch the pitcher's arm motion for the last split second, you can predict the curve.
  • P2P watches the drone's "arm motion" (its acceleration and turning patterns) to predict if it's about to dodge, hover, or speed up. It doesn't just guess a straight line; it guesses the intent.
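
The key mechanical ingredient here is *causal* attention: each frame may only look backward, never forward. Here is a toy NumPy sketch of that idea over a 12-frame window (the frame rate, token dimension, and single-head formulation are our assumptions, not details from the paper):

```python
import numpy as np

SEQ_LEN = 12  # ~half a second of track history (assuming roughly 24 fps)

def causal_mask(n):
    """Lower-triangular mask: frame t may attend only to frames 0..t."""
    return np.tril(np.ones((n, n), dtype=bool))

def causal_attention(tokens):
    """Single-head scaled dot-product self-attention with a causal mask.

    `tokens` is a (seq_len, dim) array of per-frame motion tokens.
    """
    n, d = tokens.shape
    scores = tokens @ tokens.T / np.sqrt(d)
    scores[~causal_mask(n)] = -np.inf          # never peek at the future
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ tokens                    # history-aware summaries

tokens = np.random.default_rng(0).normal(size=(SEQ_LEN, 8))
summary = causal_attention(tokens)             # shape (12, 8)
```

Because of the mask, the first frame can only attend to itself (its output equals its input), while the last frame blends the full 12-frame history, which is what lets the model read the "arm motion" rather than a single snapshot.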

3. The "Reality Check" (The New Scorecard)

This is the most important part. The authors realized that being "accurate" isn't enough. You need to be actionable.

  • The Old Way: "I predict the drone will be at the top of the mountain in 5 seconds." (Great prediction! But your interceptor drone can't fly that fast, so you can't catch it. The prediction is useless.)
  • The New Way (ISR Metric): The system introduces a new score called Intercept Success Rate (ISR). It asks: "Given my drone's top speed and turning limits, can I actually catch the target at this predicted spot?"
  • The Result: Old systems were wrong about catchability 99.9% of the time. P2P gets it right 60% of the time. That is a massive leap from "theoretically possible" to "actually doable."

4. The "Open-World" Superpower

Usually, AI needs to be trained on specific pictures of drones to recognize them. If it sees a new type of drone it's never seen before, it gets confused.

  • The Analogy: P2P is like a police officer who doesn't need to know the suspect's face. They just know that "only a drone moves like that." Because it focuses entirely on the motion pattern (how it hovers, turns, and accelerates), it can identify any drone, even ones it has never seen before, with 100% accuracy.

The Bottom Line

The paper solves a critical gap in autonomous defense.

  • Before: "I see the drone. I know where it will be. Good luck catching it." (Often impossible).
  • After (P2P): "I see the drone. I know it's about to dodge left. I know my drone can physically reach that spot in time. Let's go!"

It turns a passive observation system into an active, feasible pursuit system, ensuring that when the computer says "catch it," it actually means "you can catch it."
