TAPFormer: Robust Arbitrary Point Tracking via Transient Asynchronous Fusion of Frames and Events

This paper introduces TAPFormer, a transformer-based framework that achieves robust arbitrary point tracking by employing a Transient Asynchronous Fusion mechanism to effectively bridge the temporal gap between low-rate RGB frames and high-rate event streams, outperforming existing methods on both a newly constructed real-world dataset and standard benchmarks.

Jiaxiong Liu, Zhen Tan, Jinpu Zhang, Yi Zhou, Hui Shen, Xieyuanli Chen, Dewen Hu

Published 2026-03-06

Imagine you are trying to follow a specific red dot on a fast-moving car as it zooms past you.

If you use a standard camera (like the one in your phone), you are taking a series of photos. If the car moves too fast, the dot becomes a blurry smear. If the sun is too bright or it's pitch black, the camera might be blinded or see nothing at all. You lose track of the dot because your "photos" are too slow and too sensitive to bad lighting.

If you use an event camera (a special, futuristic sensor), it doesn't take photos. Instead, it's like a room full of tiny, hyper-sensitive motion detectors. Every time a single pixel changes brightness, it shouts out a message: "Hey! Something moved here at 12:00:01!" These sensors are incredibly fast and work in the dark, but they don't see colors or shapes. They just see "movement." If the car stops moving, the event camera goes silent, and you lose the dot again.

The Problem:
Trying to combine these two is like trying to have a conversation between a slow, detailed storyteller (the standard camera) and a frantic, high-speed telegraph operator (the event camera).

  • If you just mash them together, the timing is off. The storyteller is still talking about the car's position from a second ago, while the telegraph operator is already shouting about where it is now.
  • If the car moves fast, the storyteller gets blurry.
  • If the car stops, the telegraph operator stops talking.

The Solution: TAPFormer
The researchers built TAPFormer, a new AI system that acts like a brilliant conductor for this orchestra. It doesn't just listen to both; it understands how they speak differently and merges them into a single, perfect story.

Here is how it works, using simple analogies:

1. The "Time-Traveling Bridge" (Transient Asynchronous Fusion)

Usually, AI tries to force the fast event camera to wait for the slow photo camera, or vice versa. TAPFormer does something smarter.

Imagine the standard camera is a slow-motion video, and the event camera is a stream of water.

  • When the slow video takes a picture, TAPFormer grabs that picture.
  • But instead of waiting for the next picture, it uses the "stream of water" (the events) to fill in the gaps between the photos.
  • It treats the scene as a continuous movie, not a series of stills. Even if the camera only takes 20 photos a second, TAPFormer uses the event data to update the position of the dot 100 times a second. It's like having a GPS that updates your location every millisecond, even if your map only refreshes every few seconds.
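The gap-filling idea above can be sketched in a few lines. This is an illustrative toy, not the paper's actual implementation: the names, rates, and the assumption that events reduce to small per-slice displacements are all simplifications for the analogy.

```python
import numpy as np

FRAME_HZ, EVENT_HZ = 20, 100            # illustrative rates from the analogy
STEPS_PER_FRAME = EVENT_HZ // FRAME_HZ  # 5 event updates per frame gap

def track_between_frames(point, event_flows):
    """Advance a 2-D point through the gap between two RGB frames,
    nudging it by one small event-derived displacement per slice."""
    trajectory = [np.asarray(point, dtype=float)]
    for flow in event_flows:  # the "stream of water" between photos
        trajectory.append(trajectory[-1] + flow)
    return trajectory

# Example: the dot drifts right-and-slightly-down between two frames.
flows = [np.array([1.0, 0.2])] * STEPS_PER_FRAME
traj = track_between_frames([10.0, 5.0], flows)
print(traj[-1])  # position just before the next frame arrives
```

The point is the shape of the computation: the frame anchors the point, and the events carry it forward at a much higher rate than new frames arrive.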

2. The "Smart Spotlight" (Cross-Modal Locally Weighted Fusion)

Sometimes the standard camera is blurry (because of motion), but the event camera is sharp. Other times, the event camera is silent (because the object stopped), but the standard camera is clear.

TAPFormer has a smart spotlight that knows exactly which sensor to trust at any given moment.

  • Scenario A: The car is speeding, and the photo is a blur. The spotlight shines on the event camera, saying, "You're the expert right now! Tell us where the dot is."
  • Scenario B: The car stops, and the event camera goes quiet. The spotlight switches to the standard camera, saying, "You're the expert now! Show us the details."
  • It does this for every tiny part of the image, instantly. It's like a team of two detectives where one is great at seeing details in the dark, and the other is great at spotting fast movement. They constantly swap the "lead detective" role depending on who has the best clue at that exact second.
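The "smart spotlight" amounts to a per-location weighted blend of the two feature streams. Here is a minimal sketch, assuming hand-set confidence maps and a softmax blend; in the real module these weights are learned, and the function and variable names below are hypothetical.

```python
import numpy as np

def fuse(frame_feat, event_feat, frame_conf, event_conf):
    """Blend two HxWxC feature maps using per-pixel softmax weights
    over the two modalities (the 'spotlight' for each location)."""
    logits = np.stack([frame_conf, event_conf])       # (2, H, W)
    w = np.exp(logits) / np.exp(logits).sum(axis=0)   # softmax per pixel
    return w[0][..., None] * frame_feat + w[1][..., None] * event_feat

# Scenario B from the text: the car has stopped, so the event branch is
# quiet (low confidence) and the frame branch is sharp (high confidence).
H, W, C = 4, 4, 8
frame_feat = np.ones((H, W, C))       # clear frame features
event_feat = np.zeros((H, W, C))      # near-silent event features
frame_conf = np.full((H, W), 3.0)     # trust the frame here
event_conf = np.full((H, W), -3.0)    # distrust the quiet events
fused = fuse(frame_feat, event_feat, frame_conf, event_conf)
# The fused features lean almost entirely on the frame branch.
```

Because the weights are computed per pixel, one region of the image can lean on the event branch while another leans on the frame branch at the same instant, which is exactly the "lead detective" swap described above.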

3. The "Super-Tracker"

By combining these two tricks, TAPFormer can track a point on a car moving at 100 mph in the rain, at night, or when the sun is glaring.

  • Standard trackers lose the dot when the car blurs.
  • Event-only trackers lose the dot when the car stops.
  • TAPFormer keeps the dot through both cases. It uses the "water stream" to smooth out the "slow photos," producing a stable, high-rate trajectory that prior methods struggle to match.

Why This Matters

This isn't just about tracking dots. This technology is the brain behind:

  • Self-driving cars that need to see pedestrians jumping out from behind a truck in the rain.
  • Augmented Reality (AR) glasses that need to stick a virtual sticker perfectly to a moving object, even if you move your head fast.
  • Robotics that need to catch a ball without dropping it, even in low light.

In short, TAPFormer is the ultimate team player. It takes the "slow but detailed" and the "fast but vague," and fuses them into a super-vision that sees everything, all the time, no matter how chaotic the world gets.