Denoise to Track: Harnessing Video Diffusion Priors for Robust Correspondence

The paper introduces HeFT, a zero-shot point tracking framework built on pretrained video diffusion models. By probing the models' internal representations, it identifies specialized attention heads and stable low-frequency components, then selectively harnesses them through a denoising-based strategy to achieve state-of-the-art tracking performance without annotated training data.

Tianyu Yuan, Yuanbo Yang, Lin-Zhuo Chen, Yao Yao, Zhuzhong Qian

Published 2026-03-24

Imagine you are trying to follow a specific red balloon as it floats through a crowded, chaotic parade. The balloon gets hidden behind a marching band, then pops out, then gets obscured by a tree. This is the challenge of Point Tracking in computer vision: keeping your eye on a single dot as it moves through a video.

Most computer programs that do this today are like students who have memorized a textbook. They were trained on millions of labeled videos (where humans drew dots on every frame). They are good, but if they see a parade they haven't seen before, or if the lighting changes, they get confused. They need a lot of expensive homework (training data) to learn.

This paper introduces a new method called HeFT (Head-Frequency Tracker). Instead of memorizing a textbook, HeFT uses a super-intelligent artist who has never seen your specific parade but has watched every parade ever made.

Here is how it works, broken down with simple analogies:

1. The "Super-Artist" (The Video Diffusion Model)

The researchers use a pre-trained AI called a Video Diffusion Transformer (VDiT). Think of this AI as a master painter who has spent years learning how to generate realistic videos from scratch. Because it learned to create videos, it understands how objects move, how they look, and how they relate to each other over time. It has "common sense" about the world.

The team asked: Can we use this artist's brain to track a balloon, even though the artist was never taught to track things?

2. The "Orchestra" Analogy (Head Specialization)

Inside this super-intelligent AI, there isn't just one brain; there is a massive orchestra of Attention Heads.

  • The Old Way: Previous methods treated the whole orchestra as one big, blurry sound. They took the average of everything.
  • The HeFT Discovery: The researchers realized that each musician (head) has a special job.
    • Some musicians are Matchmakers: They are experts at finding "that's the same balloon from the last frame!"
    • Some are Semantic specialists: They care about the type of object (e.g., "that's a person").
    • Some are Positional: They care about where things are in space.

The Analogy: Imagine trying to find a friend in a crowd. If you ask a whole group of people (the whole layer) for help, you get a confused mumble. But if you ask the one person who is an expert at recognizing faces (the Matching Head), they will point you right to your friend. HeFT ignores the noise and only listens to the "Matchmaker" musician.
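The "ask the one expert" idea can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: it assumes you have already extracted a feature map from a single selected head for two frames (the shapes and the function name `track_with_head` are hypothetical), and it matches the query point by cosine similarity.

```python
import numpy as np

def track_with_head(feat_src, feat_tgt, query_yx):
    """Match a query point using ONE selected head's feature map.

    feat_src, feat_tgt: (H, W, C) feature maps taken from a single
    "matching" head (hypothetical shapes; the paper's exact extraction
    point differs).
    query_yx: (row, col) of the point in the source frame.
    Returns the (row, col) in the target frame with the highest cosine
    similarity -- the analogue of listening only to the "Matchmaker".
    """
    q = feat_src[query_yx]                                  # (C,) query descriptor
    q = q / np.linalg.norm(q)                               # unit-normalize
    tgt = feat_tgt / np.linalg.norm(feat_tgt, axis=-1, keepdims=True)
    sim = tgt @ q                                           # (H, W) similarity map
    return np.unravel_index(np.argmax(sim), sim.shape)      # best match location
```

Averaging feature maps over all heads before matching (the "confused mumble") would blur this similarity map; using one specialized head keeps the peak sharp.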

3. The "Radio Static" Analogy (Frequency Filtering)

The AI processes information in different "frequencies," kind of like radio waves.

  • Low Frequencies: These are the smooth, steady signals. They tell you the big picture: "The balloon is moving left." They are stable and reliable.
  • High Frequencies: These are the sharp, jagged signals. They carry tiny details like texture and grain. They can look impressive, but they are often just static that confuses the tracker.

The Analogy: Imagine trying to hear a conversation in a noisy room. The "High Frequencies" are the clinking of silverware and the chatter of other tables. The "Low Frequencies" are the clear voice of your friend. HeFT puts on noise-canceling headphones to block out the high-frequency static and only listens to the smooth, low-frequency voice that tells the truth about where the object is.
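The "noise-canceling headphones" step amounts to a low-pass filter. Below is a minimal sketch using a 2D FFT: keep only a centered square of low spatial frequencies and discard the rest. The function name `lowpass_features` and the `keep_ratio` knob are illustrative assumptions, not the paper's parameters.

```python
import numpy as np

def lowpass_features(feat, keep_ratio=0.25):
    """Keep only low spatial frequencies of a (H, W) feature channel.

    FFT the map, zero everything outside a centered low-frequency
    square, and invert. keep_ratio controls how much of the spectrum
    survives (an illustrative knob, not the paper's setting).
    """
    H, W = feat.shape
    spec = np.fft.fftshift(np.fft.fft2(feat))       # center the zero frequency
    mask = np.zeros_like(spec)
    h, w = int(H * keep_ratio), int(W * keep_ratio)
    cy, cx = H // 2, W // 2
    mask[cy - h:cy + h + 1, cx - w:cx + w + 1] = 1.0  # low-frequency square
    return np.real(np.fft.ifft2(np.fft.ifftshift(spec * mask)))
```

Applied to a smooth signal contaminated with fine-grained oscillation, this returns something very close to the smooth part alone, which is exactly the "clear voice" the tracker should listen to.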

4. The "One-Step" Trick (Denoise to Track)

Usually, these AI models take many steps to turn a blurry mess into a clear video. The researchers found a shortcut. They realized that if they take a real video, add a tiny bit of "noise" to it, and then let the AI clean it up just once, the AI's internal "Matchmaker" neurons light up with the perfect information needed to track the point.

It's like asking the artist to "fix" a slightly blurry photo for a split second. In that split second, the artist's brain reveals exactly where the object is, without needing to finish painting the whole picture.
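The one-step trick can be written down abstractly. This sketch assumes a `denoiser` callable standing in for the pretrained video diffusion model (it must return the denoised video and its internal features); the function name, the scalar noise level `t`, and the mixing formula are simplified illustrations, not the paper's noise schedule.

```python
import numpy as np

def one_step_features(video, denoiser, t=0.1, rng=None):
    """Extract features with a SINGLE denoising pass (sketch).

    Add a small amount of Gaussian noise to the real video, run the
    model's denoiser once, and keep the internal features -- no full
    multi-step sampling loop is ever executed.
    """
    rng = rng or np.random.default_rng(0)
    noise = rng.normal(size=video.shape)
    noisy = np.sqrt(1 - t) * video + np.sqrt(t) * noise  # lightly corrupted input
    _, feats = denoiser(noisy, t)                        # one forward pass only
    return feats
```

The point of the shortcut is cost: instead of dozens of sampling steps to "finish the painting," a single forward pass already exposes the matching-relevant activations.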

5. The "Safety Net" (Forward-Backward Check)

To make sure the tracker doesn't get lost when the balloon hides behind a tree, HeFT uses a safety net.

  • It tracks the balloon forward in time.
  • Then, it tracks backward from the new spot to the start.
  • If the two paths don't meet up perfectly, the system knows the balloon is hidden (occluded) and stops guessing, preventing it from drifting off into the wrong part of the screen.
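The safety net above is a standard forward-backward (cycle-consistency) check. A minimal sketch, assuming some matcher `track_fn` (such as a per-head similarity match) and an illustrative pixel tolerance `tol`:

```python
import numpy as np

def cycle_check(track_fn, feat_a, feat_b, query_yx, tol=1.0):
    """Forward-backward occlusion test (sketch).

    Track the point A -> B, then track the result back B -> A with the
    same matcher. If the round trip lands more than tol pixels from the
    starting point, flag the point as occluded rather than trusting the
    forward match.
    """
    fwd = track_fn(feat_a, feat_b, query_yx)     # forward pass
    back = track_fn(feat_b, feat_a, fwd)         # backward pass
    err = np.hypot(back[0] - query_yx[0], back[1] - query_yx[1])
    visible = err <= tol                         # paths must (nearly) meet
    return fwd, visible
```

When the target is hidden, the forward match latches onto the wrong region, the backward pass fails to return home, and the large round-trip error tells the system to stop guessing.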

The Result

By listening to the right "musician" (Head Selection) and filtering out the "static" (Frequency Filtering), HeFT can track points in videos without any training data.

  • Zero-Shot: It hasn't been taught with labeled dots. It just uses its general knowledge of how the world works.
  • Performance: It comes close to the accuracy of expensive, heavily trained systems, while being more robust to unfamiliar scenes and needing no massive labeled dataset.

In short: The paper teaches us that we don't need to train a new specialist for every job. We can just take a generalist genius (the Video Diffusion Model), ask the right specific question (select the right head), and ignore the noise (filter the frequencies) to get world-class results.
