Match4Annotate: Propagating Sparse Video Annotations via Implicit Neural Feature Matching

The Big Problem: The "Labeling Bottleneck"

Imagine you are a doctor trying to teach a computer to recognize a beating heart in an ultrasound video. To do this, the computer needs to learn by example. A human expert has to draw a line around the heart in every single frame of the video.

If a video has 1,000 frames, that's 1,000 drawings. If you have 10,000 videos, that's 10 million drawings. This is incredibly expensive and slow (think hundreds of dollars per hour for expert time). It's like trying to paint a masterpiece by hand, one tiny dot at a time, for a whole gallery.

The Old Solutions (And Why They Failed)

Scientists tried to automate this by letting the computer "guess" the rest of the drawings based on the first one.

The "Video Tracker" approach: Imagine a GPS that works great for one specific car trip but forgets the map the moment you start a new trip. Old trackers could follow a heart in one video, but they couldn't learn from a heart in a different patient's video.
The "Keypoint" approach: Imagine trying to match two photos of a foggy wall. You can't find any distinct features (like a crack or a stain) to grab onto. Old methods relied on finding these "distinct features," but medical images are often smooth and blurry, making them fail.
The "One-Shot" approach: These are like students who memorize one specific textbook perfectly but fail if you ask them a question from a slightly different book. They struggle to generalize across different videos.

The New Solution: Match4Annotate

The authors created Match4Annotate, a system that acts like a flexible translator for annotations. It can take a drawing made on a single frame and "propagate" (spread) it to every other frame of that video, and even to videos of different patients.

Here is how it works, broken down into three simple steps:

1. The "Infinite Zoom" Map (Implicit Neural Features)

Usually, computer vision sees images like a low-resolution grid (like a pixelated Minecraft world). If you zoom in, it gets blocky.

The Analogy: Imagine you have a low-res map of a city. If you want to know the street name at a specific corner, you might guess wrong because the map is blurry.
The Fix: Match4Annotate uses a special mathematical tool called SIREN to turn that pixelated map into a smooth, infinite-resolution fluid. It's like having a map where you can zoom in to the molecular level, and the streets are still perfectly clear. It learns the "essence" of the heart's shape, not just the pixels. This allows it to find the heart in a new video even if the image is blurry or the angle is different.

2. The "Flow Guide" (Implicit Deformation Field)

When a heart beats, it doesn't just move; it stretches, squishes, and twists.

The Analogy: Imagine trying to match a photo of a balloon before it's inflated to one after it's inflated. If you just look for the "same spot," you'll get lost.
The Fix: The system learns a "Flow Guide." It's like a weather map showing wind currents. It predicts, "If the heart moves this way, the tissue here will stretch that way." It uses this prediction to guide the matching process, ensuring the computer doesn't get confused by the stretching. It tells the system, "Don't look for the exact pixel; look for the pixel that would be there if the heart moved like this."

3. The "Interior Point" Strategy (For Masks)

Sometimes you need to draw the inside of the heart, not just the outline.

The Analogy: If you try to draw a circle by only connecting the dots on the edge, and one dot is wrong, the whole circle looks jagged and broken.
The Fix: Instead of just tracking the edge, Match4Annotate picks hundreds of dots inside the heart shape and moves all of them to the new frame. Then it uses a "spray paint" technique (Kernel Density Estimation) to fill in the shape based on where those dots landed. Even if a few dots land slightly off, the "spray paint" smooths them out, producing a clean, solid shape.

Why This Matters

It's Universal: You can draw a heart on Patient A's video, and the system can transfer that drawing to Patient B's video, even if the two hearts differ in size or shape.
It's Fast: It doesn't need a supercomputer. It can be trained on a standard gaming PC in just a few minutes per video.
It's Flexible: It handles both single points (like tracking a specific spot on a bone) and full shapes (like outlining a whole organ).

The Bottom Line

Match4Annotate is like giving a computer a "sixth sense" for medical videos. Instead of forcing the computer to memorize every single frame, it teaches the computer to understand the flow and shape of the anatomy. This means doctors can label a few frames, and the computer does the rest, saving thousands of hours of expensive expert time and making advanced medical AI accessible to more hospitals.

1. Problem Statement

The deployment of computer vision in specialized domains, particularly medical imaging (e.g., ultrasound), is bottlenecked by the high cost and time required for expert per-frame annotation.

Current Limitations:
- Video Trackers/Segmenters (e.g., SAM2, CoTracker3): Effective for intra-video propagation but require per-video initialization and cannot generalize labels across different video sequences (inter-video).
- Classic Correspondence Pipelines (e.g., SuperPoint, LightGlue): Rely on detector-chosen keypoints, so user-specified points and dense masks cannot be reliably matched, especially in low-texture, low-contrast medical scenes.
- Foundation Model Correspondence (e.g., DIFT, MATCHA): Enable cross-video matching but often lack spatiotemporal smoothness, leading to jitter/drift, and struggle to unify point and mask propagation.

Goal: Develop a lightweight framework capable of propagating both point and mask annotations within a video (intra-video) and across different videos of the same anatomy (inter-video) without manual re-initialization.

2. Methodology: Match4Annotate

The proposed framework utilizes a test-time optimization strategy to fit implicit neural representations to features extracted from a frozen Vision Foundation Model (VFM). The pipeline consists of three core components:

A. High-Resolution Spatiotemporal Implicit Feature Representation

Base Features: Dense features are extracted from each frame using a frozen DINOv3 (ViT-S/16) model.
Implicit Representation (SIREN): Instead of using discrete feature maps, the method fits a SIREN (Sinusoidal Representation Network) to learn a continuous function $f_\theta(x, y, t)$.
- Input: Spatiotemporal coordinates $(x, y, t)$.
- Output: High-resolution feature vectors.
- Advantage: This allows querying features at arbitrary spatial resolutions (sub-patch granularity) and enforces smoothness over space and time, mitigating interpolation artifacts common in medical ultrasound.
Training: The network is optimized via a reconstruction loss that aligns the upsampled SIREN features with the downsampled DINOv3 features.
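As a rough illustration of the implicit representation described above, the sketch below evaluates a tiny SIREN in plain Python: sine-activated layers map a continuous coordinate $(x, y, t)$ to a feature vector. The layer widths, the frequency scale `omega0 = 30`, the initialization ranges (borrowed from common SIREN practice, with biases drawn from the same range for simplicity), and the helper names `init_siren` and `siren_features` are all illustrative assumptions, not the authors' implementation. Fitting the weights to DINOv3 features is omitted.

```python
import math
import random

def init_siren(sizes, omega0=30.0, seed=0):
    """Create random weights for a SIREN with layer widths `sizes` (hypothetical helper)."""
    rng = random.Random(seed)
    layers = []
    for i, (n_in, n_out) in enumerate(zip(sizes[:-1], sizes[1:])):
        # Assumed init: first layer U(-1/n_in, 1/n_in), later layers scaled by sqrt(6/n_in)/omega0.
        bound = 1.0 / n_in if i == 0 else math.sqrt(6.0 / n_in) / omega0
        weights = [[rng.uniform(-bound, bound) for _ in range(n_in)] for _ in range(n_out)]
        biases = [rng.uniform(-bound, bound) for _ in range(n_out)]
        layers.append((weights, biases))
    return layers

def siren_features(layers, x, y, t, omega0=30.0):
    """Evaluate the network at one continuous coordinate, assumed normalized to [-1, 1]."""
    h = [x, y, t]
    for i, (weights, biases) in enumerate(layers):
        z = [sum(w * v for w, v in zip(row, h)) + b for row, b in zip(weights, biases)]
        is_last = i == len(layers) - 1
        # Hidden layers apply sin(omega0 * z); the output layer stays linear.
        h = z if is_last else [math.sin(omega0 * zi) for zi in z]
    return h

# Query an 8-dimensional feature at an arbitrary (off-grid) location.
layers = init_siren([3, 16, 16, 8])
feat = siren_features(layers, 0.25, -0.5, 0.0)
```

Because the input is a real-valued coordinate rather than a patch index, the same function can be queried between patch centres, which is the property the method relies on for sub-patch feature lookups.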

B. Flow-Guided Matching Correspondence

To handle large deformations and improve matching reliability, the method learns an implicit deformation field.

Displacement SIREN: A separate lightweight network $g_\phi(x, y)$ predicts per-coordinate 2D displacements $(\Delta x, \Delta y)$ between a source and target frame.
Optimization: The displacement field is optimized to minimize the feature distance between the source point and the displaced target point. It is regularized by a Total Variation (TV) term that encourages smoothness and an L1 term that discourages unnecessary deformation.
Matching Strategy: The predicted displacement serves as a spatial prior. Final correspondences are determined by maximizing the cosine similarity of features within a Gaussian-weighted region centered on the flow-predicted location. This combines the global search capability of flow with the local discriminative power of features.
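The matching strategy above can be sketched as follows. This is a minimal sketch, not the authors' code: it assumes the Gaussian prior is applied by multiplying it with the cosine similarity, searches a small square grid of candidates, and uses a hypothetical `toy_target_features` function as a stand-in for the fitted feature field. The parameters `sigma`, `radius`, and `step` are illustrative.

```python
import math

def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    dot = sum(p * q for p, q in zip(a, b))
    norm_a = math.sqrt(sum(p * p for p in a))
    norm_b = math.sqrt(sum(q * q for q in b))
    return dot / (norm_a * norm_b + 1e-12)

def flow_guided_match(src_feat, target_feat_fn, src_xy, flow_xy,
                      sigma=2.0, radius=6, step=1.0):
    """Return the candidate with the best Gaussian-weighted cosine score."""
    cx = src_xy[0] + flow_xy[0]  # centre of the search = flow-predicted location
    cy = src_xy[1] + flow_xy[1]
    best_score, best_xy = -math.inf, (cx, cy)
    n = int(radius / step)
    for i in range(-n, n + 1):
        for j in range(-n, n + 1):
            x, y = cx + i * step, cy + j * step
            # Spatial prior: candidates far from the flow prediction are down-weighted.
            weight = math.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2 * sigma ** 2))
            score = weight * cosine(src_feat, target_feat_fn(x, y))
            if score > best_score:
                best_score, best_xy = score, (x, y)
    return best_xy, best_score

def toy_target_features(x, y):
    """Hypothetical feature field whose best match for [1, 0] sits at (12, 9)."""
    g = math.exp(-((x - 12.0) ** 2 + (y - 9.0) ** 2) / 2.0)
    return [g, 1.0 - g]

# The flow predicts (11, 10); feature similarity refines the match to (12, 9).
best_xy, best_score = flow_guided_match([1.0, 0.0], toy_target_features,
                                        src_xy=(10.0, 10.0), flow_xy=(1.0, 0.0))
```

In this toy setup the flow prediction is slightly off, and the feature term corrects it. A narrower `sigma` would trust the flow more, and a wider one would trust the features more.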

C. Mask Propagation via Interior Point Method

Rather than propagating only boundary points (which is prone to noise), the method uses an interior point approach:

Extraction: Dense interior points are sampled from the source binary mask using the Euclidean Distance Transform (EDT).
Propagation: All interior points are propagated to the target frame using the flow-guided matching strategy.
Reconstruction: The target mask is reconstructed using Kernel Density Estimation (KDE) on the propagated points, followed by thresholding. This provides robustness against individual point mismatches.
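The three steps above can be sketched in plain Python on a small grid. Several details here are assumptions rather than facts from the paper: interior points are taken to be pixels whose distance to the background is at least `min_dist`, the distance transform is computed by brute force, propagation is replaced by a fixed shift standing in for flow-guided matching, and the KDE uses a Gaussian kernel with an illustrative `bandwidth` and a threshold relative to the peak density.

```python
import math

def edt(mask):
    """Brute-force Euclidean distance from each foreground pixel to the nearest background pixel."""
    h, w = len(mask), len(mask[0])
    background = [(r, c) for r in range(h) for c in range(w) if not mask[r][c]]
    return [[min(math.hypot(r - br, c - bc) for br, bc in background) if mask[r][c] else 0.0
             for c in range(w)] for r in range(h)]

def interior_points(mask, min_dist=2.0):
    """Keep foreground pixels at least `min_dist` away from the background (assumed rule)."""
    dist = edt(mask)
    return [(r, c) for r, row in enumerate(dist) for c, d in enumerate(row) if d >= min_dist]

def kde_mask(points, shape, bandwidth=1.5, threshold=0.3):
    """Gaussian KDE over propagated points, thresholded relative to the peak density."""
    h, w = shape
    density = [[sum(math.exp(-((r - pr) ** 2 + (c - pc) ** 2) / (2 * bandwidth ** 2))
                    for pr, pc in points)
                for c in range(w)] for r in range(h)]
    peak = max(max(row) for row in density)
    return [[1 if d >= threshold * peak else 0 for d in row] for row in density]

# Source mask: an 8x8 square on a 16x16 grid.
source = [[1 if 2 <= r <= 9 and 3 <= c <= 10 else 0 for c in range(16)] for r in range(16)]
pts = interior_points(source)

# Stand-in for flow-guided matching: shift every point, then add one stray mismatch.
moved = [(r + 3.0, c + 2.0) for r, c in pts] + [(0.0, 15.0)]
pred = kde_mask(moved, (16, 16))
```

The single stray point at the grid corner contributes too little density to pass the threshold, so it leaves no mark on the reconstructed mask, which is the kind of robustness to individual mismatches described above.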

3. Key Contributions

Unified Framework: The first lightweight framework to support unified propagation of both points and masks for both intra-video and inter-video scenarios.
Implicit Neural Features: Introduction of a test-time SIREN-based optimization to upsample foundation model features into a continuous, high-resolution spatiotemporal field, enabling smooth feature queries.
Flow-Guided Strategy: A novel matching mechanism that uses a learned implicit deformation field as a prior to guide feature matching, significantly improving robustness in low-texture medical scenes.
Efficiency: The system is lightweight, training on consumer hardware (RTX 4090) in minutes per video without requiring large-scale pre-training on specific medical data.

4. Experimental Results

The method was evaluated on three challenging clinical ultrasound datasets: EchoNet-Dynamic (cardiac), MSK-POI, and MSK-Bone (musculoskeletal).

Inter-Video Propagation (Cross-Video)

Point Matching: Match4Annotate achieved State-of-the-Art (SOTA) performance, outperforming dense feature matching baselines (RoMa, DIFT, MATCHA) on PCK (Percentage of Correct Keypoints) metrics, particularly at coarser thresholds.
Mask Propagation: Using only a single source frame (1-shot), it matched the performance of UniverSeg using 5–10 shots and significantly outperformed all 1-shot segmentation baselines (e.g., Matcher, UniverSeg 1-shot) in Dice scores.
Key Insight: It successfully generalized annotations across different subjects, a task where standard trackers fail.
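For readers unfamiliar with the two metrics used above, here is a minimal sketch of how PCK and Dice are commonly computed. The pixel-distance threshold for PCK is an assumption: benchmarks often normalize the threshold by image or object size, and the normalization used in these experiments is not stated here. The sample points and masks are made up for illustration.

```python
import math

def pck(pred, truth, threshold):
    """Fraction of predicted keypoints within `threshold` of their ground-truth location."""
    hits = sum(1 for p, g in zip(pred, truth) if math.dist(p, g) <= threshold)
    return hits / len(truth)

def dice(mask_a, mask_b):
    """Dice = 2|A ∩ B| / (|A| + |B|) for binary masks given as nested lists of 0/1."""
    inter = sum(1 for row_a, row_b in zip(mask_a, mask_b)
                for a, b in zip(row_a, row_b) if a and b)
    total = sum(map(sum, mask_a)) + sum(map(sum, mask_b))
    return 2 * inter / total if total else 1.0  # two empty masks count as a perfect match

# Made-up predictions with errors of 2, 1, and 10 pixels.
pred_pts = [(10, 10), (20, 21), (30, 40)]
true_pts = [(10, 12), (20, 20), (30, 30)]
pck_coarse = pck(pred_pts, true_pts, 3.0)  # two of three points are within 3 px
pck_fine = pck(pred_pts, true_pts, 1.0)    # one of three points is within 1 px

mask_a = [[1, 1, 0], [0, 1, 0]]
mask_b = [[1, 0, 0], [0, 1, 1]]
overlap = dice(mask_a, mask_b)             # 2 * 2 / (3 + 3)
```

Reporting PCK at several thresholds, as the results above do, separates methods that are roughly right (high PCK at coarse thresholds) from those that are precisely right (high PCK at fine thresholds).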

Intra-Video Propagation (Within-Video)

Performance: While specialized trackers (CoTracker3, TAPNext) slightly outperformed Match4Annotate in pure point tracking on some datasets, Match4Annotate remained competitive.
Trade-off: It offers a unified pipeline that handles both points and masks and supports cross-video transfer, whereas specialized trackers usually handle only one modality or require per-video setup.

Ablation Studies

Flow Prior: Removing the learned displacement field caused a significant drop in performance, confirming the necessity of the flow prior for handling anatomical deformations.
Implicit Representation: The continuous SIREN representation generalized better across videos than direct high-resolution DINOv3 features. The direct features preserved fine temporal details but lacked the smoothness needed for cross-video matching.

5. Significance and Impact

Scalability: Match4Annotate addresses the "annotation bottleneck" in medical AI by enabling the transfer of expert labels from a few annotated frames/videos to large, unannotated datasets.
Accessibility: By operating efficiently on consumer hardware with test-time optimization, it lowers the barrier to entry for deploying annotation tools in specialized domains without massive compute resources.
Robustness: The combination of implicit neural representations and flow guidance provides a robust solution for low-texture, low-contrast medical imaging where traditional feature matching fails.
Broader Impact: The method has the potential to democratize large-scale video analysis in medical imaging by breaking the linear scaling of labeling cost with dataset size and accelerating the development of clinical computer vision tools.

Limitations: The method relies on smoothness priors, which may struggle with the rapid, large displacements found in natural RGB videos. It also does not explicitly handle occlusions and may require adaptation for non-ultrasound imaging modalities.