EdgeDAM: Real-time Object Tracking for Mobile Devices

Imagine you are playing a game of "Hide and Seek" with a friend in a crowded, chaotic marketplace. Your goal is to keep your eyes locked on your friend (the target) as they move through the crowd, even when they get blocked by stalls, run behind other people, or when there are people who look exactly like them (the distractors).

This is the challenge of Object Tracking in computer vision. For a long time, computers have struggled with this on mobile phones because the "smart" ways to do it are too heavy (like trying to carry a library in your pocket), and the "light" ways are too easily confused (like a child who loses focus when a shiny toy appears).

Enter EdgeDAM, a new system designed to be the ultimate "smart tracker" that fits right in your pocket. Here is how it works, broken down into simple concepts:

1. The Problem: The "Heavy" vs. The "Fragile"

The Heavyweights: Some tracking systems are like super-scientists. They take a photo of your friend, analyze every pixel, and trace a pixel-perfect outline of them that they keep in memory. If your friend disappears behind a wall, the scientist remembers exactly what they looked like and finds them again.
- The Catch: This takes so much computing power that it manages only 2 to 8 frames per second, even on a powerful GPU. On a phone, it would be too slow to be useful.
The Lightweights: Other systems are like fast runners. They just guess where your friend will be next based on their speed. They are fast enough to count as real time (roughly 10 to 15 frames per second).
- The Catch: If your friend hides behind a wall, the runner just keeps running in the wrong direction. If a stranger who looks like your friend walks by, the runner mistakes them for your friend and loses the real one.

EdgeDAM is the Smart Detective. It combines the speed of the runner with the memory of the scientist, but it does it in a way that doesn't require a super-computer.

2. The Secret Sauce: The "Dual-Buffer Memory"

Instead of trying to remember every single detail of your friend's face (which is heavy), EdgeDAM uses a clever two-notebook system called DAM (Distractor-Aware Memory):

Notebook A: The "Recent History" (RAM, short for Recent-Aware Memory)
Think of this as a sticky note on your fridge. It only holds the last few moments of your friend's movement. It checks: "Is this person in the right place, and are they about the right size?" If yes, it keeps the note. If a stranger walks by, the sticky note says, "Nope, wrong place or wrong size," and ignores them. This keeps the tracker from getting confused by things that just happen to look similar for a split second.
Notebook B: The "Safe Haven" (DRM, short for Distractor-Resolving Memory)
Think of this as a photo album kept in a safe. This notebook stores the "best, most stable" versions of your friend (like when they were standing still and clearly visible).
- How it helps: If your friend gets completely hidden behind a wall for a long time, the "Recent History" (Notebook A) might get confused. But EdgeDAM opens the "Photo Album" (Notebook B). It looks at the stored photos and says, "Ah! I remember what they looked like before they vanished. I'll use that to find them again."
- The "Bad Guy" List: Crucially, if a stranger (distractor) tricks the system, EdgeDAM writes their name on a "Do Not Pick" list. If that same stranger shows up again, the system immediately rejects them, even if they look a bit like your friend.

3. The "Freeze and Expand" Trick

Imagine your friend is hiding behind a large truck. You can't see them. A normal tracker might panic and start guessing wildly, or it might lock onto the truck and think the truck is your friend.

EdgeDAM has a special move called Held-Box Stabilization:

When it loses sight of your friend, instead of panicking, it "freezes" the last known position.
It then gently expands the box (like a safety net) around that spot.
It waits patiently, ignoring the crowd, until your friend pops out. This prevents the system from accidentally grabbing a random passerby while it's waiting for the real target to reappear.

4. Why is this a Big Deal?

No Heavy Lifting: It doesn't need to analyze complex "masks" (pixel-perfect outlines) like the heavy scientists do. It just uses simple shapes (boxes) and basic colors. This makes it incredibly fast.
Real-Time on a Phone: The authors tested this on an iPhone 15 Pro Max, where it runs at 25 frames per second. That is close to the frame rate of ordinary video, so the tracker keeps up with a live camera feed without visible lag.
The Results: On DiDi, a difficult benchmark where objects are hidden or surrounded by look-alikes, EdgeDAM scored an overlap (IoU) of 0.882, or 88.2%. Compare that to the heavy "super-scientist" systems, which often struggle to run on phones at all, and to the fast systems, which get confused easily.

The Bottom Line

EdgeDAM is like giving your phone a pair of eyes that are both fast and smart. It knows when to trust its gut (speed) and when to check its memory (reliability). It can follow your friend through a crowded market, ignore the people who look like them, and wait patiently if they hide, all while running smoothly on your pocket-sized device.

It solves the age-old problem of "How do we make a computer see clearly without making it slow?" by realizing that sometimes, you don't need a super-computer to find a friend; you just need a good memory and a little bit of patience.

1. Problem Statement

Visual Object Tracking (VOT) on edge devices (e.g., mobile phones) faces a critical trade-off between robustness and efficiency:

Robustness Gap: State-of-the-art trackers (e.g., SAM2.1++, SAMURAI) utilize memory-augmented mechanisms and segmentation masks to handle occlusion and distractors effectively. However, they rely on dense cross-frame attention and high-resolution feature fusion, resulting in high computational overhead (2–8 FPS on GPUs) that prevents real-time deployment on mobile hardware.
Efficiency Gap: Lightweight trackers (e.g., EdgeTAM, OSTrack) achieve real-time speeds (10–15 FPS) but lack sophisticated memory mechanisms. Consequently, they suffer from "drift" and identity switches when targets are occluded or when visually similar distractors appear.
Core Challenge: Existing methods fail to simultaneously achieve high accuracy under occlusion/distractor interference and real-time performance on resource-constrained edge devices without relying on heavy segmentation supervision or large pretrained encoders.

2. Methodology: EdgeDAM Framework

EdgeDAM proposes a detection-guided, bounding-box tracking framework that replaces heavy segmentation-based memory with a lightweight, geometrically gated memory system. The architecture consists of three main components:

A. Detection Backbone & Propagation

Detector: Uses a single-class YOLOv11s backbone. All object categories are unified into a single target index to ensure category-agnostic tracking.
Tracker: Integrates the classical CSRT tracker (a discriminative correlation filter with channel and spatial reliability) to propagate the target between detection frames.
Adaptive Scheduling: To save computation, detection is not performed on every frame. During stable tracking it runs every third frame (stride $\Delta=3$) and is restricted to a Region of Interest (ROI) around the current estimate. Full-frame detection is triggered only when occlusion or ambiguity is detected.
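The scheduling rule can be sketched in a few lines of Python. This is a minimal illustration and not the authors' implementation: the names `DetectionPlan`, `roi_around`, and `plan_detection`, the `uncertain` flag, and the ROI scale of 2.0 are all assumptions made for the example. Only the stride of 3 and the ROI versus full-frame distinction come from the description above.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

Box = Tuple[float, float, float, float]  # (x, y, w, h) in pixels


@dataclass
class DetectionPlan:
    run_detector: bool   # False: rely on the propagating tracker alone
    roi: Optional[Box]   # None with run_detector=True means full frame


def roi_around(box: Box, frame_w: int, frame_h: int, scale: float = 2.0) -> Box:
    """Search window `scale` times the box size, clipped to the frame.

    The scale factor is a hypothetical choice for illustration.
    """
    x, y, w, h = box
    cx, cy = x + w / 2.0, y + h / 2.0
    rw, rh = w * scale, h * scale
    x0, y0 = max(0.0, cx - rw / 2.0), max(0.0, cy - rh / 2.0)
    x1 = min(float(frame_w), cx + rw / 2.0)
    y1 = min(float(frame_h), cy + rh / 2.0)
    return (x0, y0, x1 - x0, y1 - y0)


def plan_detection(frame_idx: int, box: Box, frame_w: int, frame_h: int,
                   uncertain: bool, stride: int = 3) -> DetectionPlan:
    """Decide whether to run the detector on this frame, and where."""
    if uncertain:
        # Occlusion or ambiguity: search the whole frame (assumed to
        # bypass the stride so recovery is not delayed).
        return DetectionPlan(True, None)
    if frame_idx % stride == 0:
        # Stable tracking: detect only every `stride`-th frame, inside an ROI.
        return DetectionPlan(True, roi_around(box, frame_w, frame_h))
    # In-between frames: no detection at all.
    return DetectionPlan(False, None)
```

With a stride of 3, two out of every three stable frames skip the detector entirely, and the remaining one processes only a cropped window, which is where most of the savings would come from.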

B. Dual-Buffer Distractor-Aware Memory (DAM)

Instead of storing dense mask tokens, EdgeDAM operates on bounding boxes using a dual-buffer structure inspired by but redesigned from SAM2.1++:

Recent-Aware Memory (RAM): Stores a short history of verified target states.
- Mechanism: Uses geometric gating (IoU and area consistency) to admit candidates. This prevents distractors from polluting the recent target model during short occlusions.
- Descriptor: Uses lightweight appearance descriptors (concatenated $\ell_2$-normalized grayscale patch and HSV color histogram) instead of deep features.
Distractor-Resolving Memory (DRM): Stores stable "anchor" entries for long-term recovery.
- Mechanism: Entries are promoted from RAM to DRM only if they show high appearance consistency over a time window (stability filter).
- Scoring: During recovery, DRM entries are scored using a weighted sum of IoU, appearance similarity (cosine similarity between descriptors), motion priors, and temporal decay.
- Distractor Penalization: A negative bank accumulates overlapping detections during occlusion. These are used to penalize DRM scores if a candidate matches a known distractor.
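The memory operations described above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the paper's code: every function name, threshold, and weight below (`min_iou`, `max_area_ratio`, `weights`, `decay`, `penalty`, `neg_thresh`) is hypothetical, and normalizing the two descriptor parts separately before concatenation is one possible reading of the description. Descriptors are plain lists of floats so the example needs no image library.

```python
import math
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x, y, w, h)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    iw = max(0.0, min(a[0] + a[2], b[0] + b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[1] + a[3], b[1] + b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0


def l2_normalize(v: Sequence[float]) -> List[float]:
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v] if n > 0 else list(v)


def build_descriptor(gray_patch: Sequence[float],
                     hsv_hist: Sequence[float]) -> List[float]:
    """Concatenate the normalized grayscale patch and HSV histogram."""
    return l2_normalize(gray_patch) + l2_normalize(hsv_hist)


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(x * x for x in v))
    dot = sum(p * q for p, q in zip(u, v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0


def ram_admits(candidate: Box, last: Box,
               min_iou: float = 0.3, max_area_ratio: float = 1.5) -> bool:
    """Geometric gate: enough overlap with the last state, similar area."""
    area_c, area_l = candidate[2] * candidate[3], last[2] * last[3]
    if area_c <= 0 or area_l <= 0:
        return False
    ratio = max(area_c, area_l) / min(area_c, area_l)
    return iou(candidate, last) >= min_iou and ratio <= max_area_ratio


def drm_score(cand_box: Box, cand_desc: Sequence[float],
              anchor_box: Box, anchor_desc: Sequence[float],
              anchor_age: float, motion_prior: float,
              negatives: Sequence[Sequence[float]],
              weights: Tuple[float, float, float, float] = (0.3, 0.4, 0.2, 0.1),
              decay: float = 0.05, penalty: float = 0.5,
              neg_thresh: float = 0.9) -> float:
    """Weighted sum of IoU, appearance, motion, and temporal decay terms,
    reduced by a fixed penalty if the candidate resembles a stored distractor."""
    w_iou, w_app, w_mot, w_time = weights
    score = (w_iou * iou(cand_box, anchor_box)
             + w_app * cosine(cand_desc, anchor_desc)
             + w_mot * motion_prior
             + w_time * math.exp(-decay * anchor_age))
    if any(cosine(cand_desc, n) >= neg_thresh for n in negatives):
        score -= penalty
    return score
```

Everything here is arithmetic on four box coordinates and short vectors, which is the point of the design: no attention layers or mask tokens are involved in the memory lookup.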

C. Confidence-Driven Switching & Held-Box Stabilization

Switching Logic: The system monitors the tracker's Peak-to-Sidelobe Ratio (PSR) and motion jumps. If confidence drops or occlusion is detected (multiple overlapping detections), the system switches from CSRT propagation to DAM-guided re-identification.
Held-Box Mechanism: During occlusion, the system does not immediately guess a new location. Instead, it "freezes" the current estimate and expands the bounding box based on the spatial extent of overlapping detections. This suppresses distractor contamination while waiting for a valid recovery signal.
Recovery Cascade:
1. Stage 1: Score DRM anchors; reinitialize if a high-confidence match is found.
2. Stage 2: If DRM fails, select the detection most consistent with appearance and motion priors.
3. Stage 3: Perform a normalized cross-correlation (NCC) template search if previous stages fail.
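The switching test, the held box, and the three-stage cascade can be sketched together. This is a hedged illustration only: the thresholds `psr_min` and `jump_max`, the `margin` parameter, the use of a plain union of boxes for the expansion, and all function names are assumptions for the example. The PSR formula is the standard one for correlation filters (peak minus sidelobe mean, divided by sidelobe standard deviation), computed here on a hypothetical response map supplied by the caller.

```python
import statistics
from typing import Callable, Optional, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x, y, w, h)
Stage = Callable[[], Optional[Box]]      # returns a box, or None on failure


def psr(response: Sequence[Sequence[float]], exclude: int = 1) -> float:
    """Peak-to-Sidelobe Ratio of a 2D correlation response map.

    Cells within `exclude` of the peak (in both axes) are left out
    of the sidelobe statistics.
    """
    rows, cols = len(response), len(response[0])
    peak, pr, pc = max((response[r][c], r, c)
                       for r in range(rows) for c in range(cols))
    sidelobe = [response[r][c] for r in range(rows) for c in range(cols)
                if abs(r - pr) > exclude or abs(c - pc) > exclude]
    if len(sidelobe) < 2:
        return 0.0
    sd = statistics.pstdev(sidelobe)
    return (peak - statistics.fmean(sidelobe)) / sd if sd > 0 else float("inf")


def needs_reid(psr_value: float, motion_jump: float, n_overlaps: int,
               psr_min: float = 7.0, jump_max: float = 40.0) -> bool:
    """Leave tracker propagation when confidence drops, the box jumps,
    or multiple detections overlap the target (thresholds are made up)."""
    return psr_value < psr_min or motion_jump > jump_max or n_overlaps >= 2


def held_box(last: Box, overlapping: Sequence[Box], margin: float = 0.0) -> Box:
    """Freeze `last` and grow it to cover the overlapping detections."""
    boxes = [last, *overlapping]
    x0, y0 = min(b[0] for b in boxes), min(b[1] for b in boxes)
    x1 = max(b[0] + b[2] for b in boxes)
    y1 = max(b[1] + b[3] for b in boxes)
    w, h = x1 - x0, y1 - y0
    return (x0 - margin * w / 2, y0 - margin * h / 2,
            w * (1 + margin), h * (1 + margin))


def recover(stages: Sequence[Tuple[str, Stage]]
            ) -> Tuple[Optional[str], Optional[Box]]:
    """Try each recovery stage in order; the first success wins."""
    for name, stage in stages:
        box = stage()
        if box is not None:
            return name, box
    return None, None  # keep holding the box and try again next frame
```

A caller would pass the stages in the order listed above, for example `recover([("drm", try_drm), ("detection", try_detection), ("ncc", try_ncc)])`, where the three callables are placeholders for the DRM anchor match, the prior-consistent detection, and the NCC template search.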

3. Key Contributions

Bounding-Box DAM: A novel memory module that replaces dense segmentation masks with compact geometric and appearance descriptors, eliminating the need for transformer-based attention or mask supervision.
Dual-Buffer Strategy: The separation of RAM (for temporal consistency) and DRM (for long-term recovery with distractor penalization) allows for robust handling of occlusion without the computational cost of global memory fusion.
Held-Box Stabilization: A post-processing strategy that adaptively freezes and expands the bounding box during uncertainty, preventing identity switches caused by distractors.
Edge-Optimized Design: A detector-agnostic framework (compatible with YOLOv8–v26) that achieves real-time performance on mobile devices by utilizing lightweight descriptors and adaptive inference scheduling.

4. Experimental Results

EdgeDAM was evaluated on six benchmarks (DiDi, VOT2020, VOT2022, LaSOT, LaSOT$_{ext}$, GOT-10k) and deployed on an iPhone 15 Pro Max.

Performance on DiDi (Distractor-Focused):
- Achieved 0.926 Quality, 0.882 IoU, and 0.973 Robustness.
- Outperformed the previous SOTA (SAM2.1++) by +23.2% in quality and +15.5% in IoU.
VOT Benchmarks:
- VOT2020: EAO of 0.849 (vs. 0.729 for SAM2.1++).
- VOT2022: EAO of 0.790, surpassing the challenge winner MSAOT (0.673).
Generalization (LaSOT/LaSOT$_{ext}$/GOT-10k):
- Achieved 0.895 AUC on LaSOT and 0.831 AO on GOT-10k, demonstrating strong category-agnostic generalization.
Efficiency:
- Runs at 25 FPS on an iPhone 15 Pro Max.
- Model size is only 9.4M parameters, significantly smaller than SAM2.1++ (224M) or SAMURAI (82.69M).
Ablation Studies: Confirmed that the combination of the detector, RAM, DRM, and held-box post-processing is essential; removing any component significantly degrades recovery rates and IoU.
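To put the efficiency figures above in perspective, the following back-of-the-envelope arithmetic uses only the numbers quoted in this summary. The per-frame time is an average budget, since the detector does not run on every frame.

$$
t_{\text{frame}} = \frac{1000\ \text{ms}}{25} = 40\ \text{ms}, \qquad \frac{224}{9.4} \approx 23.8, \qquad \frac{82.69}{9.4} \approx 8.8
$$

In other words, the whole pipeline has about 40 ms per frame on average, and by parameter count the model is roughly 24 times smaller than SAM2.1++ and about 9 times smaller than SAMURAI.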

5. Significance

EdgeDAM bridges the critical gap between robustness and efficiency in visual object tracking. By reformulating memory mechanisms to operate on bounding boxes with lightweight geometric gating, it demonstrates that high-performance tracking under occlusion and distractor interference does not require heavy segmentation pipelines or massive computational resources. According to the authors, this makes it the first method to achieve real-time, occlusion-resilient tracking on mobile devices while outperforming heavy, server-grade trackers in accuracy. The work opens new avenues for deploying advanced computer vision tasks directly on edge hardware without relying on cloud processing.