GOT-Edit: Geometry-Aware Generic Object Tracking via Online Model Editing

Imagine you are playing a game of "Where's Waldo?" in a crowded, chaotic video. You are trying to keep your eyes on Waldo as he moves through a busy market.

The Problem with Current Trackers:
Most computer programs trying to do this are like a person who has never left their house. They only know what things look like on a flat piece of paper. If Waldo puts on a hat, or if a tree branch briefly blocks your view, the computer gets confused. It thinks, "Oh, the hat is gone, so Waldo must be gone," or it mistakes a mannequin in a shop window for Waldo because they look similar. It lacks "common sense" about how the world works in 3D.

The Human Advantage:
Humans are great at this because we have a secret weapon: 3D intuition. Even if we only see a flat video, our brains automatically know that objects have depth and volume, and that something hidden behind a wall is still there. We use "prior knowledge" to guess where Waldo is, even when we can't see him perfectly.

The Solution: GOT-Edit
The paper introduces a new system called GOT-Edit. Think of it as giving the computer a "brain transplant" that teaches it how to see in 3D, even though it's only looking at a 2D video.

Here is how it works, using a simple analogy:

1. The Two Brains (Semantics vs. Geometry)

Imagine the computer has two ways of thinking:

The "Look-Alike" Brain (Semantics): This brain is very good at recognizing faces and colors. "That's Waldo because he has a red and white striped shirt." But it gets confused if the shirt is covered.
The "Architect" Brain (Geometry): This brain understands shapes, depth, and space. "That object has volume and is sitting on the ground, so it's likely a person, not a flat poster."

2. The Conflict

The researchers tried to just smash these two brains together. But it was like trying to mix oil and water. When they forced the "Architect" brain to talk to the "Look-Alike" brain, the "Look-Alike" brain got confused and started making mistakes. The computer forgot what Waldo looked like because it was too focused on the 3D shapes.

3. The Magic Trick: "Online Model Editing"

This is the core innovation of the paper. Instead of forcing the two brains to merge into a messy blob, the researchers use a technique called Null-Space Editing.

Think of the "Look-Alike" brain's knowledge as a solid, unbreakable glass sculpture. You don't want to melt it or crack it, because that's how the computer recognizes Waldo's shirt.

Now, imagine the "Architect" brain wants to add new information (like "Waldo is behind a fence"). Instead of smashing the glass, the researchers use a special tool to carve a tiny, invisible channel inside the glass.

They take the 3D information.
They run it through a filter (the "Null-Space Constraint") that ensures it only fits into the empty spaces where the 3D info doesn't contradict the 2D look.
It's like adding a secret layer of 3D intuition underneath the 2D recognition without changing the surface.

4. The Result: A Super-Tracker

Because of this "surgery," the computer can now:

See through occlusions: If Waldo is behind a tree, the 3D brain knows he's still there, and the 2D brain doesn't panic.
Ignore distractions: If there's a mannequin that looks like Waldo, the 3D brain says, "That's a flat statue, not a real person," and the tracker ignores it.
Stay calm: It doesn't get confused when the lighting changes or the object moves fast.

The "Online" Part

Most computer updates happen in a lab, like updating a phone app once a year. GOT-Edit is "online," meaning it learns and adjusts while it is watching the video. It's like a human who learns from every second of the video in real-time, instantly realizing, "Oh, the light just changed, I need to adjust my guess," without ever stopping the game.

In Summary

GOT-Edit is like teaching a computer to stop looking at a video like a flat photograph and start seeing it like a human does, with depth and common sense. It does this by carefully adding 3D "common sense" to the computer's brain without breaking its ability to recognize what things look like. The result is a tracker that is much harder to fool, even in messy, crowded, or tricky situations.

1. Problem Statement

Generic Object Tracking (GOT) aims to track an arbitrary user-specified target object in 2D video streams based on an initial bounding box. While human perception leverages implicit 3D geometric knowledge and semantic reasoning to handle challenges like partial occlusion, distractors, and deformations, most existing GOT methods rely solely on 2D features.

Limitations of Current Approaches:
- 2D Reliance: Most trackers are trained on 2D datasets, limiting their ability to reason about spatial relationships and object boundaries in complex scenes.
- 3D Data Dependency: Existing attempts to incorporate 3D information often require auxiliary inputs (e.g., RGB-D sensors, point clouds), which are impractical for standard 2D video streams.
- Naive Fusion: Simply fusing 2D semantic features with 3D geometric features often degrades performance because the geometric cues can overwhelm or distort the dominant semantic information required for object discrimination.

2. Methodology: GOT-Edit

The authors propose GOT-Edit, an online cross-modality model editing framework that integrates geometry-aware cues into a generic object tracker using only 2D video streams. The core innovation is the use of Online Model Editing with Null-Space Constraints to balance geometric and semantic information.

Key Components:

Feature Extraction:
- Semantic Features: Extracted using a pre-trained DINOv2 (ViT-L) backbone.
- Geometric Features: Extracted using the Visual Geometry Grounded Transformer (VGGT). VGGT infers camera pose, point maps, and depth from 2D images, providing 3D geometric cues without requiring 3D sensors.
Feature Alignment and Fusion:
- Geometric features are aligned to match the resolution and dimensionality of semantic features via a convolutional network.
- A gating mechanism fuses these features, creating an enriched representation that serves as input for the tracker.
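
The alignment and gating steps can be illustrated with a small NumPy sketch. This is only an illustration under stated assumptions, not the authors' implementation: the summary does not specify the gate's form, so a sigmoid gate with a residual connection is assumed here, a nearest-neighbour resize plus a per-position linear map (equivalent to a 1x1 convolution) stands in for the convolutional alignment network, and the names `align_geometry`, `gated_fusion`, `w_proj`, `w_gate`, and `b_gate` are hypothetical.

```python
import numpy as np


def sigmoid(x: np.ndarray) -> np.ndarray:
    return 1.0 / (1.0 + np.exp(-x))


def align_geometry(geo: np.ndarray, out_hw: tuple, w_proj: np.ndarray) -> np.ndarray:
    """Resize geo (Hg, Wg, Cg) to out_hw, then map its channels to the semantic width.

    Assumption: nearest-neighbour resize + a 1x1-convolution-like projection
    stand in for the paper's convolutional alignment network.
    """
    hg, wg, _ = geo.shape
    h, w = out_hw
    rows = np.arange(h) * hg // h      # nearest source row for each target row
    cols = np.arange(w) * wg // w      # nearest source column for each target column
    resized = geo[rows][:, cols]       # (h, w, Cg)
    return resized @ w_proj            # (h, w, Cs)


def gated_fusion(sem, geo_aligned, w_gate, b_gate):
    """Add aligned geometry to semantics through a per-channel gate in (0, 1).

    Assumption: the gate is a sigmoid over the concatenated features.
    """
    gate_in = np.concatenate([sem, geo_aligned], axis=-1)  # (h, w, 2*Cs)
    gate = sigmoid(gate_in @ w_gate + b_gate)              # (h, w, Cs)
    return sem + gate * geo_aligned, gate


# Toy shapes only; real backbone features are much larger.
rng = np.random.default_rng(0)
sem = rng.normal(size=(18, 18, 16))            # semantic feature map
geo = rng.normal(size=(9, 9, 8))               # coarser geometric feature map
w_proj = 0.1 * rng.normal(size=(8, 16))
w_gate = 0.1 * rng.normal(size=(32, 16))
b_gate = np.zeros(16)

geo_aligned = align_geometry(geo, sem.shape[:2], w_proj)
fused, gate = gated_fusion(sem, geo_aligned, w_gate, b_gate)
```

With this residual form, a gate close to zero leaves the semantic features essentially unchanged, so the gate controls how much geometry is allowed in at each position and channel.
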
Online Model Editing with Null-Space Constraints (The Core Innovation):
- Inspiration: The method is inspired by AlphaEdit, which edits model weights to inject new knowledge while preserving old knowledge.
- Mechanism:
  - The tracker employs two model predictors: one for Semantic Weights ($W_{sem}$) and one for Geometric Perturbation Weights ($\Delta$).
  - To prevent the geometric information from corrupting the established semantic understanding (which is crucial for distinguishing the target from distractors), the geometric perturbation is projected into the null space of the semantic features.
  - Mathematical Formulation: The final weights applied to the localization head are $W_{final} = W_{sem} + P_{null}\Delta$, where $P_{null}$ is the null-space projection matrix computed via Singular Value Decomposition (SVD) of the semantic features.
  - Result: This ensures that the geometric updates ($\Delta$) only affect the dimensions of the feature space where semantic information is minimal (the null space), thereby preserving semantic discrimination while adding geometric robustness.
Architecture:
- Built upon the ToMP (Transformer-based Model Prediction) framework.
- Uses a Transformer encoder-decoder to generate model weights dynamically for each frame based on reference frames and the current frame.
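
The null-space edit $W_{final} = W_{sem} + P_{null}\Delta$ described above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' code, and it rests on several assumptions: the weights are treated as a single column vector of length C, the semantic features are stacked as an (N, C) matrix with N < C so that a non-trivial null space exists, and the helper names `null_space_projector` and `edit_weights` as well as the tolerance `rel_tol` are hypothetical.

```python
import numpy as np


def null_space_projector(sem_feats: np.ndarray, rel_tol: float = 1e-6) -> np.ndarray:
    """Return the (C, C) orthogonal projector onto the null space of sem_feats.

    sem_feats has shape (N, C): N semantic feature vectors with C channels.
    """
    # Full SVD so the rows of vt span all of R^C, including null directions.
    _, s, vt = np.linalg.svd(sem_feats, full_matrices=True)
    # Numerical rank: singular values above a relative threshold.
    rank = int(np.sum(s > rel_tol * s.max())) if s.size else 0
    null_basis = vt[rank:].T           # (C, C - rank), orthonormal columns
    return null_basis @ null_basis.T   # symmetric and idempotent


def edit_weights(w_sem: np.ndarray, delta: np.ndarray, sem_feats: np.ndarray) -> np.ndarray:
    """Compute W_final = W_sem + P_null @ Delta."""
    p_null = null_space_projector(sem_feats)
    return w_sem + p_null @ delta


# Toy data: 8 semantic feature vectors in a 32-dimensional feature space.
rng = np.random.default_rng(0)
sem_feats = rng.normal(size=(8, 32))
w_sem = rng.normal(size=32)     # stand-in for the predicted semantic weights
delta = rng.normal(size=32)     # stand-in for the predicted geometric perturbation

w_final = edit_weights(w_sem, delta, sem_feats)

# Responses on the semantic features are unchanged by the projected edit,
# whereas adding delta without projection would shift them.
preserved = np.allclose(sem_feats @ w_final, sem_feats @ w_sem)
naive_shift = np.abs(sem_feats @ (w_sem + delta) - sem_feats @ w_sem).max()
```

Because `sem_feats @ p_null` is numerically zero, the edited weights respond to the stored semantic features exactly as $W_{sem}$ does, and the perturbation only changes responses to feature directions outside their span. That is the sense in which the edit adds geometric information without overwriting semantic discrimination.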

3. Key Contributions

First 2D-to-3D Integration via Editing: Introduces the first framework to integrate 3D geometric reasoning into generic object tracking using only 2D streaming inputs, eliminating the need for RGB-D sensors or point clouds.
Null-Space Constrained Online Editing: Proposes a novel online model editing technique that adaptively incorporates 3D geometric knowledge without degrading the dominant 2D semantic features. This addresses the semantic degradation (a form of "catastrophic forgetting") that naive fusion strategies tend to cause.
State-of-the-Art Performance: Demonstrates that this approach unlocks geometric knowledge missing in existing 2D trackers, leading to superior robustness in occlusion, clutter, and complex geometric scenarios.

4. Experimental Results

The method was evaluated on multiple standard benchmarks (GOT-10k, LaSOT, TrackingNet, NfS, OTB, AVisT, VOT).

Overall Performance: GOT-Edit achieves superior results compared to state-of-the-art trackers (e.g., ToMP, LoRAT, PiVOT, MCITrack).
- On GOT-10k, it achieves an Area Under Curve (AUC) of 80.2% (vs. 76.9% for ToMP-378).
- On LaSOT, it achieves 91.0% precision (vs. 90.0% for PiVOT).
- On OTB, it consistently outperforms competitors in Success Rate (SUC).
Robustness to Adverse Conditions:
- Significant improvements are observed in attributes related to 3D reasoning, such as Occlusion, Background Clutter, and Scale Variation.
- Ablation Studies:
  - Naive fusion of geometry and semantics (Row 2 in Table 4) actually decreases performance compared to the semantic-only baseline.
  - Applying the Null-Space Constraint (Row 5) recovers and improves performance, indicating that the constraint is needed to preserve semantic integrity.
  - Adding regularization and whitening (Row 6) further boosts performance.
Efficiency: While VGGT adds computational cost, the core model editing modules (alignment, fusion, predictors) are lightweight (approx. 9 ms at 252×252 resolution). The authors also demonstrate variants using StreamVGGT that reduce runtime by ~40% with minimal accuracy loss.

5. Significance and Impact

Paradigm Shift: GOT-Edit establishes a new paradigm for combining 2D semantics with 3D geometric reasoning without requiring multi-modal hardware. It mimics human perception by inferring 3D structure from 2D inputs to aid tracking.
Generalization: The framework generalizes well across diverse datasets and unseen object classes, suggesting that geometric cues are a broadly useful asset for tracking robustness.
Methodological Advance: The application of null-space constrained model editing to computer vision tasks (specifically tracking) offers a powerful tool for integrating auxiliary modalities (like geometry, depth, or audio) into existing models without retraining or losing original capabilities.
Practicality: By relying solely on 2D video, the method is immediately deployable in real-world scenarios where 3D sensors are unavailable, such as autonomous driving, surveillance, and robotics.

In conclusion, GOT-Edit successfully bridges the gap between 2D tracking and 3D reasoning, demonstrating that principled model editing can recover geometric information missed by purely 2D approaches, leading to significantly more robust and accurate object tracking.
