OVGGT: O(1) Constant-Cost Streaming Visual Geometry Transformer

Imagine you are trying to build a perfect 3D map of a city while driving a car, looking out the window frame by frame. This is what computers do when they try to turn a video into a 3D model.

For a long time, there were two main problems with doing this:

The "All-at-Once" Problem: The best methods tried to look at the entire video at once to get the most accurate map. But this is like trying to remember every single word of a 10-hour movie to understand the plot. It requires a brain (computer memory) so big that it crashes after just a few minutes of video.
The "Streaming" Problem: Newer methods tried to process the video as it happens, frame by frame. But they kept a "notebook" of everything they saw so far. As the video got longer, the notebook got thicker and heavier until the computer ran out of space and had to stop.

Enter OVGGT: The Smart, Infinite Memory Driver.

The paper introduces OVGGT, a new system that can watch a video forever without running out of memory, while still building a super-accurate 3D map. It does this using two clever tricks, which we can think of as a Smart Librarian and a Safety Net.

1. The Smart Librarian (Self-Selective Caching)

Imagine your computer's memory is a small desk. As you watch the video, new information (tokens) keeps arriving. If you keep everything on the desk, it gets cluttered and you can't work.

Old streaming methods just kept adding papers to the desk until it overflowed. OVGGT acts like a Smart Librarian:

The Scorecard: Instead of just grabbing the newest paper, the librarian looks at the "importance score" of every piece of information. It asks, "Does this part of the image have a cool texture? Is it a sharp edge? Is it a building corner?"
The Cleanup: If a piece of information is just a boring, blurry patch of sky that looks exactly like the one from 10 seconds ago, the librarian throws it in the trash to make room.
The Smoothing: Crucially, the librarian doesn't just throw away random pieces. If you throw away a piece of a wall, you need to throw away the whole wall, not just a single brick. OVGGT ensures it keeps "chunks" of the image together so the 3D map doesn't look like a shattered mosaic.

The Result: The desk stays the same size, no matter how long the video is, but it only holds the most important details.

2. The Safety Net (Dynamic Anchor Protection)

Here is the tricky part: Even if you keep the most important details, you might forget where you started. Imagine driving around a giant roundabout. If you only remember the trees you just passed, you might forget that you started at the North Gate. In 3D mapping, this causes "drift"—the map starts to warp and twist because the computer lost its sense of direction.

OVGGT solves this with Anchors:

The First Frame Anchor: The system permanently locks the very first frame of the video. It's like tying a rope to the starting point of your journey. No matter how far you go, you can always pull on that rope to remember where "Zero" is.
The Historical Anchors: As you drive further, the first frame might be too far away to see clearly. So, whenever the current view no longer overlaps enough with the last checkpoint, the system picks a new "checkpoint" (like a specific mountain peak or a unique building) and ties a new rope to it. These checkpoints are protected from being thrown away.

The Result: Even after watching 1,000 frames, the computer never loses its sense of direction. The map stays straight and true.

Why is this a big deal?

It's Free: You don't need to retrain the AI. It's a "plug-in" that works with existing models.
It's Infinite: You can feed it a 10-minute video, a 1-hour video, or a 10-hour video. The memory usage stays exactly the same.
It's Fast: Because it isn't trying to remember everything, it runs faster than the old methods.
It's Accurate: Surprisingly, by throwing away the boring stuff, the 3D map actually looks better than methods that tried to keep everything (which got confused by too much noise).

In a nutshell: OVGGT is like a driver who knows exactly which landmarks to remember to navigate a city forever, without needing a map the size of a skyscraper. It keeps the essential details, ties a safety rope to the start, and drives on indefinitely.

1. Problem Statement

Reconstructing 3D geometry from streaming video requires continuous inference under bounded hardware resources. While recent Geometric Foundation Models (e.g., VGGT) achieve state-of-the-art (SOTA) reconstruction quality using all-to-all attention, their quadratic computational and memory complexity ( $O(N^2)$ ) restricts them to short, offline sequences.

To address this, causal-attention variants (e.g., StreamVGGT) were developed to enable single-pass streaming by caching Key-Value (KV) pairs. However, these methods suffer from a critical bottleneck: the KV cache grows linearly with the sequence length.

Memory Exhaustion: Even with 32GB VRAM, standard causal models run out of memory (OOM) within a few hundred frames.
Performance Degradation: As the sequence lengthens, the per-step attention cost increases, reducing throughput (FPS).
Geometric Drift: Existing cache eviction strategies (e.g., random or simple pruning) often discard geometrically critical tokens, leading to drift in long trajectories.
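
The scaling argument above can be made concrete with a little arithmetic. The following is a minimal sketch, not the paper's code: `TOKENS_PER_FRAME` is a hypothetical value chosen only for illustration, the function names are invented here, and the 200K budget is the figure quoted later in this summary. It counts cached tokens and the query-key pairs scored per incoming frame, with and without a fixed budget.

```python
def cached_tokens(num_frames, tokens_per_frame, budget=None):
    """Tokens held in the KV cache after `num_frames` frames.

    budget=None models an unbounded causal cache; an integer models a fixed budget.
    """
    total = num_frames * tokens_per_frame
    return total if budget is None else min(total, budget)


def attention_pairs_per_frame(num_frames, tokens_per_frame, budget=None):
    """Query-key pairs scored when frame number `num_frames` arrives.

    The new frame's tokens attend to everything cached before it plus themselves.
    """
    history = cached_tokens(num_frames - 1, tokens_per_frame, budget)
    return tokens_per_frame * (history + tokens_per_frame)


if __name__ == "__main__":
    TOKENS_PER_FRAME = 1_000  # hypothetical value, not taken from the paper
    BUDGET = 200_000          # the 200K token budget quoted in this summary
    for n in (100, 500, 1_000, 10_000):
        unbounded = attention_pairs_per_frame(n, TOKENS_PER_FRAME)
        bounded = attention_pairs_per_frame(n, TOKENS_PER_FRAME, BUDGET)
        print(f"frame {n:>6}: unbounded={unbounded:>14,} bounded={bounded:>14,}")
```

Under these assumptions the unbounded per-frame cost keeps rising with the frame index, while the bounded cost stops changing once the budget is reached. That is the sense in which the per-step cost is constant.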

2. Methodology: OVGGT

The authors propose OVGGT, a training-free framework that bounds both memory and compute to a fixed budget regardless of sequence length, achieving O(1) constant-cost inference. It builds upon the StreamVGGT architecture but introduces two novel components:

A. Self-Selective Caching (SSC)

SSC compresses the KV cache to a fixed budget ( $B$ ) without requiring additional training or architectural changes. It leverages quantities already computed during the forward pass to determine token importance.

Activation Value Rating: Instead of using attention weights (which FlashAttention never materializes and therefore cannot expose), SSC uses the magnitude of the FFN (Feed-Forward Network) residuals.
- Mechanism: The FFN residual $\lambda \cdot \text{FFN}(\text{LN}(h))$ acts as a token-wise non-linear transformation. Its magnitude correlates with geometric salience (texture in shallow layers, geometry in mid-layers, semantic boundaries in deep layers).
- Benefit: Zero additional memory/compute overhead and full compatibility with FlashAttention.
Spatial Gaussian Smoothing: To prevent spatially fragmented retention (which degrades depth prediction), the activation scores are smoothed using a 2D Gaussian kernel. This encourages the retention of coherent token groups, preserving local geometric context.
Hybrid Scoring for Cache Compression:
- Current Tokens: Scored by FFN activation magnitude.
- Historical Tokens: Scored by Key-Vector Diversity (distance from the centroid key vector) to ensure diverse coverage of the scene.
- Selection: A weighted combination ( $\beta$ ) of these scores determines which tokens to evict, balancing immediate geometric importance with long-range distributional coverage.
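
The three scoring ingredients above can be sketched in a few lines of pure Python. This is an illustrative sketch under stated assumptions, not the authors' implementation. All function names are hypothetical, and `sigma` and `radius` are made-up smoothing parameters. The exact way $\beta$ combines the two scores is not given in this summary, so the sketch assumes each score set is normalized by its maximum and then weighted by $\beta$ (current tokens) and $1 - \beta$ (historical tokens).

```python
import math


def l2_norm(vec):
    return math.sqrt(sum(v * v for v in vec))


def activation_scores(ffn_residuals):
    """One score per current token: magnitude of its FFN residual vector."""
    return [l2_norm(r) for r in ffn_residuals]


def diversity_scores(keys):
    """One score per cached token: distance of its key from the centroid key."""
    dim = len(keys[0])
    centroid = [sum(k[d] for k in keys) / len(keys) for d in range(dim)]
    return [l2_norm([k[d] - centroid[d] for d in range(dim)]) for k in keys]


def gaussian_smooth(grid, sigma=1.0, radius=1):
    """Separable 2D Gaussian blur over a token-score grid, clamping at the borders."""
    raw = [math.exp(-(d * d) / (2.0 * sigma * sigma)) for d in range(-radius, radius + 1)]
    kernel = [w / sum(raw) for w in raw]
    height, width = len(grid), len(grid[0])

    def clamp(i, n):
        return max(0, min(n - 1, i))

    # Horizontal pass, then vertical pass over the horizontal result.
    horiz = [[sum(kernel[d + radius] * grid[y][clamp(x + d, width)]
                  for d in range(-radius, radius + 1))
              for x in range(width)] for y in range(height)]
    return [[sum(kernel[d + radius] * horiz[clamp(y + d, height)][x]
                 for d in range(-radius, radius + 1))
             for x in range(width)] for y in range(height)]


def normalize(scores):
    """Scale scores by their maximum (assumed normalization, not from the paper)."""
    if not scores:
        return []
    hi = max(scores)
    return [s / hi if hi > 0 else 0.0 for s in scores]


def select_tokens(current, historical, beta, budget):
    """Return (source, index) pairs for the `budget` highest-scoring tokens."""
    pool = [(beta * s, "current", i) for i, s in enumerate(normalize(current))]
    pool += [((1.0 - beta) * s, "historical", i)
             for i, s in enumerate(normalize(historical))]
    pool.sort(key=lambda item: item[0], reverse=True)
    return [(src, i) for _, src, i in pool[:budget]]
```

In use, the per-token activation scores of the current frame would be reshaped into their patch grid, passed through `gaussian_smooth`, flattened again, and handed to `select_tokens` together with the diversity scores of the cached tokens. Tokens not returned are the ones evicted.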

B. Dynamic Anchor Protection (DAP)

To prevent geometric drift over extended trajectories, DAP explicitly shields "anchor" tokens from eviction.

Global Initial Anchor: All tokens from the first frame are permanently protected ( $P_{init}$ ). This preserves the world-coordinate origin and ensures coordinate-system consistency throughout the entire sequence.
Historical Anchors: As the camera moves, the first frame may lose visual overlap. DAP adaptively registers Historical Anchors ( $P_{hist}$ ) based on view-overlap coverage.
- A new anchor is registered when the coverage ratio of the current view against the last anchor drops below a threshold ( $\tau$ ).
- Only the top- $\eta$ percentile of tokens (based on point cloud confidence) from these anchor frames are protected.
- A FIFO policy limits the number of active anchors to prevent unbounded growth.
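
The anchor bookkeeping described above can be sketched as a small registry. This is a hedged illustration rather than the paper's implementation. The class and method names are hypothetical, the coverage ratio is taken as an input because its computation is not described in this summary, and $\eta$ is assumed to be a fraction in (0, 1] of each anchor frame's tokens ranked by confidence.

```python
from collections import deque


class AnchorRegistry:
    """Tracks which tokens are shielded from eviction (illustrative sketch)."""

    def __init__(self, tau, eta, max_anchors):
        self.tau = tau  # coverage threshold below which a new anchor is registered
        self.eta = eta  # assumed: fraction of an anchor frame's tokens to protect
        # deque(maxlen=...) drops the oldest entry on overflow, giving FIFO behaviour.
        self.historical = deque(maxlen=max_anchors)

    def maybe_register(self, frame_id, coverage, confidences):
        """Register `frame_id` as an anchor if coverage against the last anchor is low.

        `confidences` holds one point-cloud confidence value per token of the frame.
        Returns True when a new anchor was registered.
        """
        if coverage >= self.tau:
            return False
        n_keep = max(1, int(len(confidences) * self.eta))
        ranked = sorted(range(len(confidences)),
                        key=lambda i: confidences[i], reverse=True)
        self.historical.append((frame_id, set(ranked[:n_keep])))
        return True

    def is_protected(self, frame_id, token_idx):
        """True if the token must never be evicted."""
        if frame_id == 0:
            return True  # global initial anchor: every first-frame token is kept
        return any(fid == frame_id and token_idx in tokens
                   for fid, tokens in self.historical)
```

An eviction routine would consult `is_protected` before discarding any token, so the first frame and the most confident tokens of the active historical anchors stay in the cache while the rest of the cache is compressed to the budget.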

3. Key Contributions

OVGGT Framework: The first training-free online streaming framework capable of processing arbitrarily long videos under fixed memory and compute constraints, eliminating the scaling bottleneck of causal-attention pipelines.
Self-Selective Caching (SSC): A novel cache management strategy using FFN residuals and spatial smoothing to compress the KV cache to a fixed budget while maintaining FlashAttention compatibility.
Dynamic Anchor Protection (DAP): A mechanism to shield coordinate-critical tokens (Global Initial + Historical Anchors) from eviction, effectively suppressing geometric drift in long-horizon scenarios.
SOTA Performance: Demonstrates state-of-the-art geometric accuracy on indoor, outdoor, and ultra-long sequence benchmarks, outperforming full-cache baselines in both accuracy and efficiency.

4. Experimental Results

Experiments were conducted on a single 32 GB NVIDIA RTX 5090 GPU across indoor (7-Scenes, NRGBD), outdoor (ETH3D), and ultra-long (Long3D, up to 10,000 frames) benchmarks.

Reconstruction Quality:
- OVGGT achieves SOTA results on the Accuracy (Acc), Completeness (Comp), and Normal Consistency (NC) metrics.
- Notably, OVGGT outperforms StreamVGGT (the full-cache baseline) in long sequences. This suggests that retaining all tokens introduces noise and redundancy that degrades quality; selective caching actually improves reconstruction.
- On 500-frame sequences, competing methods (StreamVGGT, Evict3R) suffer from OOM or severe geometric distortion, while OVGGT maintains sharp structures.
Efficiency (O(1) Cost):
- Memory: OVGGT maintains a constant VRAM footprint (approx. 10-12 GB for a 200K token budget) regardless of sequence length. In contrast, StreamVGGT exceeds 32 GB at ~200 frames.
- Throughput: OVGGT achieves the highest FPS (e.g., ~14.2 FPS on 500 frames) compared to baselines whose speed degrades as the cache grows.
Video Depth Estimation: OVGGT shows stable accuracy on long sequences (Bonn, KITTI), whereas other methods exhibit error accumulation.
Ablation Studies:
- Budget: A budget of 200K tokens gave the best balance between accuracy and memory usage.
- Smoothing: Spatial smoothing ( $\alpha=0.5$ ) significantly improves reconstruction sharpness.
- Anchors: Removing DAP causes significant drift in long-range depth estimation; both Global and Historical anchors are essential.

5. Significance

OVGGT addresses a central trade-off in streaming 3D vision: long-horizon inference versus bounded memory and compute.

Practical Deployment: By enabling continuous 3D reconstruction on consumer-grade GPUs (32GB) for arbitrarily long videos, it makes real-time applications like autonomous navigation, large-scale digital twin construction, and robotic manipulation feasible without expensive server clusters.
Theoretical Insight: The work demonstrates that "more context" (full cache) is not always better; intelligent, geometry-aware token selection can yield higher accuracy and efficiency.
Future Direction: It opens the door for "staged streaming inference," combining the bounded cost of causal models with periodic global refinement to further mitigate long-term drift.
