DAGE: Dual-Stream Architecture for Efficient and Fine-Grained Geometry Estimation

DAGE introduces a dual-stream transformer architecture that efficiently estimates accurate, view-consistent geometry and camera poses from uncalibrated multi-view inputs. By disentangling global coherence (handled in a low-resolution stream) from fine detail (handled in a high-resolution stream), it achieves state-of-the-art performance while supporting high resolutions and long sequences.

Tuan Duc Ngo, Jiahui Huang, Seoung Wug Oh, Kevin Blackburn-Matzen, Evangelos Kalogerakis, Chuang Gan, Joon-Young Lee

Published 2026-03-05

Imagine you are trying to build a perfect 3D model of a bustling city street using only a video taken from a moving car. You want the model to be so detailed you can read the license plate on a distant car (high resolution), but you also need to make sure the buildings don't wobble or shift as the camera moves (global consistency).

The Problem:
Current AI models are like a team of two specialists trying to do this job, each with a crippling flaw.

  1. The "Detail" Specialist: Can see the license plates and tiny cracks in the sidewalk perfectly, but if you show them a long video, they get confused. They forget that the building they saw in frame 1 is the same building in frame 100. The result is a shaky, jittery mess.
  2. The "Big Picture" Specialist: Understands the whole city layout and keeps the buildings steady, but they are wearing thick foggy glasses. They can't see the license plates or the small details; everything looks blurry and smooth.

Existing models try to force one person to do both jobs. To keep the "Big Picture" person from getting overwhelmed, they have to show them blurry, low-resolution images. This means the final 3D model is always blurry, no matter how high-quality the original video was.

The Solution: DAGE (The Dual-Stream Architect)
The authors of DAGE came up with a clever new team structure. Instead of forcing one person to do everything, they hired two specialists and a smart manager to coordinate them.

1. The Low-Resolution Stream (The "Big Picture" Manager)

  • What they do: This stream looks at the video, but it shrinks every frame down to a tiny thumbnail size (like a 540p or 252px image).
  • Why? Because the images are small, the computer can process thousands of frames at once without crashing. This allows the AI to understand the entire scene, figure out where the camera is moving, and ensure that the building on the left stays on the left throughout the whole video.
  • The Analogy: Think of this as looking at a map of the city. You can't see the cracks in the pavement, but you know exactly which way the streets go and how the buildings relate to each other.
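To see why shrinking the frames matters so much, here is some back-of-the-envelope arithmetic. The 14-pixel patch size, the square frames, and the exact 2K side length are illustrative assumptions (not figures from the paper); the key fact is that vanilla self-attention cost grows with the square of the total token count.

```python
# Rough cost of joint ("global") attention across many frames.
# Assumptions (illustrative only): ViT-style 14px patches, square frames,
# attention cost ~ (number of tokens)^2.

def tokens_per_frame(side_px: int, patch: int = 14) -> int:
    """Number of patch tokens for a square frame of side_px pixels."""
    return (side_px // patch) ** 2

frames = 1000
low_res = tokens_per_frame(252)    # thumbnail stream (252px, as in the text)
high_res = tokens_per_frame(2016)  # ~2K stream (2016 = 144 * 14, hypothetical)

# Joint attention over ALL frames at once: cost ~ (frames * tokens)^2
joint_low = (frames * low_res) ** 2
joint_high = (frames * high_res) ** 2

# Per-frame attention in the detail stream: cost ~ frames * tokens^2
per_frame_high = frames * high_res ** 2

print(f"tokens/frame at 252px:  {low_res}")
print(f"tokens/frame at 2016px: {high_res}")
print(f"global attention on thumbnails is ~{joint_high // joint_low}x cheaper")
print(f"per-frame 2K attention is ~{joint_high // per_frame_high}x cheaper "
      f"than joint 2K attention")
```

Under these toy numbers, doing the global reasoning on thumbnails instead of 2K frames cuts the attention cost by a factor of a few thousand, which is exactly the budget that lets the model look at the whole sequence at once.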

2. The High-Resolution Stream (The "Detail" Artist)

  • What they do: This stream looks at the original, full-resolution video frames (e.g., 2K), one by one.
  • Why? Because it doesn't have to worry about the whole video at once, it can focus entirely on preserving sharp edges, tiny textures, and fine details.
  • The Analogy: This is like a painter looking at a single brick in a wall. They can see the texture, the moss, and the exact color, but they don't know how that brick fits into the whole building.

3. The Lightweight Adapter (The "Smart Manager")

  • What they do: This is the magic glue. It takes the "Big Picture" understanding from the first stream and injects it into the "Detail" stream.
  • How? Imagine the Detail Artist is painting a brick. The Manager whispers, "Hey, remember that brick is part of a tall tower, and the tower is leaning slightly to the left." The Artist then paints the brick with perfect detail, but in the correct position relative to the whole tower.
  • The Result: You get a 3D model that is sharply detailed (because of the High-Res stream) but globally consistent (because of the Low-Res stream).
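The three-part flow above can be sketched in a few lines of numpy. This is a minimal toy sketch of the information flow only: every name, shape, and the simple linear "adapter" are invented for illustration, and the real model replaces the averaging stand-ins with transformer attention.

```python
import numpy as np

rng = np.random.default_rng(0)

F, LO_TOK, HI_TOK, D = 8, 16, 256, 32  # frames, tokens per stream, feature dim

# --- Low-resolution stream: sees every frame at once ------------------
# Stand-in for joint attention over all frames: mix information across
# the whole sequence so each frame's summary is globally informed.
lo_feats = rng.normal(size=(F, LO_TOK, D))
global_context = lo_feats.mean(axis=(0, 1))               # scene-level summary
frame_summaries = lo_feats.mean(axis=1) + global_context  # (F, D)

# --- High-resolution stream: one frame at a time ----------------------
hi_feats = rng.normal(size=(F, HI_TOK, D))  # sharp, per-frame features

# --- Lightweight adapter: inject global context into the detail stream
W = rng.normal(size=(D, D)) * 0.01  # tiny projection (hypothetical)

fused = np.empty_like(hi_feats)
for f in range(F):  # each frame is processed independently...
    # ...but conditioned on its globally consistent frame summary.
    fused[f] = hi_feats[f] + frame_summaries[f] @ W

assert fused.shape == (F, HI_TOK, D)
```

The design point the sketch captures: the expensive all-frames reasoning only ever touches the small `lo_feats` tensor, while the big `hi_feats` tensor is handled one frame at a time and merely *receives* the global signal through the adapter.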

Why is this a Big Deal?

  • Speed: Old models tried to do the "Big Picture" math on high-resolution images, which is like trying to solve a giant puzzle while wearing oven mitts. It's slow and heavy. DAGE does the heavy math on tiny images and the detailed work separately, making it 2x to 28x faster.
  • Scale: Old models would run out of memory on long videos or on frames larger than about 512 pixels. DAGE can handle 2K resolution and 1,000-frame sequences without breaking a sweat.
  • Quality: It produces 3D point clouds (the digital skeleton of the scene) that are so sharp you can see fine details like text on signs or thin wires, which previous models smoothed out into nothingness.

In Summary:
DAGE is like a construction crew where one team surveys the whole site to make sure the building is straight, while another team does the intricate brickwork. A foreman keeps them talking so the bricks are placed perfectly in the right spot. The result is a building that is both structurally sound and beautifully detailed, built faster than ever before.