DAGE: Dual-Stream Architecture for Efficient and Fine-Grained Geometry Estimation

DAGE introduces a dual-stream transformer architecture that efficiently estimates accurate, view-consistent geometry and camera poses from uncalibrated multi-view inputs. By disentangling global coherence (handled in a low-resolution stream) from fine detail (handled in a high-resolution stream), it achieves state-of-the-art performance while supporting high resolutions and long sequences.

Tuan Duc Ngo, Jiahui Huang, Seoung Wug Oh, Kevin Blackburn-Matzen, Evangelos Kalogerakis, Chuang Gan, Joon-Young Lee

Published 2026-03-05

Imagine you are trying to build a perfect 3D model of a bustling city street using only a video taken from a moving car. You want the model to be so detailed you can read the license plate on a distant car (high resolution), but you also need to make sure the buildings don't wobble or shift as the camera moves (global consistency).

The Problem:
Current AI models are like a team of two specialists trying to do this job, each with a crippling flaw.

  1. The "Detail" Specialist: Can see the license plates and tiny cracks in the sidewalk perfectly, but if you show them a long video, they get confused. They forget that the building they saw in frame 1 is the same building in frame 100. The result is a shaky, jittery mess.
  2. The "Big Picture" Specialist: Understands the whole city layout and keeps the buildings steady, but they are wearing thick foggy glasses. They can't see the license plates or the small details; everything looks blurry and smooth.

Existing models try to force one person to do both jobs. To keep the "Big Picture" person from getting overwhelmed, they have to show them blurry, low-resolution images. This means the final 3D model is always blurry, no matter how high-quality the original video was.

The Solution: DAGE (The Dual-Stream Architect)
The authors of DAGE came up with a clever new team structure. Instead of forcing one person to do everything, they hired two specialists and a smart manager to coordinate them.

1. The Low-Resolution Stream (The "Big Picture" Manager)

  • What they do: This stream looks at the video, but it shrinks every frame down to a tiny thumbnail size (like a 540p or 252px image).
  • Why? Because the images are small, the computer can process thousands of frames at once without crashing. This allows the AI to understand the entire scene, figure out where the camera is moving, and ensure that the building on the left stays on the left throughout the whole video.
  • The Analogy: Think of this as looking at a map of the city. You can't see the cracks in the pavement, but you know exactly which way the streets go and how the buildings relate to each other.
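To see why shrinking the frames matters so much, here is some back-of-the-envelope arithmetic. The 14-pixel patch size, the square frames, and the exact 2K side length are illustrative assumptions (not figures from the paper); the key fact is that vanilla self-attention cost grows with the square of the total token count.

```python
# Rough cost of joint ("global") attention across many frames.
# Assumptions (illustrative only): ViT-style 14px patches, square frames,
# attention cost ~ (number of tokens)^2.

def tokens_per_frame(side_px: int, patch: int = 14) -> int:
    """Number of patch tokens for a square frame of side_px pixels."""
    return (side_px // patch) ** 2

frames = 1000
low_res = tokens_per_frame(252)    # thumbnail stream (252px, as in the text)
high_res = tokens_per_frame(2016)  # ~2K stream (2016 = 144 * 14, hypothetical)

# Joint attention over ALL frames at once: cost ~ (frames * tokens)^2
joint_low = (frames * low_res) ** 2
joint_high = (frames * high_res) ** 2

# Per-frame attention in the detail stream: cost ~ frames * tokens^2
per_frame_high = frames * high_res ** 2

print(f"tokens/frame at 252px:  {low_res}")
print(f"tokens/frame at 2016px: {high_res}")
print(f"global attention on thumbnails is ~{joint_high // joint_low}x cheaper")
print(f"per-frame 2K attention is ~{joint_high // per_frame_high}x cheaper "
      f"than joint 2K attention")
```

Under these toy numbers, doing the global reasoning on thumbnails instead of 2K frames cuts the attention cost by a factor of a few thousand, which is exactly the budget that lets the model look at the whole sequence at once.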

2. The High-Resolution Stream (The "Detail" Artist)

  • What they do: This stream looks at the original, full-resolution video frames (e.g., 2K), one by one.
  • Why? Because it doesn't have to worry about the whole video at once, it can focus entirely on preserving sharp edges, tiny textures, and fine details.
  • The Analogy: This is like a painter looking at a single brick in a wall. They can see the texture, the moss, and the exact color, but they don't know how that brick fits into the whole building.

3. The Lightweight Adapter (The "Smart Manager")

  • What they do: This is the magic glue. It takes the "Big Picture" understanding from the first stream and injects it into the "Detail" stream.
  • How? Imagine the Detail Artist is painting a brick. The Manager whispers, "Hey, remember that brick is part of a tall tower, and the tower is leaning slightly to the left." The Artist then paints the brick with perfect detail, but in the correct position relative to the whole tower.
  • The Result: You get a 3D model that is sharply detailed (because of the High-Res stream) but globally consistent (because of the Low-Res stream).
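The three-part flow above can be sketched in a few lines of numpy. This is a minimal toy sketch of the information flow only: every name, shape, and the simple linear "adapter" are invented for illustration, and the real model replaces the averaging stand-ins with transformer attention.

```python
import numpy as np

rng = np.random.default_rng(0)

F, LO_TOK, HI_TOK, D = 8, 16, 256, 32  # frames, tokens per stream, feature dim

# --- Low-resolution stream: sees every frame at once ------------------
# Stand-in for joint attention over all frames: mix information across
# the whole sequence so each frame's summary is globally informed.
lo_feats = rng.normal(size=(F, LO_TOK, D))
global_context = lo_feats.mean(axis=(0, 1))               # scene-level summary
frame_summaries = lo_feats.mean(axis=1) + global_context  # (F, D)

# --- High-resolution stream: one frame at a time ----------------------
hi_feats = rng.normal(size=(F, HI_TOK, D))  # sharp, per-frame features

# --- Lightweight adapter: inject global context into the detail stream
W = rng.normal(size=(D, D)) * 0.01  # tiny projection (hypothetical)

fused = np.empty_like(hi_feats)
for f in range(F):  # each frame is processed independently...
    # ...but conditioned on its globally consistent frame summary.
    fused[f] = hi_feats[f] + frame_summaries[f] @ W

assert fused.shape == (F, HI_TOK, D)
```

The design point the sketch captures: the expensive all-frames reasoning only ever touches the small `lo_feats` tensor, while the big `hi_feats` tensor is handled one frame at a time and merely *receives* the global signal through the adapter.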

Why is this a Big Deal?

  • Speed: Old models tried to do the "Big Picture" math on high-resolution images, which is like trying to solve a giant puzzle while wearing oven mitts. It's slow and heavy. DAGE does the heavy math on tiny images and the detailed work separately, making it 2x to 28x faster.
  • Scale: Old models would run out of memory on long videos or on frames larger than about 512 pixels. DAGE can handle 2K resolution and 1,000-frame sequences without breaking a sweat.
  • Quality: It produces 3D point clouds (the digital skeleton of the scene) that are so sharp you can see fine details like text on signs or thin wires, which previous models smoothed out into nothingness.

In Summary:
DAGE is like a construction crew where one team surveys the whole site to make sure the building is straight, while another team does the intricate brickwork. A foreman keeps them talking so the bricks are placed perfectly in the right spot. The result is a building that is both structurally sound and beautifully detailed, built faster than ever before.