MERG3R: A Divide-and-Conquer Approach to Large-Scale Neural Visual Geometry

Imagine you are trying to build a massive 3D digital twin of an entire city using thousands of photos taken by tourists. Some photos are of the Eiffel Tower, some of the Louvre, and some are just blurry shots of the street. You have no idea what order they were taken in.

The Problem: The "All-at-Once" Bottleneck
Recently, scientists invented super-smart AI models (like VGGT or Pi3) that can look at a few photos and instantly figure out exactly where the camera was and build a 3D model. These AIs are like genius architects who can design a whole building in their head just by looking at a blueprint.

However, these geniuses have a major flaw: they have terrible short-term memory.
If you try to show them 1,000 photos at once, their brain (the computer's GPU memory) explodes. It's like trying to read a 1,000-page book in a single second; the pages just blur together, and the computer crashes. To make them work, people usually have to throw away 90% of the photos, leaving out huge parts of the city.

The Solution: MERG3R (The Smart Project Manager)
The authors of this paper created a new system called MERG3R. Think of MERG3R not as an architect, but as a brilliant Project Manager who knows how to organize a massive construction crew.

Instead of asking the genius architect to look at the whole city at once, MERG3R uses a "Divide and Conquer" strategy. Here is how it works, step-by-step:

1. Sorting the Chaos (The "Pseudo-Video")

First, MERG3R takes the messy pile of 1,000 unordered photos. It looks at them and says, "Okay, these two photos look like they were taken near the same tree, and these three look like they are near the river."
It rearranges the photos into a logical sequence, like a movie reel, even though they weren't taken in that order originally. It creates a smooth path through the city.

2. The Team Split (Divide)

Now, instead of giving the whole movie to one architect, MERG3R cuts the movie into small, overlapping chapters.

Chapter 1: The Eiffel Tower area.
Chapter 2: The river and the bridge (overlapping with Chapter 1).
Chapter 3: The Louvre (overlapping with Chapter 2).

Crucially, it doesn't just cut the reel into consecutive strips. It interleaves the frames first, so that every chapter contains a mix of viewpoints rather than a run of near-identical shots. This ensures that when each team builds its small piece, it has enough different viewpoints to get the 3D shape right.

3. Independent Construction (Local Reconstruction)

Now, the system sends each small chapter to the AI model, either one chapter at a time or to copies of the model running in parallel on several GPUs.

Team A builds a perfect 3D model of the Eiffel Tower.
Team B builds a perfect 3D model of the river.
Team C builds the Louvre.

Because each team only has to look at a small chunk of photos, their "memory" doesn't explode. They can do their job perfectly and quickly.

4. The Handshake (Alignment)

Here is the tricky part: Team A's Eiffel Tower might be slightly rotated differently than Team B's river. They need to fit together like puzzle pieces.
MERG3R looks at the overlapping areas (the bridge that appears in both Team A and Team B's photos). It uses a "handshake" protocol to rotate and shift the models until they snap together perfectly. It's like a group of people holding hands in a circle; if one person moves, everyone adjusts slightly to keep the circle connected.

5. The Final Polish (Bundle Adjustment)

Finally, MERG3R runs a global "stress test." It looks at the entire assembled city and asks, "Does this look physically possible?" It tweaks the camera positions and the 3D points slightly to make the whole thing smooth and consistent, removing any wobbles or gaps.

Why is this a Big Deal?

Memory Magic: While other methods need more than 64 GB of GPU memory (and still crash with too many photos), MERG3R can handle 1,000 photos on a single graphics card. It uses about 20 GB instead of more than 64 GB.
Speed: It finishes the job in about 8.5 minutes, where the baselines take more than 20 minutes or fail completely.
No Quality Loss: Even though it breaks the problem into small pieces, the final result is just as accurate as if the AI had seen all the photos at once (if it could have).

The Analogy in a Nutshell:
Imagine trying to solve a 10,000-piece jigsaw puzzle.

Old Way: You try to dump all 10,000 pieces on a tiny table at once. You can't see anything, you knock pieces off, and you give up.
MERG3R Way: You sort the pieces into 20 piles based on color and give each pile to a different person. Each person solves their small section. Then you take the edges where the piles overlap, match them up, and tape the sections together. Finally, you smooth out the seams.

MERG3R allows us to build massive, high-quality 3D worlds from thousands of photos without needing a supercomputer, making 3D reconstruction accessible, fast, and reliable for everyone.

1. Problem Statement

Recent advances in neural visual geometry (e.g., VGGT, Pi3, MASt3R) have achieved state-of-the-art (SOTA) accuracy in 3D reconstruction by using transformer-based architectures to jointly infer camera poses and dense point clouds. However, these models face a critical scalability bottleneck:

Memory Constraints: They rely on full self-attention mechanisms where computational cost and memory usage grow quadratically, as $O(N^2)$, with the number of input images $N$.
Hardware Limits: Processing large, unordered image collections (e.g., thousands of images for city-scale modeling) often exceeds GPU memory capacity, leading to Out-Of-Memory (OOM) errors.
Trade-offs: Existing attempts to improve scalability (e.g., token merging, chunking) often degrade geometric accuracy, fail to handle unordered inputs effectively, or still require simultaneous encoding of all images.

Goal: Develop a framework that allows modern geometric foundation models to reconstruct large, unordered image sets with high global accuracy while staying within native GPU memory limits.

2. Methodology: MERG3R

MERG3R is a training-free, divide-and-conquer framework that acts as a wrapper around existing geometric foundation models. It consists of four main stages:

A. Image Set Ordering and Partitioning

To handle unordered inputs, MERG3R first imposes a "pseudo-temporal" order and partitions the images into manageable, overlapping subsets.

Pseudo-Video Construction: It computes a dense visual-similarity matrix (using DINO features) between all image pairs. It then approximates a Hamiltonian path through the images that maximizes visual continuity, creating a pseudo-video sequence.
Interleaved Sampling: To ensure geometric diversity within each subset (preventing clusters from containing only nearly identical views), the sequence is permuted using interleaved sampling. This distributes views from across the entire trajectory into every subset.
Sliding Window: The interleaved sequence is split into overlapping subsets (clusters) using a sliding window with a fixed stride. This ensures sufficient overlap between adjacent clusters for global alignment.
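
The three partitioning steps above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it assumes the similarity matrix has already been computed (in MERG3R it would come from DINO features), uses a greedy nearest-neighbour walk as one simple way to approximate the Hamiltonian path, and reads "interleaved sampling" as taking every g-th frame of the ordered sequence. The function names `greedy_order`, `interleave`, and `sliding_windows` and all parameters are hypothetical.

```python
import numpy as np

def greedy_order(sim: np.ndarray, start: int = 0) -> list[int]:
    """Greedy walk: always step to the most similar unvisited image.

    A cheap stand-in for the approximate Hamiltonian path (assumption).
    """
    n = sim.shape[0]
    visited = np.zeros(n, dtype=bool)
    order = [start]
    visited[start] = True
    for _ in range(n - 1):
        # Mask visited images so they can never be picked again.
        scores = np.where(visited, -np.inf, sim[order[-1]])
        nxt = int(np.argmax(scores))
        order.append(nxt)
        visited[nxt] = True
    return order

def interleave(order: list[int], num_groups: int) -> list[int]:
    """Permute the sequence so each group takes every num_groups-th frame.

    Every group then spans the whole trajectory instead of one short stretch.
    """
    return [idx for g in range(num_groups) for idx in order[g::num_groups]]

def sliding_windows(seq: list[int], size: int, stride: int) -> list[list[int]]:
    """Split seq into windows of `size`; stride < size leaves an overlap."""
    if not 0 < stride <= size:
        raise ValueError("stride must satisfy 0 < stride <= size")
    windows, start = [], 0
    while True:
        windows.append(seq[start:start + size])
        if start + size >= len(seq):
            break
        start += stride
    return windows

if __name__ == "__main__":
    # Toy data: six "images" placed on a line; similarity = negative distance.
    pos = np.array([0.0, 5.0, 1.0, 4.0, 2.0, 3.0])
    sim = -np.abs(pos[:, None] - pos[None, :])
    order = greedy_order(sim)
    print(order, interleave(order, 2), sliding_windows(order, 4, 2))
```

The overlap between adjacent windows is `size - stride` images; those shared images are what the later alignment stage relies on.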

B. Local Reconstruction

Each subset is processed independently by a pre-trained geometric foundation model (e.g., VGGT, Pi3).

Because the subset size $T$ is small, the total attention cost drops from $O(N^2)$ to $O(K \cdot T^2)$, where $K$ is the number of clusters. Peak memory is bounded by a single cluster's $O(T^2)$ when clusters are processed one at a time.
This step produces local camera poses, depth maps, and confidence scores for each cluster.
Parallelization: Different subsets can be processed in parallel across multiple GPUs.
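
To make the scaling argument concrete, the following sketch counts image-pair interactions for a full run versus a clustered run. The window size of 100 and stride of 80 are illustrative assumptions, not values reported in the paper, and the helper names are hypothetical. Tokens per image are treated as a constant factor and ignored.

```python
import math

def num_windows(n: int, size: int, stride: int) -> int:
    """Number of sliding windows needed to cover n images."""
    if n <= size:
        return 1
    return 1 + math.ceil((n - size) / stride)

def attention_pairs(num_images: int) -> int:
    """Image-pair interactions under full self-attention."""
    return num_images * num_images

if __name__ == "__main__":
    n, size, stride = 1000, 100, 80   # assumed values for illustration only
    k = num_windows(n, size, stride)
    full = attention_pairs(n)
    per_cluster = attention_pairs(size)
    print(f"clusters K          = {k}")
    print(f"full attention      = {full} pairs")
    print(f"clustered, total    = {k * per_cluster} pairs")
    print(f"clustered, per pass = {per_cluster} pairs")
```

Under these assumed settings the per-pass cost, which is what determines peak memory, depends only on the window size and not on the total number of images.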

C. Cluster Alignment

The locally reconstructed clusters are aligned into a common reference frame.

Weighted Iterative Similarity Transform: Using the overlapping regions between adjacent clusters, the method identifies corresponding 3D points.
Confidence Filtering: Points with low confidence scores are filtered out.
Optimization: A similarity transform in $\mathrm{Sim}(3)$ is solved using Iteratively Reweighted Least Squares (IRLS) with a Huber loss to minimize reprojection errors, robustly aligning the clusters.
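
The alignment step can be illustrated with a small NumPy sketch. It is a simplified stand-in under stated assumptions: it uses 3D point-to-point residuals rather than reprojection errors, solves each weighted least-squares step in closed form with the Umeyama method, and applies standard Huber weights. The function names, the `delta` threshold, and the iteration count are hypothetical choices, not values from the paper.

```python
import numpy as np

def weighted_umeyama(src, dst, w):
    """Closed-form weighted Sim(3) fit so that dst ~ s * R @ src + t."""
    w = w / w.sum()
    mu_s, mu_d = w @ src, w @ dst
    sc, dc = src - mu_s, dst - mu_d
    cov = (dc * w[:, None]).T @ sc            # weighted cross-covariance
    U, D, Vt = np.linalg.svd(cov)
    S = np.ones(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[2] = -1.0                           # keep R a proper rotation
    R = U @ np.diag(S) @ Vt
    var_s = np.sum(w * np.sum(sc ** 2, axis=1))
    scale = np.sum(D * S) / var_s
    t = mu_d - scale * R @ mu_s
    return scale, R, t

def align_sim3_irls(src, dst, conf, delta=0.1, iters=10, min_conf=0.0):
    """IRLS with Huber weights on top of per-point confidences."""
    keep = conf > min_conf                    # confidence filtering
    src, dst, conf = src[keep], dst[keep], conf[keep].astype(float)
    w = conf.copy()
    for _ in range(iters):
        scale, R, t = weighted_umeyama(src, dst, w)
        r = np.linalg.norm(dst - (scale * src @ R.T + t), axis=1)
        # Huber weight: 1 inside delta, delta / r outside.
        huber = np.where(r <= delta, 1.0, delta / np.maximum(r, 1e-12))
        w = conf * huber
    return scale, R, t

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    src = rng.normal(size=(200, 3))
    dst = 2.0 * src + np.array([1.0, -2.0, 0.5])
    dst[:10] += 50.0                          # a few gross outliers
    print(align_sim3_irls(src, dst, np.ones(200)))
```

Here `src` and `dst` stand for the same 3D points as predicted by two overlapping clusters. The Huber weights shrink the influence of badly matched points, so a handful of gross outliers does not drag the estimated transform away from the inliers.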

D. Global Tracking and Bundle Adjustment (BA)

To achieve global consistency and refine the reconstruction:

Scalable Tracking: Instead of exhaustive pairwise matching, which requires $O(N^2)$ image pairs, MERG3R builds a sparse $k$-NN graph from the visual similarity matrix and runs LightGlue feature matching only on its edges.
3D Consistency Check: Raw matches are lifted to 3D using predicted depths and reprojected to verify geometric consistency, filtering out false positives.
Confidence-Weighted Global BA: The system performs a global optimization over all multi-view tracks. It jointly optimizes camera intrinsics, extrinsics, and 3D point positions by minimizing a confidence-weighted 2D reprojection error. This step is more efficient than optimizing over every image pair in the scene graph.
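
Two pieces of this stage are easy to sketch: selecting which image pairs to match, and evaluating the confidence-weighted reprojection cost that bundle adjustment minimizes. The code below is an illustration under assumptions: it uses a pinhole camera model with world-to-camera poses, the names `knn_edges` and `weighted_reprojection_cost` are hypothetical, and it only evaluates the cost for one camera. A real system would run a feature matcher such as LightGlue on the selected edges and hand the residuals to a nonlinear least-squares solver.

```python
import numpy as np

def knn_edges(sim: np.ndarray, k: int) -> set[tuple[int, int]]:
    """Undirected edges linking each image to its k most similar images."""
    s = sim.astype(float)                     # astype returns a copy
    np.fill_diagonal(s, -np.inf)              # never match an image to itself
    edges = set()
    for i in range(s.shape[0]):
        for j in np.argsort(-s[i], kind="stable")[:k]:
            edges.add((min(i, int(j)), max(i, int(j))))
    return edges

def weighted_reprojection_cost(K, R, t, points, obs, conf):
    """Sum over points of conf * squared 2D reprojection error (one camera).

    K: 3x3 intrinsics, (R, t): world-to-camera pose, points: (n, 3) world
    points, obs: (n, 2) observed pixels, conf: (n,) confidence weights.
    """
    cam = points @ R.T + t                    # world frame -> camera frame
    uvw = cam @ K.T                           # apply intrinsics
    proj = uvw[:, :2] / uvw[:, 2:3]           # perspective division
    return float(np.sum(conf * np.sum((proj - obs) ** 2, axis=1)))

if __name__ == "__main__":
    pos = np.array([0.0, 1.0, 2.5, 4.5, 10.0])
    sim = -np.abs(pos[:, None] - pos[None, :])
    edges = knn_edges(sim, k=1)
    n = len(pos)
    print(f"{len(edges)} edges matched instead of {n * (n - 1) // 2} pairs")
```

With a fixed `k`, the number of edges grows linearly with the number of images, which is what keeps the matching step tractable for large collections.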

3. Key Contributions

Training-Free Scalability: A modular pipeline that enables SOTA neural geometry models to operate on datasets far exceeding their native memory limits without retraining.
Novel Partitioning Strategy: Demonstrates that interleaved sampling is crucial for local reconstruction quality, ensuring diverse viewpoints within clusters and preventing the "single-facade" problem common in simple sliding windows.
Efficient Global Alignment: Introduces a confidence-weighted global bundle adjustment that leverages sparse graph-based tracking, offering better global consistency and efficiency compared to optimizing over all image pairs.
Model Agnosticism: The framework is compatible with any pre-trained geometric foundation model (e.g., VGGT, Pi3, FastVGGT).

4. Experimental Results

The authors evaluated MERG3R on 7-Scenes, NRGBD, Tanks & Temples, and Cambridge Landmarks, comparing against SOTA baselines (VGGT, Pi3, MASt3R-SfM, CUT3R, TTT3R, etc.).

Scalability & Memory:
- On 1,000-image sequences, baseline models (VGGT, Pi3) fail with OOM errors. MERG3R processes these successfully with ~20 GB of memory (vs. >64 GB for baselines) and in ~8.5 minutes (vs. >20 minutes).
- Memory consumption remains stable regardless of the total number of input images.
Accuracy:
- Camera Pose: On 7-Scenes (1,000 images), MERG3R + Pi3 achieved the best Relative Rotation Accuracy (RRA@30: 100%) and Relative Translation Accuracy (RTA@30: 97.69%), outperforming VGGT-Long and other baselines.
- Point Cloud: On NRGBD and 7-Scenes, MERG3R maintained high accuracy and completeness, whereas methods like CUT3R and TTT3R degraded rapidly as image count increased.
- Outdoor Scenes: On Cambridge Landmarks, MERG3R showed superior robustness in challenging outdoor environments compared to traditional SfM (GLOMAP, InstantSfM) and other neural methods.
Ablation Studies:
- Ordering: The pseudo-video ordering derived from unordered images performed nearly identically to ground-truth video ordering.
- Splitting Strategy: The proposed interleaved sampling significantly outperformed graph-based clustering and simple sliding windows in terms of pose accuracy (ATE, RRE, RTE).
- Tracking: The graph-based tracking with LightGlue provided better geometric consistency than using the foundation model's native tracking module for global BA.
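
For readers unfamiliar with the pose metric quoted above, the sketch below computes Relative Rotation Accuracy under its commonly used definition: the percentage of camera pairs whose relative rotation error is below a threshold in degrees. That the paper follows exactly this definition is an assumption, and the function names are hypothetical.

```python
import itertools
import numpy as np

def rotation_angle_deg(R: np.ndarray) -> float:
    """Rotation angle of a 3x3 rotation matrix, in degrees."""
    cos = np.clip((np.trace(R) - 1.0) / 2.0, -1.0, 1.0)
    return float(np.degrees(np.arccos(cos)))

def rra(pred_R, gt_R, threshold_deg: float = 30.0) -> float:
    """Percentage of camera pairs with relative rotation error < threshold."""
    hits, total = 0, 0
    for i, j in itertools.combinations(range(len(pred_R)), 2):
        rel_pred = pred_R[j] @ pred_R[i].T    # predicted relative rotation
        rel_gt = gt_R[j] @ gt_R[i].T          # ground-truth relative rotation
        err = rotation_angle_deg(rel_pred @ rel_gt.T)
        hits += err < threshold_deg
        total += 1
    return 100.0 * hits / total

def rot_z(deg: float) -> np.ndarray:
    """Rotation about the z axis, used only to build toy poses."""
    a = np.radians(deg)
    return np.array([[np.cos(a), -np.sin(a), 0.0],
                     [np.sin(a), np.cos(a), 0.0],
                     [0.0, 0.0, 1.0]])

if __name__ == "__main__":
    gt = [rot_z(0), rot_z(90), rot_z(180)]
    pred = [rot_z(0), rot_z(90), rot_z(225)]  # third camera is off by 45 degrees
    print(rra(pred, gt))
```

Because the metric compares relative rotations between pairs, it is unaffected by a global rotation of the whole reconstruction.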

5. Significance

MERG3R bridges the gap between the high accuracy of modern neural visual geometry and the practical requirements of large-scale 3D reconstruction.

Democratization: It reduces reliance on massive GPU clusters, making high-quality 3D reconstruction accessible on standard hardware.
Practical Application: It enables applications like city-scale modeling, cultural heritage preservation, and autonomous navigation where input data is often unordered and massive.
Future Direction: It establishes a paradigm of merging traditional geometric optimization (bundle adjustment, graph-based tracking) with deep learning foundation models, suggesting a path forward for scaling neural geometry without being constrained by quadratic attention mechanisms.