Marginalized Bundle Adjustment: Multi-View Camera Pose from Monocular Depth Estimates

Imagine you are trying to build a 3D model of a room using only a series of 2D photos taken from different angles. This is the classic puzzle of Structure-from-Motion (SfM).

Traditionally, computers solve this by finding tiny, sharp details in the photos (like the corner of a table or a crack in the wall) and matching them up. It's like playing a game of "connect the dots" with very precise, reliable dots.

The Problem:
Recently, AI has gotten really good at guessing how deep a scene is just by looking at a single photo. This is called Monocular Depth Estimation (MDE). It's like having a super-intelligent artist who can look at a flat picture and instantly sketch a 3D version of it.

However, there's a catch. While this AI artist is fast and covers the entire picture (every single pixel), their sketches are a bit "noisy" or "wobbly." They aren't as precise as the tiny, sharp dots used in the old method. If you try to use the old "connect the dots" method with these wobbly AI sketches, the whole 3D model falls apart. The computer gets confused by the noise.

The Solution: Marginalized Bundle Adjustment (MBA)
This paper introduces a new way to use these "wobbly" but super-dense AI sketches to build 3D models. The authors call their method Marginalized Bundle Adjustment (MBA).

Here is how it works, using a few analogies:

1. The "Wobbly Crowd" vs. The "Sniper"

Old Method (Sniper): The old way relied on a few "sniper" shots—very precise, sparse points. If one sniper missed, it was a big problem.
New Method (The Crowd): The AI depth maps give you a "crowd" of millions of data points. Individually, many people in the crowd might be shouting the wrong thing (noise), but because there are so many of them, the truth is hidden in the crowd's overall behavior.

2. The "RANSAC" Analogy (The Voting System)

The paper is inspired by a technique called RANSAC, which is like a voting system for finding the truth in a noisy room.

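Traditional RANSAC: Imagine proposing a line and asking 100 data points, "Are you close enough to this line?" Each point votes "Yes" only if its error is below a fixed cutoff, and the proposal with the most "Yes" votes wins. But this is a harsh "Yes/No" vote. If the cutoff is too strict, you ignore good data; if it is too loose, you accept bad data.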
The MBA Innovation: Instead of a harsh "Yes/No" vote, the authors created a smooth voting system. They look at the entire distribution of answers from the crowd.
- They ask: "How many people think the error is small? How many think it's medium? How many think it's huge?"
- Instead of picking one specific error limit, they calculate the Area Under the Curve of all these answers. They essentially say, "We don't need to pick a perfect threshold; let's just trust the shape of the crowd's opinion."
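To see the difference between the two voting styles, here is a small Python sketch. It is an illustration only, not the authors' code: the function names and the toy error values are made up. It scores two candidate camera poses first with a hard vote at a single cutoff, then by averaging the vote over a whole range of cutoffs (the area under the curve).

```python
from bisect import bisect_right

def inlier_fraction(residuals, tau):
    """Hard vote: fraction of residuals at or below a single cutoff tau."""
    ordered = sorted(residuals)
    return bisect_right(ordered, tau) / len(ordered)

def cdf_area(residuals, tau_max, steps=200):
    """Soft vote: area under the empirical CDF on [0, tau_max].

    Approximated with a midpoint Riemann sum, i.e. the hard vote
    averaged over many cutoffs instead of evaluated at one.
    """
    width = tau_max / steps
    total = sum(inlier_fraction(residuals, (k + 0.5) * width) for k in range(steps))
    return total * width

# Toy per-pixel errors for two candidate poses (made-up numbers).
pose_a = [0.4, 0.6, 0.9, 1.1, 1.2, 9.0]
pose_b = [0.9, 0.95, 0.98, 0.99, 1.0, 9.0]

for tau in (0.5, 1.0, 2.0):
    print("cutoff", tau, "A:", inlier_fraction(pose_a, tau), "B:", inlier_fraction(pose_b, tau))

print("area A:", cdf_area(pose_a, 2.0), "area B:", cdf_area(pose_b, 2.0))
```

With these toy numbers the hard vote prefers pose A at a cutoff of 0.5 and pose B at a cutoff of 1.0, so the winner depends on which cutoff you happened to pick. The area-based score does not require that choice; here it favors pose A, whose errors are smaller overall.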

3. "Marginalizing" the Noise

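The word "Marginalized" in the title is a math term for averaging over a quantity you are unsure about instead of committing to a single value. Here, that quantity is the error cutoff used to decide which pixels count as "good."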

Imagine you are trying to hear a friend speak in a noisy bar.

Old way: You try to pick out one specific word they said clearly. If you miss it, you fail.
MBA way: You listen to the entire conversation over time. Even if individual words are muffled by the noise, the overall pattern of the sentence becomes clear. The method mathematically "marginalizes" (averages over) the error cutoff rather than betting on one value, so no single noisy pixel decides the outcome and the dense, noisy data can actually help build a better model.

Why is this a Big Deal?

It's Dense: It uses every pixel, not just a few. This means it works even in smooth areas (like a blank wall) where the old "connect the dots" method fails because there are no dots to connect.
It's Robust: It doesn't break when the AI depth guess is a little bit wrong. It treats the "wobble" as a known quantity and works around it.
It Scales: The authors tested this on thousands of images (like a whole city or a large building). Other methods that try to use deep learning often crash because they run out of computer memory when the dataset gets too big. This method can handle massive projects.

The Result

By using this new "Crowd Voting" approach, the authors showed that you can take a standard AI depth model (which is usually just a rough guess) and use its output to recover accurate camera poses and 3D structure. They beat or matched the best existing methods on many standard tests, suggesting that dense, noisy data can compete with sparse, precise data if you know how to listen to the crowd.

In short: They taught the computer to stop looking for perfect dots and start listening to the "wisdom of the crowd" in the noisy depth maps, resulting in accurate 3D reconstruction that scales to thousands of images.

1. Problem Statement

Structure-from-Motion (SfM) is a fundamental task in 3D vision aimed at recovering camera parameters (intrinsics and extrinsics) and scene geometry from multi-view images.

The Challenge: Traditional SfM relies on sparse feature matching and triangulation, which often fails in low-texture scenes or with limited parallax. Conversely, recent deep learning advances enable Monocular Depth Estimation (MDE) to provide dense structural priors from single images without motion cues.
The Gap: Integrating MDE into SfM pipelines is difficult because MDE produces dense but high-variance depth maps. Classical Bundle Adjustment (BA) assumes sparse, accurate point clouds and fails when applied directly to noisy, dense depth predictions. Existing methods either discard dense data to initialize sparse keypoints or rely on memory-intensive end-to-end networks that do not scale to large datasets.

2. Methodology: Marginalized Bundle Adjustment (MBA)

The authors propose a "Motion-from-Structure" approach that directly recovers camera motion from dense MDE outputs without per-pixel refinement, intervening only to resolve scale ambiguity via affine corrections.
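To make "motion from structure" concrete, the sketch below shows one common way a depth-based projective residual can be computed: lift a pixel to 3D using its affine-corrected depth, move the point into the other camera, project it, and compare against the matched pixel. This is an illustration under assumptions, not the authors' implementation. The pinhole camera model, the pixel-distance residual, and the names `reprojection_residual`, `R_ij`, and `t_ij` are all hypothetical.

```python
def matvec(m, v):
    """Multiply a 3x3 matrix (nested lists) by a 3-vector."""
    return [sum(m[r][c] * v[c] for c in range(3)) for r in range(3)]

def reprojection_residual(pixel_i, depth_i, alpha, beta, K, K_inv, R_ij, t_ij, pixel_j):
    """Pixel distance between pixel_j and the reprojection of pixel_i.

    Assumptions (not from the source): pinhole intrinsics K shared by both
    frames, relative pose (R_ij, t_ij) mapping frame-i points into frame j,
    and an affine depth correction d = alpha * depth_i + beta.
    """
    d = alpha * depth_i + beta                        # affine-corrected depth
    ray = matvec(K_inv, [pixel_i[0], pixel_i[1], 1.0])
    point_i = [d * c for c in ray]                    # 3D point in frame i
    rotated = matvec(R_ij, point_i)
    point_j = [rotated[k] + t_ij[k] for k in range(3)]  # 3D point in frame j
    proj = matvec(K, point_j)
    u, v = proj[0] / proj[2], proj[1] / proj[2]       # perspective division
    return ((u - pixel_j[0]) ** 2 + (v - pixel_j[1]) ** 2) ** 0.5

# Toy setup: focal length 100, principal point (50, 50), sideways translation.
K = [[100.0, 0.0, 50.0], [0.0, 100.0, 50.0], [0.0, 0.0, 1.0]]
K_inv = [[0.01, 0.0, -0.5], [0.0, 0.01, -0.5], [0.0, 0.0, 1.0]]
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
t = [0.2, 0.0, 0.0]

# The raw depth is off by a factor of two; the matched pixel is (60, 50).
good = reprojection_residual((50.0, 50.0), 1.0, 2.0, 0.0, K, K_inv, identity, t, (60.0, 50.0))
bad = reprojection_residual((50.0, 50.0), 1.0, 1.0, 0.0, K, K_inv, identity, t, (60.0, 50.0))
print(good, bad)
```

In this toy setup the residual vanishes once the scale factor compensates for the depth error, and it is about 10 pixels without the correction. That is the role the affine terms play: they absorb the scale ambiguity of monocular depth without touching individual pixels.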

Core Innovation: The MBA Objective

Inspired by RANSAC (Random Sample Consensus), the authors address the high variance of MDE predictions by moving away from discrete inlier counting (which is non-differentiable and threshold-sensitive) toward a continuous, probabilistic formulation.

Residual Distribution Modeling: Instead of using a single error threshold $\tau$ to classify pixels as inliers/outliers, the method treats the projective residuals of all dense pixels as a random variable following an empirical distribution.
CDF Integration: The authors observe that the count of inliers for a given threshold corresponds to the Cumulative Distribution Function (CDF), $F(\tau)$, of the residuals.
Marginalization: To avoid sensitivity to a specific threshold, the objective maximizes the Area Under the Curve (AUC) of the empirical CDF up to a maximum threshold $\tau_{max}$. This effectively "marginalizes out" the error threshold, integrating information across a range of thresholds.
Differentiable Surrogate Loss: Since analytical AUC maximization is intractable, they derive a differentiable surrogate loss:
$L_{MBA} = -\frac{1}{|R|} \sum_{r_{i,j,k} \in R} F(r_{i,j,k}) \cdot \mathbb{1}[r_{i,j,k} < \tau_{max}]$
where $R$ is the set of residuals and $F(r)$ is the CDF value at residual $r$. The backward pass suppresses gradients for extreme outliers (low probability), making the optimization robust to noise.
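For reference, the empirical CDF and the area it encloses, as described in the CDF Integration and Marginalization items, can be written out as follows. The symbol $R$ for the set of residuals is taken from the $|R|$ in the loss; the non-strict inequality and the lower integration limit of zero (residuals are non-negative distances) are assumptions.

```latex
F(\tau) = \frac{1}{|R|} \sum_{r \in R} \mathbb{1}[r \le \tau],
\qquad
\mathrm{AUC}(\tau_{max}) = \int_{0}^{\tau_{max}} F(\tau)\, d\tau
```

Read this way, $|R| \cdot F(\tau)$ is the RANSAC inlier count at threshold $\tau$, and the integral accumulates that count over every threshold up to $\tau_{max}$ instead of evaluating it at one.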

System Pipeline

Inputs: Unordered RGB frames, pre-computed dense depth maps (e.g., from DUSt3R), and dense correspondence maps.
Optimization Variables: Camera intrinsics ($K$), extrinsics ($P$), and per-frame affine depth corrections ($\alpha, \beta$) to handle scale ambiguity.
Coarse-to-Fine Strategy:
- Coarse Stage: Uses a decomposed "star-shaped" subgraph for each frame and a logarithmic transformation of residuals to prevent early convergence to local minima caused by poorly registered frames.
- Fine Stage: Performs global Bundle Adjustment over the full pose graph using the standard MBA loss.
Scalability: The method subsamples dense data into a fixed-size matrix ($|E| \times \kappa \times 5$), allowing parallelization across multiple GPUs. This enables global optimization on datasets with thousands of images (e.g., 8,000 frames), a scale where previous deep learning SfM methods fail due to memory constraints.
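The fixed-size packing in the Scalability item can be sketched as follows. This is a minimal illustration, not the authors' code. The summary does not say what the five channels hold, so each sample is treated as an opaque 5-tuple. The function name `subsample_edges`, the dictionary input, and the policy of topping up sparse edges by sampling with replacement are all assumptions.

```python
import random

def subsample_edges(edge_samples, kappa, seed=0):
    """Pack per-edge samples into a fixed |E| x kappa x 5 nested list.

    edge_samples: dict mapping a pose-graph edge (i, j) to a list of 5-tuples.
    Edges with at least kappa samples are subsampled without replacement;
    edges with fewer are topped up by sampling with replacement (an assumption).
    """
    rng = random.Random(seed)
    edges = sorted(edge_samples)
    packed = []
    for edge in edges:
        samples = edge_samples[edge]
        if not samples:
            raise ValueError(f"edge {edge} has no samples")
        if len(samples) >= kappa:
            chosen = rng.sample(samples, kappa)
        else:
            chosen = list(samples) + rng.choices(samples, k=kappa - len(samples))
        packed.append([tuple(s) for s in chosen])
    return edges, packed

# Toy pose graph with two edges of very different density.
toy = {
    (0, 1): [(float(n), 0.0, 1.0, 0.0, 0.0) for n in range(1000)],
    (1, 2): [(float(n), 0.0, 1.0, 0.0, 0.0) for n in range(3)],
}
edges, packed = subsample_edges(toy, kappa=8)
print(len(packed), len(packed[0]), len(packed[0][0]))
```

Because every edge contributes exactly `kappa` rows regardless of how many pixels it originally had, memory grows with the number of edges rather than with image resolution, and the resulting block can be split evenly across devices.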

3. Key Contributions

First General Framework: According to the authors, the first framework to successfully integrate general-purpose MDE models into both small-scale and large-scale SfM and camera re-localization tasks.
Novel Objective Function: A principled, RANSAC-inspired objective (MBA) that handles dense, high-variance depth priors by marginalizing the error threshold via CDF integration. This formulation generalizes MAGSAC (a state-of-the-art 2-view estimator) to multi-view settings.
Scalability: Demonstrates the ability to perform global Bundle Adjustment on massive datasets (thousands of images) using distributed clusters, overcoming the memory bottlenecks of previous learning-based SfM approaches.
Zero-Shot Performance: Achieves State-of-the-Art (SoTA) or competitive results without scene-specific fine-tuning, leveraging the generalization power of foundation MDE models.

4. Experimental Results

The method was evaluated on diverse benchmarks covering indoor, outdoor, small-scale, and large-scale scenarios.

ETH3D (High-Res SfM): Achieved SoTA results, significantly outperforming classic COLMAP, learning-based DF-SfM, and specialized point-cloud methods like MASt3R-SfM.
IMC2021 (Internet Images): Ranked competitively against top methods (e.g., VGGT+BA) and outperformed MASt3R-SfM and FlowMap, demonstrating robustness to challenging internet imagery (sky, rivers, crowds).
Tanks & Temples (Large-Scale): Performed on-par or better than both feed-forward and optimization-based baselines, including methods that fail to converge on this dataset.
ScanNet: Outperformed COLMAP even when COLMAP was restricted to only the frames it could successfully register.
Camera Re-localization (7-Scenes & Wayspots):
- On 7-Scenes, achieved competitive accuracy (2nd place overall) against scene-specific regression methods, despite being scene-agnostic.
- On Wayspots (a map-free dataset with flipped images and no ground-truth depth), the method achieved SoTA performance, highlighting its ability to handle extreme geometric variations and scale changes.
Two-View RANSAC: When used as a scoring function for essential matrix estimation, MBA matched the performance of MAGSAC++, which is consistent with its framing as a multi-view generalization of MAGSAC.

5. Significance and Impact

This paper bridges the gap between dense structural priors (from modern MDE foundation models) and classical geometric optimization (SfM).

Paradigm Shift: It challenges the notion that dense depth maps are too noisy for precise pose estimation, showing that with the right objective function (MBA), density can be leveraged to overcome variance.
Scalability: By avoiding memory-intensive end-to-end network inference for the entire scene, the method enables large-scale 3D reconstruction using commodity hardware clusters, making MDE-based SfM viable for real-world applications like robotics and large-scale mapping.
Generalization: The approach works "out-of-the-box" with various MDE models (DUSt3R, UniDepth, ZoeDepth) and correspondence models, offering a flexible and robust solution for multi-view geometry.

Code Availability: The authors have released their code at https://marginalized-ba.github.io/.
