Distractor-free Generalizable 3D Gaussian Splatting

Imagine you are trying to build a perfect 3D hologram of a beautiful park using only a few photos taken with your phone. This is what 3D Gaussian Splatting (3DGS) does: it takes flat pictures and turns them into a 3D world you can walk around in.

However, there's a big problem in the real world: Distractors.

The Problem: The Unwanted Party Crashers

Imagine you take photos of a statue in a park. But in every photo, a bus drives by, a balloon floats past, or a group of tourists walks right in front of the statue.

The Old Way: If you try to build your 3D hologram using these messy photos, the computer gets confused. It tries to "glue" the bus and the balloon into the statue. The result? A glitchy, blurry mess where the statue has a bus-shaped hole or a floating ghost balloon.
The Limitation: Previous AI methods were like strict librarians who only worked in a quiet, empty room. They couldn't handle the chaos of the real world. If you wanted to remove the bus, you had to manually tell the computer, "Hey, that bus is bad," for every single scene. That's slow and impossible for a phone app that needs to work instantly.

The Solution: DGGS (The Smart Filter)

The authors of this paper invented a new system that acts like a super-smart, automatic editor that learns to ignore the party crashers on its own. They call it DGGS, short for "Distractor-free Generalizable 3D Gaussian Splatting."

Here is how it works, using simple analogies:

1. The Training Phase: Learning by "Cross-Checking"

Imagine you are trying to figure out what a statue looks like, but you only have photos taken from different angles, and some have people walking in front of it.

The Old Trick: The computer looks at one photo, sees a person, and thinks, "Oh, that's part of the statue!" It gets confused.
The DGGS Trick: The system looks at all the photos together. It knows that the statue is solid and stays in the same place. The bus, however, moves.
- It asks: "If I look at the statue from Angle A, Angle B, and Angle C, does the bus appear in all of them?"
- Answer: No. The bus is only in some.
- Action: The system realizes, "Aha! The bus is an intruder!" It creates a mask (like a digital stencil) to paint over the bus and ignore it. It does this automatically without needing a human to tell it what a bus looks like. It learns that consistency = real object, and inconsistency = distractor.

2. The Inference Phase: The Two-Stage Cleanup

Once the system is trained, you give it a new set of messy photos to build a 3D model. It uses a two-step cleaning process:

Stage 1: The "Best Photo" Selection (Reference Scoring)
Imagine you have a pile of 10 photos to build your model. Some have a bus, some have a balloon, and some are clean.
- The system quickly scans all 10 photos and gives them a "cleanliness score."
- It picks the top 4 cleanest photos to do the heavy lifting. It ignores the messy ones for the main construction.
Stage 2: The "Ghost Buster" (Distractor Pruning)
Even with the best photos, a tiny bit of a bus might still be visible in the corner.
- The system builds the 3D model and then looks at it. If it sees a "ghost" (a floating piece of the bus that doesn't belong), it uses a digital pair of scissors to prune (cut out) those specific 3D particles.
- It's like a gardener trimming away the weeds that managed to sneak into the flower bed.

Why Is This a Big Deal?

It's "Generalizable": Previous methods were like a chef who only knew how to cook one specific dish. If you gave them a new ingredient, they failed. DGGS is like a master chef who can cook any dish, even if the kitchen is messy. It works on outdoor parks, indoor rooms, and new places it has never seen before.
It's Fast: It doesn't need to stop and think for hours. It works in a "feed-forward" way, meaning it looks at the photos and spits out a clean 3D model almost instantly.
It's Better Than the Experts: Surprisingly, this automatic system is even better at finding the "bad" parts than some manual, scene-specific methods that require hours of tuning.

The Bottom Line

DGGS is like giving your 3D camera a pair of smart glasses. When you take photos in a busy city with cars and people moving around, the glasses automatically blur out the moving stuff and focus only on the buildings and trees, letting you build a perfect, stable 3D world instantly. It turns the chaotic "wild" of the real world into a clean, usable digital reality.

1. Problem Statement

The paper addresses a previously unexplored challenge in 3D reconstruction: making generalizable 3D Gaussian Splatting (3DGS) robust to distractors.

Context: Generalizable 3DGS aims to reconstruct 3D scenes from sparse reference images in a single feed-forward pass, without per-scene optimization. However, real-world "in-the-wild" data often contains distractors: transient objects such as pedestrians, vehicles, or balloons.
Challenges:
1. Training Instability: Existing generalizable models rely on 3D consistency across views. Distractors break this consistency, causing the model to learn incorrect geometric relationships and leading to training instability.
2. Inference Artifacts: During inference, distractors in reference images cannot be properly projected into 3D space, resulting in "ghosting," holes, and artifacts in the synthesized novel views.
3. Limitations of Existing Solutions: Current distractor-free methods are mostly scene-specific, requiring iterative optimization, sufficient input data, or external priors (like SfM or pre-trained segmentation) that do not scale to generalizable, feed-forward settings.

2. Methodology

The authors propose DGGS, a framework comprising two novel components: a Distractor-Free Generalizable Training Paradigm and a Distractor-Free Generalizable Inference Framework.

A. Distractor-Free Generalizable Training

The core innovation is a Reference-based Mask Prediction and Refinement module that operates without scene-specific supervision.

Core Observation: Non-distractor regions in reference views maintain high 3D consistency. When 3DGS is inferred from references and re-rendered back to the reference views, the error in static (non-distractor) regions is low, while distractor regions show high error.
Reference-based Mask Prediction:
- Instead of relying solely on the query view's residual loss (which often misclassifies hard-to-reconstruct static regions as distractors), the method uses re-rendered reference views to generate a "Robust Mask" ($M_{Rob}$).
- It filters $M_{Rob}$ by projecting non-distractor regions from references into the query view using multi-view consistency. This creates a Reference-based Mask ($M_Q$) that effectively removes false positives (static regions misidentified as distractors).
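The two steps above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions, not the authors' implementation: images are tiny grayscale nested lists, the fixed `threshold` is a hypothetical stand-in for however the paper thresholds the residual, and `static_in_query` (the projection of reference non-distractor regions into the query view) is assumed to be computed elsewhere. The function names are made up for this example.

```python
def rerender_error_mask(image, rerendered, threshold=0.2):
    """Flag pixels whose re-rendering error exceeds `threshold` (hypothetical value).

    Both inputs are 2D lists of grayscale intensities in [0, 1].
    Returns a 2D list of booleans; True marks a suspected distractor pixel.
    """
    return [
        [abs(a - b) > threshold for a, b in zip(img_row, ren_row)]
        for img_row, ren_row in zip(image, rerendered)
    ]


def reference_based_mask(robust_mask, static_in_query):
    """Drop suspected pixels that the references vouch for as static."""
    return [
        [flagged and not static for flagged, static in zip(m_row, s_row)]
        for m_row, s_row in zip(robust_mask, static_in_query)
    ]


# Toy 2x3 example: two pixels reconstruct poorly.
image = [[0.5, 0.5, 0.5], [0.5, 0.5, 0.5]]
rerendered = [[0.5, 0.9, 0.5], [0.45, 0.5, 0.1]]
m_rob = rerender_error_mask(image, rerendered)

# Assumed given: the references see pixel (0, 1) as consistent static content.
static_in_query = [[False, True, False], [False, False, False]]
m_q = reference_based_mask(m_rob, static_in_query)
```

In this toy case pixel (0, 1) is a false positive: it has a high residual, but the references agree it is static, so it is dropped from the filtered mask, while pixel (1, 2) stays flagged.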
Mask Refinement:
- Decoupling: The mask is decoupled into Disparity Error Areas (caused by view differences) and Distractor Areas.
- Segmentation & Filling: A pre-trained entity segmentation model (e.g., Entity Segmentation or SAM) is used to fill in the distractor regions identified in the query view.
- Auxiliary Loss: An auxiliary loss ($L_A$) is introduced to supervise occluded regions in the query view that are visible in the references, ensuring the model learns geometry even where the query view is blocked.
- Final Loss: The training loss is modified to use the refined mask $M$ and the auxiliary loss, ensuring the model optimizes only on consistent, static regions.
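As a rough illustration of segment filling and masked supervision, the sketch below expands a sparse distractor mask to whole segments and then averages the error over the remaining pixels only. Everything specific here is an assumption made for the example: the `min_overlap` ratio, the use of plain mean squared error in place of the paper's photometric loss, and the integer segment map standing in for the output of a segmentation model. The decoupling step and the auxiliary loss $L_A$ are not modeled.

```python
from collections import defaultdict


def fill_segments(mask, segments, min_overlap=0.5):
    """Mark a whole segment as distractor if enough of its pixels are flagged.

    mask:     2D list of booleans (True = flagged as distractor).
    segments: 2D list of integer segment ids with the same shape.
    """
    total = defaultdict(int)
    flagged = defaultdict(int)
    for mask_row, seg_row in zip(mask, segments):
        for is_flagged, seg_id in zip(mask_row, seg_row):
            total[seg_id] += 1
            flagged[seg_id] += int(is_flagged)
    chosen = {s for s in total if flagged[s] / total[s] >= min_overlap}
    return [[seg_id in chosen for seg_id in seg_row] for seg_row in segments]


def masked_mse(pred, target, mask):
    """Mean squared error over pixels that are NOT masked as distractors."""
    errors = [
        (p - t) ** 2
        for p_row, t_row, m_row in zip(pred, target, mask)
        for p, t, m in zip(p_row, t_row, m_row)
        if not m
    ]
    return sum(errors) / len(errors) if errors else 0.0


segments = [[0, 0, 1], [0, 1, 1]]
mask = [[False, False, True], [False, False, True]]
refined = fill_segments(mask, segments)  # segment 1 is 2/3 flagged, so it is filled

pred = [[1.0, 1.0, 0.0], [1.0, 0.0, 0.0]]
target = [[1.0, 0.5, 1.0], [1.0, 1.0, 1.0]]
loss = masked_mse(pred, target, refined)
```

The large errors inside segment 1 do not contribute to `loss`, which is the point of masking: only the consistent, static pixels drive the update.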

B. Distractor-Free Generalizable Inference

To handle artifacts during the feed-forward inference phase, DGGS introduces a Two-Stage Framework:

Stage 1: Reference Scoring and Re-selection:
- Given a pool of candidate reference images, the system scores them based on the predicted distractor masks and disparity.
- It selects the top $N$ references that have the minimal distractor content and minimal disparity relative to the query view. This ensures the input to the decoder is as clean as possible.
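A minimal sketch of the re-selection step, assuming each candidate has already been reduced to two numbers: the fraction of its pixels predicted as distractor and a disparity value relative to the query view. The additive cost and the `disparity_weight` parameter are assumptions for illustration; the summary above does not give the paper's exact scoring formula.

```python
def select_references(candidates, n, disparity_weight=1.0):
    """Return the names of the n lowest-cost candidates.

    candidates: list of (name, distractor_fraction, disparity) tuples.
    Lower cost means less distractor content and less disparity.
    """
    def cost(candidate):
        _, distractor_fraction, disparity = candidate
        return distractor_fraction + disparity_weight * disparity

    return [name for name, _, _ in sorted(candidates, key=cost)[:n]]


candidates = [
    ("a", 0.30, 0.10),  # cost 0.40
    ("b", 0.05, 0.20),  # cost 0.25
    ("c", 0.00, 0.60),  # cost 0.60: clean, but far from the query view
    ("d", 0.10, 0.05),  # cost 0.15
]
selected = select_references(candidates, n=2)
```

Note that candidate "c" is rejected even though it contains no distractors, because its viewpoint is too far from the query; this is the trade-off between cleanliness and disparity described above.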
Stage 2: Distractor Pruning:
- Even with clean references, residual distractors may persist due to imperfect masking.
- The system performs Distractor Pruning: It identifies 3D Gaussian primitives associated with distractor regions (based on projected masks) and prunes (removes) them from the 3D scene representation.
- Constraint: Pruning is only applied if the occlusion is not "common" across all references (to avoid removing valid static geometry that is simply occluded in some views).
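The pruning rule and its constraint can be illustrated as follows. The data layout is hypothetical: each Gaussian is assumed to be pixel-aligned (predicted from one pixel of one reference, `source_ref`) and to carry a precomputed set `masked_in` listing the references whose distractor masks cover it. Treating "masked in every reference" as a common occlusion is one reading of the constraint, not a confirmed detail of the method.

```python
def prune_gaussians(gaussians, num_refs):
    """Remove Gaussians that originate from distractor pixels.

    A Gaussian is kept when it is masked in every reference (a "common"
    occlusion), since no other view could replace the pruned content.
    """
    kept = []
    for g in gaussians:
        from_distractor = g["source_ref"] in g["masked_in"]
        common_occlusion = len(g["masked_in"]) == num_refs
        if from_distractor and not common_occlusion:
            continue  # prune: another reference can cover this region
        kept.append(g)
    return kept


gaussians = [
    {"id": 0, "source_ref": 0, "masked_in": set()},      # clean, keep
    {"id": 1, "source_ref": 1, "masked_in": {1}},        # distractor, prune
    {"id": 2, "source_ref": 2, "masked_in": {0, 1, 2}},  # common occlusion, keep
]
kept_ids = [g["id"] for g in prune_gaussians(gaussians, num_refs=3)]
```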

3. Key Contributions

First Generalizable Distractor-Free Framework: DGGS is the first work to tackle distractor removal in a generalizable, feed-forward setting, moving beyond scene-specific optimization.
Reference-Based Mask Prediction: A novel paradigm that leverages 3D consistency across references to predict distractor masks, achieving higher accuracy than scene-specific methods even without explicit mask supervision.
Two-Stage Inference Strategy: Combines Reference Scoring (selecting the best inputs) with Distractor Pruning (cleaning the 3D output), effectively mitigating the inference artifacts caused by distractors in the reference images.
Synthetic & Real-World Validation: The authors constructed synthetic distractor datasets based on Re10K and ACID to verify the method, alongside extensive testing on real-world datasets (On-the-go, RobustNeRF).

4. Experimental Results

The paper evaluates DGGS against state-of-the-art generalizable 3DGS methods (e.g., MVSplat, pixelSplat) and scene-specific distractor-free methods (e.g., NeRF-HuGS, SLS).

Quantitative Performance:
- On the RobustNeRF and On-the-go datasets, DGGS significantly outperforms baseline generalizable models.
- PSNR Improvement: DGGS achieves a mean PSNR of 21.74 dB, compared to 15.45 dB for the baseline MVSplat (a gain of 6.29 dB) and 19.29 dB for the best existing scene-specific re-trained method (+SLS).
- Generalization: DGGS trained on distractor-rich data generalizes well to unseen scenes, whereas baselines suffer severe performance drops.
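For readers unfamiliar with the metric, PSNR is computed from the mean squared error between a rendered image and the ground truth: $\mathrm{PSNR} = 10 \log_{10}(\mathrm{MAX}^2 / \mathrm{MSE})$, in decibels. The sketch below uses the standard definition on tiny nested-list images with intensities in [0, 1]; it is not taken from the paper's evaluation code, which may differ in details such as color handling or averaging across scenes.

```python
import math


def psnr(pred, target, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two same-sized 2D images."""
    squared_errors = [
        (p - t) ** 2
        for p_row, t_row in zip(pred, target)
        for p, t in zip(p_row, t_row)
    ]
    mse = sum(squared_errors) / len(squared_errors)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


# Every pixel is off by 0.1, so MSE is about 0.01 and PSNR is about 20 dB.
example_psnr = psnr([[0.5, 0.5]], [[0.6, 0.4]])
```

Because the scale is logarithmic, a gap of a few dB corresponds to a large change in mean squared error: each additional 10 dB means the error shrinks by a factor of ten.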
Qualitative Results:
- DGGS produces cleaner novel views with significantly fewer ghosting artifacts and holes compared to baselines.
- The predicted masks are more accurate than those from scene-specific methods, correctly identifying static regions that other methods mistakenly label as distractors.
Ablation Studies:
- Removing the Reference-based Mask Prediction leads to a significant drop in performance, confirming the importance of using reference consistency.
- The Distractor Pruning stage in inference provides the final boost in quality by removing residual artifacts.
- The method is robust to the choice of segmentation model (e.g., SAM2 vs. Entity Segmentation), indicating that 3D consistency is the primary driver of success.

5. Significance

Bridging the Gap: DGGS bridges the gap between controlled, static 3D reconstruction and the messy reality of "in-the-wild" mobile capture, making generalizable 3DGS viable for real-world applications like AR/VR and digital twins.
Paradigm Shift: It shifts the focus from "iterative optimization per scene" to "robust feed-forward inference," proving that high-quality 3D reconstruction is possible even with noisy, transient data without per-scene tuning.
Future Impact: The framework lays the groundwork for future community efforts in robust 3D reconstruction, potentially extending to dynamic scenes and broader real-world data challenges.

Limitations: The method still struggles with regions that are consistently occluded across all reference views (common occlusions), as the pruning strategy cannot distinguish between a distractor and a permanently hidden static object. Additionally, the reliance on segmentation models and two-stage inference introduces a slight computational overhead compared to raw feed-forward baselines.