Semantic-Guided 3D Gaussian Splatting for Transient Object Removal

This paper proposes a semantic-guided 3D Gaussian Splatting framework that leverages vision-language models to identify and remove transient objects via category-aware filtering, effectively eliminating ghosting artifacts and resolving parallax ambiguity while maintaining low memory overhead and real-time rendering performance.

Aditi Prabakaran, Priyesh Shukla

Published 2026-02-18

Imagine you are trying to create a perfect, 3D hologram of a beautiful park using hundreds of photos taken from different angles. This is what 3D Gaussian Splatting (3DGS) does: it takes flat pictures and builds a 3D world that you can walk around in.

But here's the problem: in real life, people walk through the park, birds fly by, and balloons float past. When the computer tries to build the 3D model, it gets confused. It sees a person in one spot in one photo, an empty path in the next, and the same person somewhere else in a third. Instead of building a clean park, it creates a ghostly mess: a semi-transparent, blurry blob where the person walked. This is called "ghosting."

This paper introduces a clever new way to clean up these ghosts using AI that understands language and images, rather than just looking for movement.

The Old Way: The "Moving Detective"

Previously, computers tried to remove these ghosts by acting like a motion detective. They would say, "Hey, that pixel moved! It must be a person walking. Let's delete it."

The Flaw: This is like trying to clean a room by only throwing away things that move. But what if a static object (like a wall) looks different because the camera moved? The computer might think the wall is moving and delete it, or it might miss a person who stood perfectly still. It gets confused by parallax (how things look different from different angles).
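The motion-detective idea roughly corresponds to frame differencing: flag any pixel that changes between photos as "moving." This toy sketch (purely illustrative, not from the paper; real systems use optical flow, but the failure mode is the same) shows how camera motion alone can trip the detector:

```python
# Toy frame differencing: a pixel counts as "moving" if its value changes
# by more than a threshold between two photos. Illustrative only; the point
# is that camera motion also changes pixel values.

def moving_mask(frame_a, frame_b, threshold=10):
    """Return True where pixel values differ by more than `threshold`."""
    return [abs(a - b) > threshold for a, b in zip(frame_a, frame_b)]

# A static wall, photographed twice as a 1D strip of brightness values...
wall = [100, 100, 200, 200, 100]
# ...but the camera shifted one pixel sideways between the two shots.
wall_shifted = [100, 200, 200, 100, 100]

# The wall never moved, yet several of its pixels are flagged as "moving":
print(moving_mask(wall, wall_shifted))  # [False, True, False, True, False]
```

This is exactly the parallax trap: the detector cannot tell "the object moved" apart from "the camera moved," so static walls get deleted and motionless people get kept.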

The New Way: The "Smart Librarian"

The authors propose a new method called Semantic-Guided 3D Gaussian Splatting. Instead of asking "Did it move?", they ask "What is it?"

Think of the 3D scene not as a pile of pixels, but as a library of millions of tiny, glowing dots (called Gaussians). Each dot represents a tiny piece of the world.

  1. The Librarian (CLIP): The team uses a powerful AI called CLIP (which is like a librarian who has read every book and seen every picture in the world). This librarian knows what a "person," a "balloon," or a "hand" looks like, and what a "wall" or "building" looks like.
  2. The Tagging Process: As the computer builds the 3D model, it shows the librarian the current view. The librarian says, "Oh, that dot looks like a person," or "That dot is definitely a wall."
  3. The Scorecard: Every single glowing dot gets a score.
    • If a dot is seen often and looks like a wall, it gets a "Keep" score.
    • If a dot is seen often but looks like a person, it gets a "Trash" score.
  4. The Cleanup: The computer slowly fades out (regularizes) the dots with high "Trash" scores and eventually deletes them. The dots that look like walls are kept safe, even if they only appeared in a few photos.
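The scorecard-and-cleanup loop above can be sketched in a few lines. This is a minimal illustration under loose assumptions, not the paper's actual implementation: the prompt lists, the `clip_similarity` callback (standing in for projecting a Gaussian into the current view and querying CLIP), and all the threshold and decay values are hypothetical.

```python
# A minimal sketch of category-aware filtering of Gaussians.
# All names and numbers here are illustrative assumptions, not the paper's API.

TRANSIENT_PROMPTS = ["a person", "a balloon", "a hand"]   # things to remove
STATIC_PROMPTS = ["a wall", "a building", "a tree"]       # things to keep

def update_scores(gaussians, clip_similarity, decay=0.9):
    """Accumulate a running 'trash' score for each Gaussian.

    clip_similarity(gaussian, prompt) -> float in [0, 1] stands in for the
    vision-language model judging what this dot looks like in the current view.
    """
    for g in gaussians:
        transient = max(clip_similarity(g, p) for p in TRANSIENT_PROMPTS)
        static = max(clip_similarity(g, p) for p in STATIC_PROMPTS)
        # An exponential moving average keeps the score stable across many views,
        # so one ambiguous photo can't condemn a wall or acquit a person.
        g["trash_score"] = decay * g["trash_score"] + (1 - decay) * (transient - static)
    return gaussians

def fade_and_prune(gaussians, threshold=0.05, fade=0.5):
    """Regularize (fade) likely-transient Gaussians; prune them once invisible."""
    kept = []
    for g in gaussians:
        if g["trash_score"] > threshold:
            g["opacity"] *= fade      # fade out gradually rather than hard-delete
        if g["opacity"] > 0.01:      # delete only once effectively invisible
            kept.append(g)
    return kept
```

Running `update_scores` then `fade_and_prune` each training iteration, a dot that keeps looking like a "person" fades to nothing, while a dot that looks like a "wall" keeps its full opacity, no matter how few photos it appears in.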

The Magic Analogy: The "Ghost Hunter" vs. The "Name Tag"

  • The Old Method (Motion): Imagine trying to find a thief in a crowd by only watching who runs. If the thief stands still, you miss them. If a bystander trips, you might arrest them by mistake.
  • The New Method (Semantic): Imagine everyone in the crowd is wearing a name tag. You don't care if they are running or standing; you just look at the tag. If the tag says "Thief," you remove them. If the tag says "Bystander," you keep them, even if they are standing in a weird spot.

Why This Matters

  • No More Ghosts: The result is a clean 3D park without blurry, floating ghosts of people who walked through.
  • Lightweight: Unlike other methods that require massive computer power and memory (like trying to store a whole library in your head), this method is very efficient. It keeps the 3D model small and fast, so it can still be viewed in real-time (like a video game).
  • Smart Decisions: In the authors' experiments, the method correctly kept a wall that was visible in only 15% of the photos, because the AI recognized it as a "building," not because of any motion cue.

The Catch

The system isn't perfect yet. You have to tell the AI what you want to remove beforehand (e.g., "Please remove people and balloons"). If you don't tell it, it won't know. Also, if a person is very far away and tiny, the AI might not recognize them clearly.

In a Nutshell

This paper teaches computers how to understand what objects are instead of just watching how they move. By using a language-savvy AI to label the tiny building blocks of a 3D world, they can surgically remove unwanted distractions (like walking people) while keeping the beautiful, static scenery perfectly intact. It's like having a smart editor that knows the difference between the main character and the background extras, ensuring the final movie is clean and clear.
