GIFSplat: Generative Prior-Guided Iterative Feed-Forward 3D Gaussian Splatting from Sparse Views

Imagine you are trying to build a 3D model of a room, but you only have a few photos of it taken from different angles. This is a classic problem in computer vision: how do you fill in the missing pieces to get a convincing 3D scene?

The paper introduces a new method called GIFSplat that solves this problem by acting like a super-fast, self-correcting artist.

Here is the breakdown using simple analogies:

1. The Problem: The "One-Shot" vs. The "Slow Sculptor"

Currently, there are two main ways computers try to build these 3D worlds:

The Slow Sculptor (Traditional Optimization): Imagine a sculptor who has a block of clay. They look at the photos, chisel a bit, step back, look again, chisel more, and repeat this thousands of times until it looks right.
- Pros: Very high quality.
- Cons: It takes forever (minutes or hours) and gets confused if the photos are sparse or weird.
The One-Shot Artist (Existing Feed-Forward AI): Imagine a magician who looks at the photos once and instantly snaps their fingers to produce a 3D model.
- Pros: Instant (milliseconds).
- Cons: If the photos are tricky, the model comes out with weird glitches, blurry textures, or missing parts. They can't go back and fix mistakes because they only get one try.

The Goal: We want the speed of the magician but the quality of the sculptor, without waiting for the sculptor to finish.

2. The Solution: GIFSplat (The "Iterative Refiner")

GIFSplat is like a magician who gets to peek at their own work and make tiny, instant corrections.

Instead of snapping their fingers once and being done, GIFSplat does this:

The First Snap: It makes a quick, rough guess at the 3D scene (just like the one-shot artist).
The "Check-Up": It looks at the rough guess and compares it to the original photos. It asks, "Where is this blurry? Where is the texture wrong?"
The Tiny Tweaks: Instead of starting over, it makes small, forward-only adjustments to fix those specific errors. It does this a few times (like 3 quick steps).
The Result: A high-quality 3D model created in seconds, not minutes.

3. The Secret Sauce: The "Generative Prior" (The Imagination Boost)

Sometimes, the photos are so sparse (like looking at a room from just two corners) that the computer has no idea what the missing wall looks like. It's like trying to guess the rest of a puzzle with half the pieces missing.

The Old Way: The computer would just guess randomly or leave a blurry hole.
The GIFSplat Way: It uses a frozen "Imagination Engine" (a pre-trained AI model known as a diffusion model).
- Think of this engine as a super-artist who has seen millions of rooms.
- When the computer is stuck, it asks the Imagination Engine: "Hey, what does a door usually look like in this lighting?"
- The Engine doesn't rebuild the whole scene; it just sends a tiny note (a "cue") saying, "Make the door frame sharper here."
- GIFSplat uses this note to fix the 3D model instantly.

Crucially: The computer doesn't ask the Imagination Engine to do the work for it (which would be slow). It just asks for a hint and applies it instantly. This keeps the process fast.

4. Why is this a Big Deal?

Speed: It works in seconds (like a video game loading a level), whereas the high-quality methods take minutes.
Robustness: It works even when you have very few photos, or scenes unlike the ones it was trained on (out-of-domain data).
No "Training" at the End: Usually, to get a perfect result, you have to "train" the model on that specific scene for a long time. GIFSplat figures it out on the fly without needing that extra time.

Summary Analogy

Imagine you are trying to draw a portrait from a single, slightly blurry photo.

Old AI: Draws the whole face in one second. It looks okay, but the eyes are a bit off.
Traditional Optimization: Spends an hour staring at the photo, erasing and redrawing the eyes until they are perfect.
GIFSplat: Draws the face in one second. Then, it looks at the drawing, realizes the eyes are off, and quickly sketches over them to fix them. It then asks a "Mentor" (the Generative Prior) for a quick tip on how to make the hair look realistic, applies that tip, and finishes.

The result? A far better portrait than the one-second sketch, in a fraction of the time the slow approach needs.

1. Problem Definition

The paper addresses the limitations of current 3D reconstruction methods, specifically focusing on 3D Gaussian Splatting (3DGS) from sparse, unposed views. The field currently faces a trade-off between two paradigms:

Per-scene Optimization: Methods like standard 3DGS achieve high fidelity but require thousands of gradient descent steps for every new scene. This makes them computationally expensive (slow), fragile under sparse views, and ill-suited to incorporating external knowledge (priors).
Feed-Forward (One-Shot) Methods: Recent approaches (e.g., PixelSplat, AnySplat) infer 3D attributes in a single forward pass, offering millisecond-to-second inference times. However, they suffer from three main issues:
1. Limited Capacity: They are strictly bounded by model capacity, leading to artifacts and lower fidelity in complex scenes.
2. Lack of Refinement: They cannot adapt to specific scenes or correct residual errors after the initial prediction.
3. Prior Integration Failure: Existing attempts to integrate generative priors (e.g., diffusion models) into 3D reconstruction typically require iterative optimization loops that expand the view set, destroying the efficiency of feed-forward pipelines.

The Core Challenge: How to achieve feed-forward efficiency (seconds or less) while enabling scene-specific refinement and the injection of generative priors without requiring test-time gradient backpropagation or expensive optimization loops.

2. Methodology: GIFSplat

The authors propose GIFSplat, a framework that bridges the gap between one-shot feed-forward methods and optimization-based pipelines by following a feed-forward initialization with a few forward-only refinement steps.

A. Framework Overview

The system consists of three components:

Gaussian Initializer: A feed-forward network (based on AnySplat but without voxelization) that predicts camera poses and an initial set of 3D Gaussians ($G^{(0)}$) from sparse input views.
Iterative Gaussian Head: A lightweight, weight-shared module that performs $T$ forward-only residual updates. Instead of backpropagating gradients, it predicts residual corrections ($\Delta G$) to the current Gaussian state based on observation cues and prior cues.
Generative Prior Fusion: A module that distills a frozen diffusion model into lightweight cues to guide the refinement.

B. Iterative Residual Refinement (Forward-Only)

Unlike optimization methods that minimize loss via gradients, GIFSplat approximates the minimization of rendering discrepancies through a sequence of forward passes:

Observation Cues ($o_i$): At each step $t$, the current 3D scene is rendered. The system computes feature differences between the input images and the rendered views using a frozen feature extractor ($\psi$). These pixel-level differences are pooled to the Gaussian level via soft assignment weights.
Residual Prediction: The iterative head takes the current Gaussian state and the observation cues to predict a residual update ($\Delta G^{(t)}$).
Update Rule: The new state is $G^{(t+1)} = G^{(t)} + \Delta G^{(t)}$. This process repeats for $T$ steps (typically 3), progressively refining geometry and appearance without ever computing gradients with respect to the scene parameters.
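
The refinement loop described above can be sketched in a few lines of NumPy. This is a toy illustration under stated assumptions, not the authors' implementation: the linear `render` function stands in for splatting plus the frozen feature extractor $\psi$, `toy_head` is a fixed hand-written rule standing in for the learned iterative head, and every name and size here is hypothetical.

```python
import numpy as np

def pool_to_gaussians(pixel_diff, weights):
    """Pool per-pixel differences (P, C) to per-Gaussian cues (N, C).

    weights[p, i] is a soft assignment of pixel p to Gaussian i; each
    Gaussian gets the weighted average of the differences it touches.
    """
    mass = weights.sum(axis=0)[:, None]  # (N, 1) total weight per Gaussian
    return (weights.T @ pixel_diff) / np.maximum(mass, 1e-8)

def refine(G0, target, render, weights, head, T=3):
    """Apply T forward-only residual updates: G <- G + head(G, cues)."""
    G = G0.copy()
    for _ in range(T):
        pixel_diff = target - render(G)                # observation discrepancy
        cues = pool_to_gaussians(pixel_diff, weights)  # one cue per Gaussian
        G = G + head(G, cues)                          # no gradients involved
    return G

# Toy setup (all hypothetical): 5 "Gaussians" with 4 numbers each,
# 32 "pixels", and softmax-normalized soft assignment weights.
rng = np.random.default_rng(0)
N, P, C = 5, 32, 4
logits = rng.normal(size=(P, N))
weights = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

def render(G):
    # Assumption: a linear stand-in for rendering + feature extraction.
    return weights @ G

def toy_head(G, cues):
    # Assumption: a fixed rule in place of the learned, weight-shared head.
    return 0.5 * cues

G_true = rng.normal(size=(N, C))
G0 = G_true + rng.normal(scale=0.5, size=(N, C))  # imperfect initial guess
target = render(G_true)

err_before = np.linalg.norm(target - render(G0))
G_T = refine(G0, target, render, weights, toy_head, T=3)
err_after = np.linalg.norm(target - render(G_T))
print(f"render error: {err_before:.4f} -> {err_after:.4f}")
```

In this toy setup the rendering error shrinks with each step. In the method described here, the head is a trained network whose weights are shared across all $T$ steps.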

C. Generative Prior Fusion

To handle sparse views and domain shifts where observation cues are weak, the method injects generative knowledge:

Diffusion Enhancement: A frozen diffusion-based enhancer (Difix3D+) refines the intermediate rendered views to produce "enhanced" images with sharper textures and fewer artifacts.
Cue Distillation: Instead of re-optimizing the scene using these enhanced views (which would be slow), the system computes the feature-space difference between the enhanced rendering and the original rendering.
Prior Cues ($p_i$): This difference is pooled to the Gaussian level, creating "prior cues" that represent high-frequency details and structural corrections.
Fusion: The observation cues and prior cues are concatenated and fed into the iterative head to guide the next residual update. This allows the model to "hallucinate" missing details in a feed-forward manner.
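
The cue construction above might look roughly like the following. It is a minimal sketch under assumptions: the feature arrays are random placeholders rather than outputs of a real extractor or diffusion enhancer, the sign convention (enhanced minus rendered) is a guess, and both cues are pooled with the same `weights` for brevity. All function and variable names are hypothetical.

```python
import numpy as np

def pool_to_gaussians(pixel_values, weights):
    """Weighted average of per-pixel values (P, C) for each Gaussian (N, C)."""
    mass = weights.sum(axis=0)[:, None]
    return (weights.T @ pixel_values) / np.maximum(mass, 1e-8)

def fused_cues(input_feat, render_feat, enhanced_feat, weights):
    """Build per-Gaussian cues and concatenate them along the feature axis."""
    # Observation cue o_i: what the input views say is wrong with the render.
    obs = pool_to_gaussians(input_feat - render_feat, weights)
    # Prior cue p_i: what the frozen enhancer changed in the render.
    prior = pool_to_gaussians(enhanced_feat - render_feat, weights)
    return np.concatenate([obs, prior], axis=-1)  # (N, 2C)

# Placeholder data: 16 pixels, 3 Gaussians, 8 feature channels.
rng = np.random.default_rng(0)
P, N, C = 16, 3, 8
logits = rng.normal(size=(P, N))
weights = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

input_feat = rng.normal(size=(P, C))
render_feat = rng.normal(size=(P, C))
enhanced_feat = render_feat.copy()  # pretend the enhancer changed nothing

cues = fused_cues(input_feat, render_feat, enhanced_feat, weights)
print(cues.shape)
```

When the enhancer leaves a rendering untouched, the prior half of the cue vector is zero and only the observation cues drive the update. The expensive diffusion model is used only to produce `enhanced_feat`, and the scene is never re-optimized against the enhanced images.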

3. Key Contributions

Iterative Feed-Forward Mechanism: A novel update scheme that refines a fixed set of Gaussians via multi-step, forward-only residual updates. This enables scene-specific adaptation without test-time gradient descent.
Generative Prior Fusion without Optimization: A method to distill a frozen diffusion prior into lightweight Gaussian-level discrepancy cues. This injects generative knowledge into the refinement loop without backpropagation or the need to continuously expand the set of reference views (avoiding "view explosion").
Pose-Free Robustness: The framework operates without requiring known camera poses, making it suitable for unconstrained, sparse-view inputs.
Efficiency vs. Quality Balance: It maintains second-scale inference times while achieving reconstruction quality that rivals or exceeds optimization-based methods.

4. Experimental Results

The method was evaluated on DL3DV, RealEstate10K, and DTU datasets.

Quantitative Performance:
- RealEstate10K (2-view): GIFSplat outperformed state-of-the-art feed-forward baselines (e.g., AnySplat, FLARE, MVSplat) across all overlap settings. It achieved a PSNR improvement of up to +2.1 dB over the best baselines.
- DL3DV (8-view): Consistently achieved the highest PSNR, SSIM, and lowest LPIPS among feed-forward methods, even without camera pose inputs.
- DTU (Out-of-Domain): When trained on RealEstate10K and tested on DTU, GIFSplat showed superior generalization, outperforming baselines by over 2 dB in PSNR, demonstrating robustness to domain shifts.
Qualitative Results:
- GIFSplat produces sharper edges, more faithful textures, and significantly fewer artifacts (e.g., texture sticking, blurring) compared to one-shot methods.
- It successfully corrects implausible deformations (e.g., warped doors or wardrobes) using generative priors.
Efficiency:
- Inference time scales linearly with the number of refinement steps ($T$).
- Even with 3 refinement steps and generative prior fusion, total inference time stays on the order of seconds, preserving the speed advantage of feed-forward methods.
Ablation Studies:
- Removing the iterative refinement caused the largest performance drop, confirming the necessity of the multi-step process.
- Removing the generative prior reduced perceptual fidelity (worse LPIPS), indicating its value in under-constrained regions.
- Removing window attention significantly degraded performance, highlighting the importance of modeling local 3D relationships.
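
To put the PSNR gains quoted above in perspective, here is the standard conversion between a PSNR difference and a mean-squared-error ratio. Treat it as a back-of-the-envelope reading: it is exact only for a single image, and PSNR averaged over a dataset will not convert this cleanly. The subscripts "base" and "new" are labels introduced here for the baseline and the improved method.

$$
\mathrm{PSNR} = 10 \log_{10}\frac{\mathrm{MAX}^2}{\mathrm{MSE}}
\quad\Longrightarrow\quad
\Delta\mathrm{PSNR} = 10 \log_{10}\frac{\mathrm{MSE}_{\text{base}}}{\mathrm{MSE}_{\text{new}}}
$$

$$
\Delta\mathrm{PSNR} = 2.1\ \text{dB}
\quad\Longrightarrow\quad
\frac{\mathrm{MSE}_{\text{new}}}{\mathrm{MSE}_{\text{base}}} = 10^{-0.21} \approx 0.62
$$

So a gain of about 2 dB corresponds to roughly 38% lower squared pixel error.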

5. Significance and Impact

GIFSplat represents a significant shift in 3D reconstruction paradigms. It successfully decouples refinement capability from gradient-based optimization.

Practicality: By keeping inference at second scale, it makes high-fidelity 3D reconstruction more practical for latency-sensitive applications (AR/VR, robotics) where per-scene optimization is too slow.
Robustness: It handles the "sparse view" problem more effectively than previous feed-forward models by leveraging generative priors without the computational cost of gradient-based, per-scene optimization.
Future Direction: It opens a new avenue for "iterative feed-forward" learning, suggesting that complex tasks requiring refinement can be solved via unrolled forward passes with shared weights rather than expensive backpropagation loops.

In summary, GIFSplat achieves the "best of both worlds": the speed of feed-forward inference and the quality/adaptability of optimization-based methods enhanced by generative AI.