Intrinsic Geometry-Appearance Consistency Optimization for Sparse-View Gaussian Splatting

Imagine you are trying to build a 3D model of a beautiful garden, but you only have three blurry photos of it taken from different angles.

Most current AI methods try to build this model by looking at those three photos and guessing what the rest of the garden looks like. The problem? They get really good at memorizing the three photos (so the photos look perfect), but they get the shape of the garden completely wrong.

It's like a painter who, instead of painting a tree, just paints a flat green blob that happens to look like a tree from exactly one angle. If you walk around the painting, the "tree" looks like a weird, floating smear. In the world of 3D graphics, these floating smears are called "floaters," and they make the new views look messy and fake.

This paper introduces a new method called ICO-GS (Intrinsic Geometry-Appearance Consistency Optimization). Think of it as a "Quality Control Manager" for 3D reconstruction. Here is how it works, using simple analogies:

The Core Problem: The "Lying Artist"

In standard 3D AI, the system has two jobs:

Geometry (The Shape): Figuring out where objects are in 3D space.
Appearance (The Look): Figuring out what color and texture they have.

When you only have a few photos, the AI gets lazy. It realizes, "Hey, if I just change the color of this floating blob to match the photo perfectly, I get a high score!" So, it fixes the look but ignores the shape. The result is a scene that looks great from the camera's spot but falls apart if you move even an inch.

The Solution: Two-Step "Truth" System

ICO-GS forces the AI to stop lying by making the Shape and the Look work together, like a strict teacher and a diligent student.

Step 1: The "Detective" (Robust Geometry Regularization)

First, the system acts like a detective trying to find the true shape of the garden, even if the photos are tricky.

The Problem: Sometimes photos are blocked by leaves (occlusions) or the lighting is weird.
The Fix: The AI compares what the photos say about each spot. If most of them agree but one disagrees (say, because a leaf is blocking its view), the AI trusts the ones that agree. It uses a "Top-K Selection" strategy (picking the most reliable clues) to ignore the bad, blocked, or confusing parts.
The Edge: It also knows that walls have sharp edges and grass has smooth gradients. It uses an "Edge-Aware" rule: "Keep the edges sharp, but smooth out the grass." This prevents the AI from creating a muddy, blurry mess.

Analogy: Imagine trying to assemble a puzzle in the dark. Instead of guessing, you only pick up pieces that clearly fit with at least half the other pieces you've already placed. You ignore the weird pieces that might be from a different puzzle.

Step 2: The "Virtual Mirror" (Geometry-Guided Appearance)

Now that the AI has a decent guess at the shape (thanks to Step 1), it uses that shape to fix the look.

The Problem: If the shape is wrong, the colors will be wrong too.
The Fix: The AI creates "Virtual Views." It takes its current 3D model and simulates taking a photo from a new angle that no one actually took.
The Safety Check: Before it uses these virtual photos to teach the AI, it runs a "Cycle Consistency" test. It asks: "If I project this point to the new view and then back to the old view, does it land in the same spot?" If the answer is "No," that part of the model is too shaky to trust, so the AI ignores it.
The Result: The AI learns the colors based on a reliable shape, not a floating guess.

Analogy: Imagine you are trying to learn what a statue looks like from all sides, but you only have a few photos. You build a rough clay model first. Then, you spin the clay model around to "imagine" what the back looks like. You only paint the back of the clay model if you are sure the clay is in the right place. If the clay is wobbly, you don't paint it yet. This ensures the paint (appearance) matches the structure (geometry).

Why This Matters

Previous methods were like a student who memorized the answers to a test but didn't understand the math. If you asked a slightly different question, they failed.

ICO-GS is like a student who understands the math. Because it forces the Shape and the Color to agree with each other:

No more "Floaters": The 3D objects stay solid and grounded.
Better Details: Even in tricky areas like leafy trees or smooth walls, the texture looks real.
Works with Few Photos: It can build a high-quality 3D world from just 3 or 4 photos, which is a huge deal for things like virtual reality or digital archiving where you can't take hundreds of pictures.

The Bottom Line

ICO-GS is a smarter way to build 3D worlds from 2D photos. It stops the AI from cheating by memorizing pictures and forces it to build a physically correct 3D structure first, then paint it. The result is a 3D scene that looks real, stays solid, and feels like a real place you could walk around in.

1. Problem Statement

Core Challenge: While 3D Gaussian Splatting (3DGS) achieves real-time, high-fidelity rendering in dense-view settings, it suffers severe degradation in sparse-view scenarios (e.g., 3–9 input images).
Root Cause: The paper identifies a fundamental intrinsic geometry-appearance discrepancy.

Under-constrained Geometry: With limited views, the photometric loss only constrains the projected appearance of Gaussians, not their 3D positions. This leads to depth ambiguity, allowing Gaussians to float or blur while still minimizing rendering error on training views.
Appearance Overfitting: To compensate for geometric errors, the optimization process adjusts Gaussian colors and opacities ("appearance compensation") to fit the training views. This creates a "shortcut" where the model learns plausible 2D textures for training views but fails to reconstruct the true 3D structure, resulting in severe artifacts (floaters, blurriness) on novel views.
Weak Textures: The problem is exacerbated in weakly-textured regions where appearance cues are insufficient to resolve geometric ambiguity.
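The depth ambiguity described above can be seen in a toy two-camera setup. The sketch below is illustrative only: the intrinsics K, the camera poses, and the helper project are made-up values and names, not taken from the paper. Two points at different depths along one training-view ray land on the same training pixel, so a photometric loss on that view cannot tell them apart, yet they separate as soon as the camera moves.

```python
import numpy as np

def project(point_world, K, R, t):
    """Pinhole projection of a 3D world point to pixel coordinates."""
    p_cam = R @ point_world + t      # world frame -> camera frame
    uvw = K @ p_cam                  # camera frame -> homogeneous pixels
    return uvw[:2] / uvw[2]

# Hypothetical intrinsics: focal length 100 px, principal point (50, 50).
K = np.array([[100.0, 0.0, 50.0],
              [0.0, 100.0, 50.0],
              [0.0, 0.0, 1.0]])

# Training camera at the origin; novel camera shifted 1 unit along +x.
R_train, t_train = np.eye(3), np.zeros(3)
R_novel, t_novel = np.eye(3), np.array([-1.0, 0.0, 0.0])

# Two candidate positions for the same Gaussian along one training-view ray.
ray = np.array([0.2, -0.1, 1.0])
near, far = 2.0 * ray, 5.0 * ray

train_near = project(near, K, R_train, t_train)
train_far = project(far, K, R_train, t_train)
novel_near = project(near, K, R_novel, t_novel)
novel_far = project(far, K, R_novel, t_novel)

# Same pixel in the training view, different pixels in the novel view.
print("training view:", train_near, train_far)
print("novel view:   ", novel_near, novel_far)
```

A Gaussian placed at either depth fits the training image equally well, which is why an incorrectly placed Gaussian only becomes visible as a floater once the viewpoint changes.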

Existing solutions often rely on external depth priors (which introduce scale ambiguity and noise) or on dense initialization (whose benefit fades as optimization proceeds). Neither enforces mutual reinforcement between geometry and appearance.

2. Methodology: ICO-GS

The authors propose ICO-GS, a principled framework that enforces intrinsic consistency between geometry and appearance through two synergistic components.

A. Robust Geometric Regularization

To constrain geometry without external priors, the method enforces multi-view photometric consistency using deep features.

Feature-Based Matching: Instead of raw RGB (sensitive to lighting), the method uses features from a frozen pre-trained network to compute photometric consistency between warped views.
Pixel-wise Top-$k$ Selection: To handle occlusions prevalent in sparse views, the method computes photometric errors across all source views and retains only the top-$k$ most consistent correspondences. This adaptively filters out occluded or unreliable observations.
Edge-Aware Depth Smoothness: For regions visible in only one view, where multi-view consistency provides no constraint, a smoothness term is applied to the depth map. Weighted by image gradients, it encourages smooth depth in textureless areas while preserving sharp discontinuities at object boundaries.
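The top-$k$ selection and the edge-aware smoothness term can be sketched in NumPy as follows. This is a minimal illustration, not the paper's implementation: it assumes the per-pixel feature errors have already been computed by warping each source view into the reference view, and the function names, the gamma weight, and the exponential edge weighting (a formulation common in self-supervised depth estimation) are assumptions.

```python
import numpy as np

def topk_consistency_loss(errors, k):
    """Average the k smallest per-pixel errors across source views.

    errors: array of shape (S, H, W), one error map per source view.
    Views with large error at a pixel (e.g. occluded there) are dropped.
    """
    num_views = errors.shape[0]
    if not 1 <= k <= num_views:
        raise ValueError("k must be between 1 and the number of source views")
    # After partitioning, the first k slots along axis 0 hold the k smallest
    # values at each pixel (in arbitrary order).
    smallest = np.partition(errors, k - 1, axis=0)[:k]
    return smallest.mean()

def edge_aware_smoothness(depth, image, gamma=1.0):
    """Penalize depth gradients, down-weighted where the image has edges.

    depth: (H, W) rendered depth; image: (H, W, 3) colors in [0, 1].
    """
    depth_dx = np.abs(depth[:, 1:] - depth[:, :-1])
    depth_dy = np.abs(depth[1:, :] - depth[:-1, :])
    image_dx = np.abs(image[:, 1:] - image[:, :-1]).mean(axis=-1)
    image_dy = np.abs(image[1:, :] - image[:-1, :]).mean(axis=-1)
    # Strong image gradients give small weights, so depth edges that
    # coincide with image edges are penalized less.
    weight_x = np.exp(-gamma * image_dx)
    weight_y = np.exp(-gamma * image_dy)
    return (depth_dx * weight_x).mean() + (depth_dy * weight_y).mean()
```

With three source views and k = 2, a pixel whose error is large in exactly one view (for instance, because that view is occluded) contributes only its two smaller errors to the loss.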

B. Geometry-Guided Appearance Optimization

To prevent appearance overfitting, the method uses the now-reliable geometry to supervise appearance learning via virtual views.

Cycle-Consistency Depth Filtering (CCDF): Before synthesizing virtual views, the method validates the rendered depth. It performs forward-backward warping between views; pixels where the re-projected depth matches the original depth within a threshold are marked as "reliable." This filters out noisy depth estimates.
Virtual View Synthesis: Using the filtered reliable depth, the method synthesizes virtual novel views by warping pixels from training images.
Virtual-View Photometric Loss: The synthesized virtual images are used as additional supervision. The model renders the scene from these virtual poses and minimizes the difference between the rendered and synthesized images. This forces the appearance to be consistent across viewpoints, guided by the geometrically correct depth, rather than overfitting to specific training angles.
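The forward-backward check behind cycle-consistency depth filtering can be sketched as below. This is a simplified illustration under stated assumptions, not the authors' code: both views share one intrinsics matrix K, T_ab is a hypothetical 4x4 transform from camera A to camera B, lookups use nearest-neighbor sampling, and the relative threshold rel_thresh is an arbitrary choice.

```python
import numpy as np

def backproject(depth, K):
    """Lift every pixel to a 3D point in its camera frame; returns (H, W, 3)."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).astype(np.float64)
    rays = pix @ np.linalg.inv(K).T
    return rays * depth[..., None]

def transform(points, T):
    """Apply a 4x4 rigid transform to an array of (..., 3) points."""
    return points @ T[:3, :3].T + T[:3, 3]

def cycle_consistency_mask(depth_a, depth_b, K, T_ab, rel_thresh=0.01):
    """Mark pixels of view A whose depth survives an A -> B -> A round trip."""
    h, w = depth_a.shape
    # Forward: lift A's pixels with A's depth and project them into B.
    pts_b = transform(backproject(depth_a, K), T_ab)
    z_b = pts_b[..., 2]
    safe_z = np.where(z_b > 1e-8, z_b, 1.0)   # placeholder where z is invalid
    uv_b = (pts_b @ K.T)[..., :2] / safe_z[..., None]
    ub = np.rint(uv_b[..., 0]).astype(int)
    vb = np.rint(uv_b[..., 1]).astype(int)
    inside = (z_b > 1e-8) & (ub >= 0) & (ub < w) & (vb >= 0) & (vb < h)
    ub, vb = np.clip(ub, 0, w - 1), np.clip(vb, 0, h - 1)
    # Backward: lift the landing pixel with B's own depth and return to A.
    pix_b = np.stack([ub, vb, np.ones_like(ub)], axis=-1).astype(np.float64)
    pts_b_obs = (pix_b @ np.linalg.inv(K).T) * depth_b[vb, ub][..., None]
    z_back = transform(pts_b_obs, np.linalg.inv(T_ab))[..., 2]
    rel_err = np.abs(z_back - depth_a) / np.maximum(depth_a, 1e-8)
    return inside & (rel_err < rel_thresh)
```

Pixels where the mask is False would be excluded before warping training pixels into a virtual view, so that unreliable depth does not leak into the synthesized supervision.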

C. Optimization Pipeline

The training follows a curriculum learning strategy:

Stage 1: Standard 3DGS optimization to establish coarse geometry.
Stage 2: Activation of geometric regularization (feature-based consistency + smoothness) to refine structure.
Stage 3: Activation of geometry-guided appearance optimization (virtual view consistency) to refine photometry and prevent overfitting.
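The staged schedule above can be expressed as a small helper that decides which loss terms are active at a given training step. The step boundaries geo_start and app_start and the loss names are hypothetical placeholders, and the sketch assumes that a term stays active once its stage begins.

```python
def active_losses(step, geo_start=2000, app_start=5000):
    """Return the loss terms enabled at a training step (hypothetical schedule)."""
    losses = ["photometric"]                      # Stage 1: standard 3DGS loss
    if step >= geo_start:                         # Stage 2: geometric regularization
        losses += ["feature_consistency", "depth_smoothness"]
    if step >= app_start:                         # Stage 3: virtual-view supervision
        losses.append("virtual_view")
    return losses

for step in (0, 2000, 5000):
    print(step, active_losses(step))
```

Delaying the virtual-view term until the geometry has been regularized matches the ordering described above: virtual views are only as trustworthy as the depth used to synthesize them.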

3. Key Contributions

Identification of Intrinsic Consistency: The paper formally defines the coupled correctness of geometry and appearance as a fundamental principle for sparse-view 3DGS, revealing how their decoupling causes novel-view failure.
Robust Geometric Regularization: A novel approach using feature-based multi-view consistency with top-$k$ selection and edge-aware smoothing to constrain geometry without external priors.
Geometry-Guided Appearance: A mechanism to synthesize virtual views from cycle-filtered reliable depth, ensuring that appearance optimization is driven by accurate geometry rather than noisy estimates.
State-of-the-Art Performance: The framework achieves superior results across multiple benchmarks, particularly in challenging weakly-textured regions.

4. Experimental Results

The method was evaluated on LLFF (forward-facing), DTU (object-centric with weak textures), and Blender (360°) datasets under sparse settings (3, 6, and 9 views, with an 8-view setting reported for Blender).

Quantitative Performance:
- LLFF (3-view): Achieved 22.20 dB PSNR, outperforming the previous best (ComapGS at 21.11 dB) by 1.09 dB.
- DTU (3-view): Achieved 21.77 dB PSNR, outperforming BinocularGS (20.71 dB) by 1.06 dB.
- Blender (8-view): Achieved 25.56 dB PSNR, the highest among all compared methods.
Qualitative Improvements:
- Significantly reduced "floater" artifacts and blurriness.
- Sharper structural boundaries and faithful texture recovery in weakly-textured regions (e.g., leaf gaps, smooth surfaces).
- Depth maps show clearer geometric edges compared to baselines.
Ablation Studies: Removing any single component (feature consistency, smoothness, cycle filtering, or virtual-view loss) resulted in measurable performance drops (e.g., removing Cycle-Consistency Depth Filtering caused a ~0.5 dB drop on DTU), supporting the necessity of the full pipeline.
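To put the decibel figures above in perspective, the sketch below computes PSNR from its standard definition, 10 * log10(MAX^2 / MSE), and converts a PSNR gain into the implied ratio of mean squared errors. The function names are illustrative, and the conversion treats the gain as if it applied to a single image.

```python
import numpy as np

def psnr(pred, target, max_val=1.0):
    """Peak signal-to-noise ratio in dB for images scaled to [0, max_val]."""
    diff = np.asarray(pred, dtype=float) - np.asarray(target, dtype=float)
    mse = np.mean(diff ** 2)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10(max_val ** 2 / mse)

def mse_ratio_for_gain(gain_db):
    """Ratio new_MSE / old_MSE implied by a PSNR improvement of gain_db."""
    return 10.0 ** (-gain_db / 10.0)

print(psnr(np.zeros((2, 2)), np.full((2, 2), 0.1)))   # uniform error of 0.1
print(mse_ratio_for_gain(1.09))                        # LLFF gain from above
```

Under this reading, a gain of about 1.09 dB corresponds to roughly 22% lower mean squared error. Benchmark PSNR is typically averaged over images and scenes, so the correspondence is only approximate.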

5. Significance

ICO-GS represents a significant advancement in sparse-view novel view synthesis by addressing the root cause of 3DGS failure: the lack of intrinsic consistency between geometry and appearance.

No External Priors: Unlike many competitors, it does not rely on potentially noisy or scale-ambiguous pre-trained depth models.
Self-Correcting Mechanism: By creating a feedback loop in which geometry guides appearance and appearance (via virtual views) reinforces geometry, it addresses the "chicken-and-egg" problem of sparse reconstruction, where good geometry requires good appearance and vice versa.
Practical Impact: The method enables high-quality 3D reconstruction from very few images, making 3DGS more viable for real-world applications where capturing dense data is impractical (e.g., mobile photography, historical preservation).

The paper concludes that enforcing intrinsic consistency is the key to unlocking the full potential of Gaussian Splatting in sparse-view regimes.