Imagine you are trying to recreate a beautiful, complex 3D world (like a bustling city street or a forest) based on just a handful of blurry, low-quality photos you took with your phone.
This is the challenge of Novel View Synthesis (NVS). You want to look at the scene from a new angle that you didn't photograph, but because you only have a few "clues" (photos), the computer usually gets it wrong. It might invent weird, floating objects, make the walls look like melting wax, or leave huge holes where it doesn't know what to draw.
BetterScene is a new "magic tool" created by researchers at Ohio State University to fix these messy reconstructions. Here is how it works, explained through simple analogies:
1. The Problem: The "Blurry Sketch"
Think of existing methods (like standard 3D Gaussian Splatting) as a very fast artist who tries to draw a masterpiece based on only three blurry snapshots. They can get the general shape right, but the details are fuzzy. If they try to guess what's behind a tree they didn't see, they often hallucinate (imagine) weird, nonsensical things.
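To make the "fast artist" idea concrete, here is a toy 1D sketch of the splatting representation. Real 3D Gaussian Splatting uses millions of anisotropic 3D Gaussians with color and opacity; this illustrative snippet (not from the paper) only conveys the core idea that a scene is a sum of soft blobs:

```python
import numpy as np

# Toy 1D "splatting": the scene is a list of Gaussians, each an
# (center, width, weight) triple -- illustrative values, not real data.
gaussians = [
    (2.0, 0.5, 1.0),
    (5.0, 1.0, 0.6),
]

def render(xs):
    """Accumulate each Gaussian's contribution at the sample points xs."""
    out = np.zeros_like(xs)
    for mu, sigma, w in gaussians:
        out += w * np.exp(-0.5 * ((xs - mu) / sigma) ** 2)
    return out

xs = np.linspace(0, 8, 9)
image = render(xs)
assert image.argmax() == 2  # brightest sample sits at the heaviest Gaussian
```

With too few input photos, the fitted Gaussians are poorly constrained, which is exactly where the blur and "melting wax" artifacts come from.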
2. The Solution: The "Super-Editor"
The researchers used a powerful AI video generator (called Stable Video Diffusion) as their "Super-Editor." Think of this AI as a master painter who has seen billions of movies and knows exactly how light, shadows, and textures should look.
However, simply asking this master painter to "fix" the blurry sketch usually fails. Why? Because the painter's internal "notebook" (the Latent Space) where they store ideas is too small and rigid. It's like trying to write a complex novel on a sticky note; you have to leave out all the important details.
3. The Secret Sauce: Two New Tricks
The team realized that to get the master painter to do a good job, they needed to upgrade the "notebook" and the "rules" the painter follows. They introduced two key innovations:
A. The "Big Notebook" (High-Dimensional Latent Space)
- The Analogy: Imagine the AI's internal notebook usually has 4 pages. The researchers expanded it to 64 pages.
- The Benefit: With more pages, the AI can store much more detailed information about the scene. It can remember that the brick wall has a specific texture, or that the sign has a specific font, rather than just guessing "it's a wall."
- The Catch: Usually, bigger notebooks make the AI slower and worse at creating new things. The researchers solved this by teaching the AI how to use these extra pages efficiently.
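The "more pages, more detail" intuition can be demonstrated with a toy compression experiment (this is an analogy in code, not the paper's actual VAE): keeping more components of a truncated SVD is like giving the notebook more pages, and the reconstruction error drops accordingly.

```python
import numpy as np

# Compress a random 64x64 "image" into k components via truncated SVD,
# then reconstruct. More latent capacity -> lower reconstruction error.
rng = np.random.default_rng(0)
image = rng.standard_normal((64, 64))

def reconstruction_error(x, k):
    """Keep only the top-k singular components (a k-page 'notebook')."""
    u, s, vt = np.linalg.svd(x, full_matrices=False)
    approx = (u[:, :k] * s[:k]) @ vt[:k]
    return np.linalg.norm(x - approx) / np.linalg.norm(x)

err_small = reconstruction_error(image, 4)   # 4-page notebook
err_big = reconstruction_error(image, 64)    # 64-page notebook
assert err_big < err_small
```

The same trade-off appears in diffusion latents: a 4-channel latent must discard fine texture, while a 64-channel one can keep it, at the cost of being harder to train.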
B. The "Magic Mirror" (Equivariance & Alignment)
- The Analogy: Imagine you take a photo of a cat, then rotate the photo 90 degrees. If you ask the AI to describe the cat in the rotated photo, it should still recognize it as the same cat, just turned.
- The Problem: Old AI models would get confused. They might think the rotated cat was a completely different animal, causing the video to "glitch" or jump around when you change the camera angle.
- The Fix: The researchers trained the AI with a "Magic Mirror" rule. They taught it: "If I rotate the input, the internal representation must rotate in the exact same way." This ensures that when you move the camera, the scene moves smoothly and consistently, without sudden jumps or weird artifacts.
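The "Magic Mirror" rule has a precise form: encoding a rotated input should give the same result as rotating the encoded output. The toy check below (with illustrative stand-in encoders, not the paper's network) shows one function that obeys the rule and one that breaks it:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 8))

def equivariant_encode(img):
    """A pointwise nonlinearity commutes with rotation."""
    return np.tanh(img)

def non_equivariant_encode(img):
    """Adding a fixed left-to-right gradient breaks the symmetry."""
    return img + np.linspace(0.0, 1.0, img.shape[1])

# Equivariance holds: rotate-then-encode == encode-then-rotate.
assert np.allclose(equivariant_encode(np.rot90(x)),
                   np.rot90(equivariant_encode(x)))

# The gradient-adding encoder fails the same test.
assert not np.allclose(non_equivariant_encode(np.rot90(x)),
                       np.rot90(non_equivariant_encode(x)))
```

Training with a penalty on the gap between the two sides of that equality is one standard way to encourage equivariance, which is what keeps the scene from "jumping" as the camera moves.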
4. How It All Fits Together (The Assembly Line)
The BetterScene process works like a two-step assembly line:
- Step 1: The Rough Draft (MVSplat): A fast, feed-forward model quickly builds a "rough draft" of the 3D scene from your few photos. The draft arrives in an instant, but it is blurry and full of holes.
- Step 2: The Polish (BetterScene): This rough draft is fed into the "Super-Editor" (the upgraded Video Diffusion model).
- The Editor looks at the rough draft.
- Using its Big Notebook (64 channels), it recalls high-quality details.
- Using its Magic Mirror rules, it ensures the details stay consistent as you move around the scene.
- It outputs a photorealistic, high-definition view that looks like you took a professional photo from that new angle.
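The assembly line above can be sketched as a two-stage function composition. The stubs below are hypothetical placeholders (the real MVSplat and diffusion models are large neural networks); only the shape of the pipeline is meant to be accurate:

```python
# Hypothetical sketch of the two-step pipeline; function names and the
# dictionary fields are illustrative, not the authors' actual API.
def rough_draft(photos):
    """Step 1: fast feed-forward reconstruction (MVSplat-like)."""
    return {"views": photos, "quality": "blurry"}

def polish(draft, num_latent_channels=64):
    """Step 2: diffusion-based refinement in the wide latent space,
    trained with the equivariance ("Magic Mirror") constraint."""
    return {"views": draft["views"],
            "quality": "photorealistic",
            "latent_channels": num_latent_channels}

few_photos = ["img_01.png", "img_02.png", "img_03.png"]
result = polish(rough_draft(few_photos))
assert result["quality"] == "photorealistic"
```

The key design choice is that the fast model handles geometry while the slow, knowledge-rich diffusion model handles appearance, so neither has to do the other's job.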
Why This Matters
Previous methods were like trying to fix a low-resolution video by just sharpening the pixels; the result often looked grainy or fake. BetterScene is like taking that low-res video and re-rendering it from scratch using a supercomputer that understands physics and lighting.
In short: BetterScene takes a few messy photos, uses a super-smart AI with a "bigger brain" and "better rules" to fill in the missing gaps, and gives you a crystal-clear, 3D world you can walk around in, even though you only had a few snapshots to start with.