A Single Image and Multimodality Is All You Need for Novel View Synthesis

This paper proposes a multimodal depth reconstruction framework that leverages sparse range sensing data (e.g., radar or LiDAR) to generate robust, uncertainty-aware dense depth maps. These depth maps overcome the limitations of unreliable monocular depth estimates and significantly improve the geometric consistency and visual quality of diffusion-based single-image novel view synthesis.

Amirhosein Javadi, Chi-Shiang Gau, Konstantinos D. Polyzos, Tara Javidi

Published 2026-02-23
📖 4 min read · ☕ Coffee break read

Imagine you are trying to paint a 3D movie scene based on just one single photograph. You want to move the camera around that photo to see what's behind the trees, to the left of the car, or around the corner. This is called "Novel View Synthesis."

For a long time, AI has been trying to do this by guessing the depth (how far away things are) just by looking at the picture. But here's the problem: AI is bad at guessing depth in tricky situations. If a wall is plain white, if it's raining, or if a car is blocking the view, the AI gets confused. It might think a flat wall is a deep cave, or that a distant mountain is right next to the camera. When the AI tries to move the camera based on these bad guesses, the movie looks glitchy, warped, and inconsistent.

This paper proposes a simple but brilliant solution: "Don't just guess; use a little bit of real data."

Here is the breakdown of their idea using everyday analogies:

1. The Problem: The "Blind Painter"

Think of the current AI (which only looks at the photo) as a painter who is blind to depth, trying to recreate a 3D room. They can see the photo's shadows and colors, but they have to guess how far away the furniture actually is.

  • The Issue: If the room is foggy or the furniture is plain, the painter guesses wrong. When they try to "move" the camera to show the back of the sofa, they might paint the sofa floating in mid-air or stretching into infinity. The result is a messy, unrealistic video.

2. The Solution: The "Radar Flashlight"

The authors say, "Let's give the painter a Radar Flashlight (or a LiDAR sensor, like what self-driving cars use)."

  • This sensor doesn't take a pretty picture; it just sends out a few "pings" to measure distance.
  • The Catch: It's very sparse. Imagine throwing a handful of darts at a wall and only hitting a few spots. You don't have a full picture of the wall, just a few dots telling you, "Hey, there is a wall here, about 10 feet away."
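
To make the "handful of darts" concrete, here is a minimal sketch (not the authors' code): it fakes a dense depth map and keeps only 0.02% of its pixels, the coverage level the paper reports on its driving data. The image size and depth range are made-up numbers for illustration.

```python
import numpy as np

# Hypothetical sketch: how sparse "0.02% coverage" really is.
# We fake a dense depth map and keep only a handful of pixels,
# the way radar/LiDAR returns would land on an image.
rng = np.random.default_rng(0)

H, W = 900, 1600                                    # assumed image size
dense_depth = rng.uniform(2.0, 80.0, size=(H, W))   # stand-in ground truth (meters)

coverage = 0.0002                                    # 0.02% of pixels carry a measurement
n_samples = int(H * W * coverage)                    # ~288 pixels out of 1.44 million
ys = rng.integers(0, H, n_samples)
xs = rng.integers(0, W, n_samples)

sparse_depth = np.full((H, W), np.nan)               # NaN = "no measurement here"
sparse_depth[ys, xs] = dense_depth[ys, xs]

print(f"{n_samples} measured pixels out of {H * W} total")
```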

3. The Magic Trick: The "Smart Connector" (Gaussian Processes)

Now, how do you turn a few scattered dots into a full, smooth wall?
The authors use a mathematical tool called a Gaussian Process. Think of this as a super-smart rubber sheet.

  • You pin your few "dart hits" (the radar data) onto the sheet.
  • The rubber sheet naturally stretches and fills in the gaps between the darts, creating a smooth, continuous surface.
  • The Best Part: The sheet knows where it is unsure. If there are no darts nearby, the sheet gets "wobbly" (high uncertainty). If there are darts close by, it stays firm (low uncertainty).
  • This allows the system to create a dense, 3D map from very sparse data, while also knowing exactly which parts of the map are reliable and which parts are just guesses.
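
The paper's exact Gaussian Process setup isn't reproduced here, but the "rubber sheet" idea is easy to sketch with scikit-learn: fit a GP to a few pixel locations with measured depth, then query every pixel to get both a dense depth map (the mean) and a per-pixel confidence (the standard deviation). The kernel, length scale, and toy measurements below are illustrative assumptions.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.default_rng(0)

H, W = 60, 80                                # tiny image so the demo runs fast
n_samples = 25                               # the few "dart hits"
xy = np.column_stack([rng.integers(0, W, n_samples),
                      rng.integers(0, H, n_samples)]).astype(float)
depth = 10.0 + 0.05 * xy[:, 0] + rng.normal(0, 0.1, n_samples)  # fake range readings

# The "rubber sheet": a smooth kernel plus a small noise term for the sensor.
kernel = RBF(length_scale=15.0) + WhiteKernel(noise_level=0.01)
gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True).fit(xy, depth)

# Query every pixel: mean = dense depth map, std = how "wobbly" the sheet is there.
grid_x, grid_y = np.meshgrid(np.arange(W), np.arange(H))
query = np.column_stack([grid_x.ravel(), grid_y.ravel()]).astype(float)
mean, std = gp.predict(query, return_std=True)
dense_depth = mean.reshape(H, W)
uncertainty = std.reshape(H, W)

print("most confident pixel std:", uncertainty.min())    # near a measurement
print("least confident pixel std:", uncertainty.max())   # far from any measurement
```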

4. The Result: A Perfect Movie

They take this new, reliable 3D map and feed it into the AI painter.

  • Instead of the AI guessing where the walls are, it now has a solid blueprint.
  • When the camera moves, the AI knows exactly how the objects should shift and rotate.
  • The Outcome: The video is smooth, the geometry is correct, and most of the "glitches" disappear.
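
As a rough illustration of why a reliable depth map acts as a blueprint (this is generic pinhole-camera geometry, not the authors' pipeline): once a pixel's depth is known, it can be lifted into 3D and reprojected into a shifted camera, which tells the generator exactly how far that pixel should move. The intrinsics and camera shift below are made-up numbers.

```python
import numpy as np

def reproject(u, v, depth, K, R, t):
    """Lift pixel (u, v) with known depth into 3D, then project into a new view."""
    # Back-project to a 3D point in the original camera frame.
    p = depth * np.linalg.inv(K) @ np.array([u, v, 1.0])
    # Move to the new camera frame (rotation R, translation t).
    p_new = R @ p + t
    # Project back onto the new image plane.
    uvw = K @ p_new
    return uvw[:2] / uvw[2]

K = np.array([[800.0, 0.0, 320.0],     # assumed pinhole intrinsics
              [0.0, 800.0, 240.0],
              [0.0, 0.0, 1.0]])
R = np.eye(3)                           # no rotation, just...
t = np.array([0.3, 0.0, 0.0])           # ...a small sideways camera shift (meters)

near = reproject(320, 240, depth=4.0,  K=K, R=R, t=t)
far  = reproject(320, 240, depth=40.0, K=K, R=R, t=t)
print("nearby point moves to:", near)   # shifts a lot (parallax)
print("distant point moves to:", far)   # barely moves
```

If the depth is wrong, this shift is wrong, and the generated frames warp and flicker; that is exactly the "glitchiness" the dense, uncertainty-aware depth map removes.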

The Proof: "A Little Help Goes a Long Way"

The researchers tested this on real driving footage.

  • The Old Way (Vision Only): The AI guessed the depth. The resulting video was shaky and looked wrong (like a bad 3D movie).
  • The New Way (Vision + Sparse Radar): They used radar data that covered only 0.02% of the image (roughly 2 pixels out of every 10,000).
  • The Result: Even with such tiny amounts of extra data, the video quality improved massively. The "glitchiness" dropped by nearly half, and the images looked much more realistic.

The Big Takeaway

You don't need a full 3D scanner or a million photos to make a great 3D movie from one picture. You just need one photo and a tiny bit of real distance data (like a few radar pings) to guide the AI.

It's like trying to navigate a dark room: You could stumble around guessing where the furniture is (Vision Only), or you could tap the wall with a cane a few times (Sparse Radar) to get a rough map. That small bit of extra information makes the difference between falling over and walking smoothly.

In short: A single image + a little bit of multimodal sensing = A perfect 3D experience.
