ConfCtrl: Enabling Precise Camera Control in Video Diffusion via Confidence-Aware Interpolation

Imagine you are trying to recreate a 3D movie scene, but you only have two photos: one taken from the far left and one from the far right. Your goal is to generate all the frames in between, as if a camera smoothly glided from left to right, revealing parts of the room you've never seen before.

This is the challenge the ConfCtrl paper tackles. Here is how the authors solved it, explained simply:

The Problem: Two Bad Options

Currently, there are two ways to try to do this, and both have flaws:

The "Strict Architect" (Regression Methods): These models are like rigid architects. They try to calculate the exact 3D shape of the room based on your two photos.
- The Flaw: If the room has a chair hidden behind a table in your photos, the architect gets confused. They can't "imagine" the chair, so they leave a blurry hole or a weird glitch in the video. They are good at geometry but bad at creativity.
The "Daydreaming Artist" (Diffusion Models): These are powerful AI artists trained on millions of videos. They are great at imagining what a hidden chair looks like.
- The Flaw: They are terrible at following instructions. If you tell them, "Move the camera exactly 5 feet to the right," they might drift off course, tilt the camera weirdly, or forget where they started. They are creative but uncontrollable.

The Solution: ConfCtrl (The "Smart Navigator")

The authors created ConfCtrl, a system that combines the best of both worlds. Think of it as a Smart Navigator guiding a Creative Driver.

Here is how it works in three simple steps:

1. The "Confidence Map" (Knowing What to Trust)

The system first looks at the 3D data it gets from the two photos. But it knows that this data is "noisy" (like a GPS signal that sometimes jumps around).

The Analogy: Imagine you are hiking with a map that has some foggy, unclear areas. A normal hiker might get lost in the fog. ConfCtrl is like a hiker who carries a Confidence Map. It says, "I trust the trail markers here (high confidence), but I'm not sure about this swampy area (low confidence)."
The Magic: Instead of blindly following the shaky 3D map, the AI uses this confidence map to decide how much to trust the geometry. It leans on the solid parts and ignores the shaky parts.

2. The "Predict-Update" Loop (The Kalman Filter)

This is the brain of the operation, inspired by how submarines or self-driving cars navigate.

The Prediction: The AI guesses where the camera should be next based on your instructions (e.g., "Move right").
The Update: It then checks its "noisy" 3D map.
- If the map agrees with the prediction, great!
- If the map is shaky or wrong (like the swampy area), the AI says, "I see the map is confused, so I'll stick closer to my original plan."
- If the map is clear, it says, "Okay, the map is right, let's adjust slightly."
The Result: This back-and-forth "Predict-Update" dance ensures the camera stays on the exact path you wanted, without getting lost in the noise.
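
For readers who want to see the classic predict-update cycle that this step borrows from, here is a minimal one-dimensional Kalman filter sketch. It shows the textbook filter, not ConfCtrl's learned module; the function name `kalman_step` and all numeric values are illustrative assumptions.

```python
def kalman_step(x, p, u, z, q, r):
    """One predict-update cycle of a scalar Kalman filter.

    x, p: previous state estimate and its variance
    u:    control input (the commanded move)
    z:    noisy measurement of the new state
    q, r: process and measurement noise variances
    """
    # Predict: apply the commanded motion; uncertainty grows by q.
    x_pred = x + u
    p_pred = p + q
    # Update: the gain k is near 1 for a trustworthy measurement (small r)
    # and near 0 for a noisy one (large r).
    k = p_pred / (p_pred + r)
    x_new = x_pred + k * (z - x_pred)
    p_new = (1.0 - k) * p_pred
    return x_new, p_new


# Same prediction (1.0) and same measurement (1.6), different measurement noise.
clear, _ = kalman_step(x=0.0, p=1.0, u=1.0, z=1.6, q=0.0, r=0.01)
foggy, _ = kalman_step(x=0.0, p=1.0, u=1.0, z=1.6, q=0.0, r=100.0)
print(clear, foggy)
```

With a clear map (small `r`) the estimate lands close to the measurement; with a foggy map (large `r`) it stays close to the original plan.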

3. Starting with a Head Start (Initialization)

Most AI video generators start with pure static noise (like TV snow) and try to turn it into a video. ConfCtrl is smarter.

The Analogy: Instead of starting a race from a complete standstill, ConfCtrl starts the race already halfway there. It takes the "Confidence Map" and mixes it with the noise right at the beginning.
Why it helps: This gives the AI a strong hint about the shape of the room immediately, so it doesn't have to guess as much. It's like giving the artist a rough sketch before asking them to paint the masterpiece.

The Outcome

By using this "Smart Navigator" approach, ConfCtrl can:

Follow instructions closely: The camera stays on the path you specify, with far less drift than other diffusion-based methods.
Fill in the blanks: It can "hallucinate" (imagine) the parts of the scene you didn't see in the original photos, like the back of a chair, with high quality.
Work anywhere: Because it learned from a massive video model, it can handle new, unseen environments without needing to be retrained.

In short: ConfCtrl is like a GPS that knows when the signal is bad and ignores the glitches, ensuring your creative video journey stays on the exact path you planned, even when the scenery gets complicated.

Here is a detailed technical summary of the paper "ConfCtrl: Enabling Precise Camera Control in Video Diffusion via Confidence-Aware Interpolation."

1. Problem Statement

The paper addresses the challenge of Novel View Synthesis (NVS) from only two sparse input images under large viewpoint changes. Existing methods face a trade-off between geometric accuracy and generative capability:

Regression-based methods: These feedforward models (e.g., Gaussian Splatting variants) can follow camera trajectories well but lack generative priors. Consequently, they fail to reconstruct unseen regions, leaving blurry holes and other artifacts when input views are sparse.
Diffusion-based methods: These leverage powerful generative priors from large-scale pretraining but struggle to strictly adhere to target camera poses. They often deviate from intended trajectories due to noisy point cloud projections or insufficient conditioning, resulting in geometric inconsistencies.

The core challenge is to design a system that combines the generative capacity of diffusion models with the geometric precision of regression-based approaches, specifically handling the uncertainty inherent in 3D priors derived from foundation models.

2. Methodology: ConfCtrl

The authors propose ConfCtrl, a confidence-aware video interpolation framework built upon a pretrained video interpolation model (Wan2.1-Interpolation). The method introduces two key innovations to bridge the gap between interpolation and novel view synthesis:

A. Confidence-Aware Noise Initialization

Instead of initializing the rectified flow diffusion process with pure Gaussian noise, ConfCtrl initializes the latent space with a weighted sum of a projected point cloud latent and noise.

Mechanism: A point-wise confidence map (derived from a 3D foundation model like VGGT) quantifies the reliability of each estimated 3D point.
Formula: $z_0 = \lambda_1 \cdot (w \odot \hat{z}_{pc}) + \lambda_2 \cdot \epsilon$
- Where $w$ represents confidence weights, $\odot$ is element-wise multiplication, $\hat{z}_{pc}$ is the projected point cloud latent, $\epsilon$ is Gaussian noise, and $\lambda_1$, $\lambda_2$ weight the two terms of the sum.
Benefit: This provides a more reliable initial distribution that adapts the model from "temporal interpolation" to "novel view synthesis," leveraging the strong geometric priors of the foundation model while accounting for its uncertainty.
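
The initialization formula above can be sketched in a few lines of Python. This is only an illustration on flat lists of floats: the function name `confidence_aware_init`, the toy latent values, and the choices of $\lambda_1$ and $\lambda_2$ are assumptions, since the summary does not state the values the authors use.

```python
import random


def confidence_aware_init(z_pc, w, lam1, lam2, rng):
    """Return lam1 * (w * z_pc) + lam2 * eps, computed element by element."""
    if len(z_pc) != len(w):
        raise ValueError("z_pc and w must have the same length")
    # eps is drawn independently per element from a standard Gaussian.
    return [lam1 * wi * zi + lam2 * rng.gauss(0.0, 1.0) for zi, wi in zip(z_pc, w)]


rng = random.Random(0)            # seeded for reproducibility
z_pc = [2.0, 2.0, 2.0]            # hypothetical point cloud latent values
w = [1.0, 0.5, 0.0]               # trusted, uncertain, untrusted
z0 = confidence_aware_init(z_pc, w, lam1=1.0, lam2=0.1, rng=rng)
print(z0)
```

Where the confidence is 0, the starting latent is pure scaled noise, so the diffusion model has to generate that region itself; where it is 1, the starting latent is dominated by the geometric prior.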

B. Predict-Update Camera Conditioning (Kalman-Inspired)

To mitigate uncertainty in 3D geometric priors (e.g., distortion, scale ambiguity), the authors introduce a Kalman Filter-inspired predict-update mechanism integrated into the Diffusion Transformer (DiT) blocks.

Prediction Submodule: Conditions the latent state solely on the target camera pose (control input $u$). This generates a prediction of the scene based on the desired trajectory.
Update Submodule: Treats the projected point cloud as a "noisy measurement." It learns a residual correction ($\Delta$) to refine the prediction by fusing it with the geometric observation.
- $z_{update} = z_{pred} + \Delta(z_{pred}, \hat{z}_{pc})$
Function: This architecture allows the model to dynamically balance the "belief" in the camera trajectory against the "observation" from the 3D prior. If the point cloud is uncertain (low confidence), the model relies more on the camera pose; if the pose is ambiguous, it leans on the geometry.
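
The residual form of the update can be illustrated with a toy example. In the paper $\Delta$ is learned inside the DiT blocks; the version below replaces it with a hand-written, confidence-scaled correction purely to show the data flow, so `predict`, `delta`, `update`, and the per-element `conf` gate are all assumptions rather than the authors' design.

```python
def predict(z, u):
    """Prediction stand-in: shift the latent state by the control input u."""
    return [zi + u for zi in z]


def delta(z_pred, z_pc, conf):
    """Hand-written stand-in for the learned residual.

    Scales the gap between measurement and prediction by a confidence
    in [0, 1], playing the role a Kalman gain would play.
    """
    return [c * (m - p) for p, m, c in zip(z_pred, z_pc, conf)]


def update(z_pred, z_pc, conf):
    """z_update = z_pred + delta(z_pred, z_pc)."""
    d = delta(z_pred, z_pc, conf)
    return [p + di for p, di in zip(z_pred, d)]


z_pred = predict([0.0, 0.0], u=1.0)
z_update = update(z_pred, z_pc=[3.0, 3.0], conf=[1.0, 0.0])
print(z_update)  # [3.0, 1.0]
```

The first element has full confidence and moves all the way to the measurement; the second has zero confidence and keeps the pose-only prediction.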

C. Training Objective

The model is trained using a Rectified Flow objective.

Loss Function: Standard flow matching loss ($L_{RF}$) combined with a Latent Gradient Regularization term ($L_{grad}$).
Regularization: $L_{grad}$ enforces alignment of spatial gradients in the latent space, ensuring high-frequency details are preserved and reducing flickering artifacts during rapid viewpoint changes.
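
A rough sketch of how such an objective could be assembled is shown below, on one-dimensional toy latents. Several details are assumptions, because the summary does not give them: the time convention ($x_t = (1-t)x_0 + t x_1$ with noise at $t=0$), the exact form of $L_{grad}$ (here a mean squared difference of finite differences), and the weighting factor `grad_weight`.

```python
def interpolate(x0, x1, t):
    """Straight-line path between noise x0 (t=0) and data x1 (t=1)."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]


def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def rf_loss(v_pred, x0, x1):
    """Flow matching loss: the target velocity along the path is x1 - x0."""
    target = [b - a for a, b in zip(x0, x1)]
    return mse(v_pred, target)


def estimate_clean(x_t, v_pred, t):
    """One-step estimate of the clean latent from x_t and the velocity."""
    return [x + (1.0 - t) * v for x, v in zip(x_t, v_pred)]


def finite_diff(x):
    """Forward differences as a 1-D stand-in for spatial gradients."""
    return [x[i + 1] - x[i] for i in range(len(x) - 1)]


def grad_loss(z_pred, z_true):
    """Assumed form of L_grad: match the gradients of the two latents."""
    return mse(finite_diff(z_pred), finite_diff(z_true))


def total_loss(v_pred, x0, x1, t, grad_weight):
    x_t = interpolate(x0, x1, t)
    z_pred = estimate_clean(x_t, v_pred, t)
    return rf_loss(v_pred, x0, x1) + grad_weight * grad_loss(z_pred, x1)


x0, x1 = [0.0, 0.0, 0.0], [1.0, 2.0, 3.0]
print(total_loss([1.0, 2.0, 3.0], x0, x1, t=0.25, grad_weight=0.1))  # 0.0
```

Because the gradient term compares differences between neighbouring latent values, it ignores a constant offset and penalizes only mismatched edges and detail, which is consistent with its stated purpose of preserving high-frequency content.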

3. Key Contributions

Leveraging Interpolation Priors: Demonstrated that pretrained video interpolation models provide superior 3D consistency for NVS under sparse inputs compared to standard video generation models.
Confidence-Aware Initialization: Introduced a novel initialization strategy using confidence-weighted point cloud latents, enabling effective adaptation from interpolation to NVS.
Predict-Update Conditioning: Proposed a Kalman-inspired conditioning mechanism that jointly encodes camera poses and noisy 3D priors, achieving robust geometry and precise camera control without requiring extensive retraining data.
State-of-the-Art Performance: Achieved consistent improvements over both regression-based and diffusion-based baselines across multiple datasets, with strong zero-shot generalization capabilities.

4. Experimental Results

The method was evaluated on CO3D-Hydrant, CO3D-Teddybear, and DL3DV datasets, as well as cross-dataset benchmarks (RealEstate10k, GraspNet).

Quantitative Performance: ConfCtrl outperformed all baselines (including PixelSplat, AnySplat, CameraCtrl, ViewCrafter) in:
- Image Quality: Higher PSNR, SSIM, and lower LPIPS.
- Camera Control: Significantly reduced Translation Error ($E_t$) and Rotation Error ($E_r$), indicating strict adherence to target poses.
- Generative Metrics: Lower FID and FVD scores compared to other diffusion methods, indicating better visual fidelity.
Qualitative Results: The method successfully reconstructed occluded regions and maintained sharp details under large viewpoint changes, whereas regression methods suffered from artifacts and diffusion methods drifted from the camera path.
Zero-Shot Generalization: The model demonstrated strong performance on out-of-distribution datasets without fine-tuning, which supports the robustness of the learned geometric priors.
Ablation Studies: Confirmed that removing either the confidence-aware initialization or the predict-update module significantly degrades performance, validating the necessity of both components.
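
To make the image-quality and camera-control metrics above concrete, here is a small sketch of how PSNR and pose errors are commonly computed. The summary does not give the paper's exact definitions of $E_t$ and $E_r$, so the geodesic rotation angle and Euclidean translation distance used here are assumptions based on common practice, and the function names are illustrative.

```python
import math


def psnr(img_a, img_b, max_val=1.0):
    """Peak signal-to-noise ratio in dB for two flat pixel lists."""
    mse = sum((a - b) ** 2 for a, b in zip(img_a, img_b)) / len(img_a)
    if mse == 0.0:
        return math.inf
    return 10.0 * math.log10(max_val ** 2 / mse)


def rotation_error_deg(r_gt, r_pred):
    """Geodesic angle between two 3x3 rotation matrices, in degrees."""
    # trace(R_gt^T R_pred) equals the sum of element-wise products.
    trace = sum(r_gt[i][j] * r_pred[i][j] for i in range(3) for j in range(3))
    # Clamp to guard against rounding slightly outside [-1, 1].
    cos_angle = max(-1.0, min(1.0, (trace - 1.0) / 2.0))
    return math.degrees(math.acos(cos_angle))


def translation_error(t_gt, t_pred):
    """Euclidean distance between two camera positions."""
    return math.dist(t_gt, t_pred)


identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
quarter_turn_z = [[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
print(rotation_error_deg(identity, quarter_turn_z))
print(translation_error([0.0, 0.0, 0.0], [3.0, 4.0, 0.0]))  # 5.0
```

Higher PSNR means the rendered frame is closer to the ground truth, while lower rotation and translation errors mean the generated camera path is closer to the requested one.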

5. Significance and Impact

ConfCtrl represents a significant step forward in controllable video generation and 3D reconstruction.

Bridging the Gap: It effectively bridges the gap between the geometric precision of feedforward 3D methods and the generative power of diffusion models.
Handling Uncertainty: By explicitly modeling uncertainty via a Kalman-inspired framework, it offers a robust solution for scenarios where 3D priors (like depth maps) are noisy or incomplete.
Practical Application: The ability to generate high-quality, geometrically consistent novel views from just two images with precise camera control is crucial for applications in AR/VR, robotics, and digital content creation, where large viewpoint changes are common but data is scarce.

Limitations: The authors note that the current approach is constrained by the Video VAE architecture, which is optimized for smooth temporal transitions rather than abrupt camera motions. Future work may involve optimizing the VAE or removing it entirely to better handle large positional changes.