Geometry-as-context: Modulating Explicit 3D in Scene-consistent Video Generation to Geometry Context

Imagine you are trying to tell a story about a room by walking around it, taking photos, and showing them to someone who has never seen the room before.

The Old Way (The "Broken Chain"):
Previously, AI systems tried to do this by acting like a clumsy construction crew.

1. They would look at a photo and guess the shape of the room (like guessing where the walls are).
2. They would build a rough 3D model based on that guess.
3. They would try to take a "photo" from a new angle using that model.
4. Because the model was rough, the new photo would look blurry or have holes. So, they'd have to use a different tool to "paint over" the holes (inpainting).
5. Then, they'd use that new, slightly better photo to guess the shape again for the next step.

The Problem: Every time they made a guess or painted over a hole, they made tiny mistakes. Because they did this step-by-step with different tools, the mistakes piled up like a snowball rolling down a hill. By the time they reached the end of the video, the room looked nothing like the beginning. The walls might be floating, or the furniture might have melted.

The New Way (Geometry-as-Context / GaC):
The authors of this paper, "Geometry-as-Context," realized that instead of using a clumsy construction crew with separate tools, they should hire a single, super-talented artist who can do everything in one fluid motion.

Here is how their new method works, using a few analogies:

1. The "All-in-One" Artist

Instead of stopping to build a 3D model and then painting, the AI learns to do both at the same time. It looks at the current picture, imagines the 3D shape in its head, and immediately paints the next frame.

Analogy: Think of a magician pulling a rabbit out of a hat. In the old way, the magician would have to go backstage, build a fake rabbit, walk back out, and put it in the hat. In the new way, the rabbit just appears because the magician knows exactly how the trick works. The AI "knows" the 3D shape without needing to build a separate, error-prone model first.

2. The "Camera Remote Control" (Camera Gated Attention)

The AI needs to know exactly where the camera is moving. If the camera turns left, the AI needs to know to show the left wall.

Analogy: Imagine the AI is a driver in a car. In the old systems, the driver had to look at a map, guess the road, and then steer. In this new system, the camera pose is like a GPS that talks directly to the steering wheel. The paper introduces a special "gate" (a smart switch) that tells the AI: "Hey, right now we are looking at the shape of the room," or "Now we are painting the picture." This prevents the AI from getting confused about what it's supposed to be doing at any given second.

3. The "Training with a Safety Net" (Geometry Dropout)

To teach the AI, the researchers showed it a sequence of images mixed with "blueprints" (geometry data).

The Trick: Sometimes, they would hide the blueprints during training.
Why? Imagine teaching a student to drive. You let them drive with a map (blueprints) for a while. Then, you take the map away and say, "Okay, drive without it!" If they can still drive well, it means they actually learned the road, not just memorized the map.
The Result: This allows the AI to generate beautiful videos for users who don't want to see the blueprints, while still having learned the 3D rules from the blueprints during training.

4. The "Time Travel" Test

The paper tested this by making the camera go forward and then immediately backward (a "forth-and-back" journey).

The Old Way: By the time the camera returned to the start, the room had changed. The chair might have moved, or the color might have shifted.
The New Way: The camera returns to the start, and the room looks much the same as it did at the beginning. The AI retained the 3D structure, like a human remembering a room they just walked out of.

Summary

Geometry-as-Context is like upgrading from a team of clumsy builders who keep making mistakes and piling them up, to a single, brilliant director who keeps the 3D layout of the scene in mind throughout. By combining the "thinking" (geometry) and the "drawing" (video) into one smooth process, the AI creates videos that stay consistent, look realistic, and don't fall apart when the camera moves around.

1. Problem Statement

Scene-consistent video generation aims to generate videos that explore a 3D scene based on a user-defined camera trajectory, ensuring that the geometry and texture of objects remain consistent across different viewpoints.

Existing approaches suffer from two main limitations:

Video-based methods: Rely on external memory or retrieval mechanisms. While they can achieve preliminary consistency, they struggle with complex scenes and large camera movements, often failing to maintain strict 3D consistency.
Reconstruction-based methods: Use explicit 3D signals (e.g., point clouds, 3D Gaussian Splatting) to iteratively synthesize novel views. However, these methods suffer from cumulative errors due to:
1. Non-differentiable operators: The pipeline involves inverse rendering and unprojection steps that cannot be backpropagated through.
2. Non-end-to-end training: Geometry estimation and image inpainting are handled by separate models. Errors in early stages (geometry estimation) propagate and amplify through subsequent rendering and inpainting steps, leading to the "butterfly effect" where the scene degrades over long sequences.

2. Methodology: Geometry-as-Context (GaC)

The core idea of GaC is to replace the non-differentiable, multi-model reconstruction pipeline with a single, fully differentiable, autoregressive video generation model. This model treats geometry estimation, 3D reconstruction simulation, and image inpainting as a unified sequence generation task.

A. Framework Reformulation

Instead of the traditional iterative loop (Estimate Geometry $\to$ Unproject $\to$ Render $\to$ Inpaint), GaC interleaves these steps into a single autoregressive sequence.

Input Sequence: The model processes a sequence of frames containing: [Current Image, Geometry Context, Warped Image, Next Image, ...].
Unified Model: A single camera-controlled video generation model (based on a DiT architecture) predicts the next frame in the sequence.
Variant Selection: The authors propose three variants but select Variant #1 (Geometry as Context) as the primary training scheme. In this variant, the model predicts the geometry ( $G_i$ ) of the current view and then generates the RGB image ( $I_{i+1}$ ) for the next view. This explicit geometric estimation helps the model learn 3D structure while maintaining efficiency.
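The interleaving used by Variant #1 can be pictured as a frame-ordering routine. The sketch below is illustrative only: `Frame`, `build_interleaved_sequence`, and the `kind` tags are hypothetical names, not taken from the paper, and the choice to omit geometry for the final view is an assumption of this example.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Frame:
    kind: str  # "rgb" for an image I_i, "geo" for its geometry G_i
    view: int  # index i of the camera view


def build_interleaved_sequence(num_views: int) -> List[Frame]:
    """Order frames as I_0, G_0, I_1, G_1, ..., I_{n-1}.

    Geometry for the last view is left out because no later image in
    the sequence depends on it (an assumption of this sketch).
    """
    if num_views < 1:
        raise ValueError("num_views must be at least 1")
    sequence: List[Frame] = []
    for i in range(num_views):
        sequence.append(Frame("rgb", i))
        if i < num_views - 1:
            sequence.append(Frame("geo", i))
    return sequence


seq = build_interleaved_sequence(3)
labels = [f"{'I' if f.kind == 'rgb' else 'G'}_{f.view}" for f in seq]
print(labels)  # ['I_0', 'G_0', 'I_1', 'G_1', 'I_2']
```

An autoregressive model trained on such an ordering would predict each entry from the ones before it, so every new image I_{i+1} is conditioned on the geometry G_i of the previous view.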

B. Key Architectural Components

Camera Gated Attention (CGA):
- Challenge: Standard camera conditioning (e.g., simple concatenation) cannot distinguish between tasks requiring geometry estimation versus those requiring image synthesis.
- Solution: CGA encodes camera poses using Plücker rays. These rays are integrated into the self-attention mechanism.
- Mechanism: The camera features modulate the Query ( $Q$ ) and generate a Gating Matrix. This gate regulates the output of the self-attention layer, allowing the model to dynamically decide how much camera information should influence geometry prediction versus image generation.
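As a rough illustration of the CGA idea, the NumPy sketch below computes per-pixel Plücker coordinates for a pinhole camera and uses them to modulate the query and gate the output of a single-head self-attention layer. It is not the authors' implementation: the additive query modulation, the sigmoid gate, the single head, and all names (`plucker_rays`, `camera_gated_attention`, `W_cam`, `W_g`, and so on) are assumptions made for this example.

```python
import numpy as np


def plucker_rays(K, R, t, height, width):
    """Per-pixel Plücker coordinates (o x d, d) for a pinhole camera.

    K: 3x3 intrinsics; R: camera-to-world rotation; t: camera centre o.
    Returns an array of shape (height * width, 6).
    """
    v, u = np.meshgrid(np.arange(height) + 0.5, np.arange(width) + 0.5, indexing="ij")
    pixels = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3)
    dirs = pixels @ np.linalg.inv(K).T @ R.T           # ray directions in world frame
    dirs = dirs / np.linalg.norm(dirs, axis=-1, keepdims=True)
    moment = np.cross(t[None, :], dirs)                # o x d
    return np.concatenate([moment, dirs], axis=-1)


def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)


def camera_gated_attention(tokens, rays, params):
    """Self-attention whose query and output are modulated by camera features."""
    cam = rays @ params["W_cam"]                       # (N, D) camera features
    q = tokens @ params["W_q"] + cam                   # assumed: additive query modulation
    k = tokens @ params["W_k"]
    v = tokens @ params["W_v"]
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    gate = 1.0 / (1.0 + np.exp(-(cam @ params["W_g"])))  # assumed: sigmoid gate in (0, 1)
    return gate * (attn @ v)                           # gate scales the attention output


rng = np.random.default_rng(0)
H, W, D = 2, 3, 8
K = np.array([[2.0, 0.0, 1.5], [0.0, 2.0, 1.0], [0.0, 0.0, 1.0]])
rays = plucker_rays(K, np.eye(3), np.array([0.0, 0.0, 1.0]), H, W)
shapes = {"W_cam": (6, D), "W_q": (D, D), "W_k": (D, D), "W_v": (D, D), "W_g": (D, D)}
params = {name: rng.normal(scale=0.1, size=shape) for name, shape in shapes.items()}
tokens = rng.normal(size=(H * W, D))
out = camera_gated_attention(tokens, rays, params)
print(rays.shape, out.shape)  # (6, 6) (6, 8)
```

Because the gate is computed from camera features alone, the same attention weights can contribute strongly or weakly to the output depending on the pose, which is one plausible reading of how the gate "regulates" the layer.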
Training Strategy: Geometry Dropout:
- Challenge: Training with interleaved geometry and image sequences doubles the sequence length (reducing efficiency) and forces the model to output geometry even when the user only wants RGB video during inference.
- Solution: During training, the geometry context is randomly dropped with a certain probability.
- Benefit: This forces the model to learn scene consistency from the geometry context when available, but also enables it to bypass redundant geometry output during inference, allowing for pure image-to-image generation. It also reduces training/inference costs.
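A minimal sketch of geometry dropout on an interleaved training sample is shown below. The summary does not say whether geometry is dropped per sample or per frame, so dropping all geometry frames of a sample at once is an assumption here, as are the names `apply_geometry_dropout` and `Token` and the example probability.

```python
import random
from typing import List, Sequence, Tuple

Token = Tuple[str, int]  # ("rgb" or "geo", view index); hypothetical representation


def apply_geometry_dropout(
    sequence: Sequence[Token], drop_prob: float, rng: random.Random
) -> List[Token]:
    """With probability drop_prob, remove every geometry frame from the sample."""
    if not 0.0 <= drop_prob <= 1.0:
        raise ValueError("drop_prob must lie in [0, 1]")
    if rng.random() < drop_prob:
        # Geometry dropped: the model sees a pure image-to-image sequence.
        return [tok for tok in sequence if tok[0] != "geo"]
    # Geometry kept: the model sees the full interleaved sequence.
    return list(sequence)


sample = [("rgb", 0), ("geo", 0), ("rgb", 1), ("geo", 1), ("rgb", 2)]
rng = random.Random(0)
kept = apply_geometry_dropout(sample, drop_prob=0.0, rng=rng)
dropped = apply_geometry_dropout(sample, drop_prob=1.0, rng=rng)
print(len(kept), len(dropped))  # 5 3
```

When the geometry frames are dropped, the sequence is shorter, which is consistent with the reported reduction in training cost; at inference the same image-only path can be used when the user only wants RGB output.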

3. Key Contributions

GaC Framework: A novel paradigm that internalizes non-differentiable 3D reconstruction operators into a generative model, enabling end-to-end training and avoiding the cumulative errors caused by chaining separate models.
Camera Gated Attention (CGA): A specialized attention mechanism that uses Plücker rays to modulate self-attention, significantly improving the model's ability to control camera poses and distinguish between geometry and texture tasks.
Geometry Dropout Strategy: A training technique that balances the benefits of explicit 3D context with inference efficiency, allowing the model to generate consistent videos without mandatory geometry output.
State-of-the-Art Performance: The method achieves superior results in both single-view and "forth-and-back" (cyclic) camera trajectories, demonstrating robust long-term 3D memory.

4. Experimental Results

The model was trained on the RealEstate10K dataset and evaluated on RealEstate10K and Tanks-and-Temples (which features larger camera motions).

Quantitative Performance:
- GaC outperforms baselines (CameraCtrl, ViewCrafter, Voyager, etc.) across all metrics.
- FID: Lower (55.76 vs. 65.12 for Voyager), indicating better alignment with the data distribution.
- LPIPS: Lower (0.354 vs. 0.395), indicating better perceptual similarity.
- Camera Accuracy: Significantly lower rotation ( $R_{err}$ ) and translation ( $T_{err}$ ) errors, indicating more accurate camera control.
Qualitative Performance:
- Consistency: GaC maintains object identity and texture consistency even when objects disappear and reappear (e.g., in cyclic "forth-and-back" trajectories).
- Visual Quality: Generates sharper textures and more accurate colors than reconstruction-based methods, which often suffer from artifacts or blurring.
Ablation Studies:
- Variants: Variant #1 (Geometry as Context) performed best, confirming that explicit geometric estimation aids 3D consistency.
- CGA: Removing CGA led to a significant drop in camera control accuracy ( $R_{err}$ increased from 0.024 to 0.032).
- Dropout: Geometry dropout cut training time roughly in half (from 24 s/step to 11 s/step, about a 54% reduction) with negligible performance loss.
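The summary does not define how $R_{err}$ and $T_{err}$ are computed. One common convention, assumed here and not confirmed by the source, is the geodesic angle between predicted and target rotations and the Euclidean distance between camera positions. The function names are hypothetical.

```python
import numpy as np


def rotation_error(R_pred, R_gt):
    """Geodesic angle (radians) between two 3x3 rotation matrices."""
    cos_angle = (np.trace(R_gt.T @ R_pred) - 1.0) / 2.0
    # Clip to guard against values slightly outside [-1, 1] from rounding.
    return float(np.arccos(np.clip(cos_angle, -1.0, 1.0)))


def translation_error(t_pred, t_gt):
    """Euclidean distance between two camera positions."""
    return float(np.linalg.norm(np.asarray(t_pred) - np.asarray(t_gt)))


def rot_z(theta):
    """Rotation by theta radians about the z axis."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])


r_err = rotation_error(rot_z(0.1), np.eye(3))
t_err = translation_error([1.0, 0.0, 0.0], [0.0, 0.0, 0.0])
print(round(r_err, 3), t_err)  # 0.1 1.0
```

Under this convention, the ablation figures quoted above (0.024 versus 0.032) would be read as average angular deviations, though the units and averaging scheme are not stated in the summary.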

5. Significance

This work represents a significant shift in 3D-aware video generation. By moving from a modular, non-differentiable pipeline to a unified, differentiable generative framework, GaC addresses the fundamental issue of error accumulation in long-horizon scene generation.

For Applications: It enables high-fidelity, interactive 3D experiences for AR/VR, gaming, and embodied AI, where maintaining consistent geometry over long camera trajectories is critical.
For Research: It demonstrates that explicit 3D information (geometry) can be effectively leveraged as a "context" within generative models, bridging the gap between traditional computer vision reconstruction and modern diffusion/autoregressive generation.
