Geometry-as-context: Modulating Explicit 3D in Scene-consistent Video Generation to Geometry Context

This paper introduces "geometry-as-context," an autoregressive framework that iteratively estimates scene geometry and restores novel views to achieve superior scene consistency and camera control in video generation, overcoming the error accumulation and non-differentiability issues of previous methods.

JiaKui Hu, Jialun Liu, Liying Yang, Xinliang Zhang, Kaiwen Li, Shuang Zeng, Yuanwei Li, Haibin Huang, Chi Zhang, Yanye Lu

Published 2026-02-26

Imagine you are trying to tell a story about a room by walking around it, taking photos, and showing them to someone who has never seen the room before.

The Old Way (The "Broken Chain"):
Previously, AI systems tried to do this by acting like a clumsy construction crew.

  1. They would look at a photo and guess the shape of the room (like guessing where the walls are).
  2. They would build a rough 3D model based on that guess.
  3. They would try to take a "photo" from a new angle using that model.
  4. Because the model was rough, the new photo would look blurry or have holes. So, they'd have to use a different tool to "paint over" the holes (inpainting).
  5. Then, they'd use that new, slightly better photo to guess the shape again for the next step.

The Problem: Every time they made a guess or painted over a hole, they made tiny mistakes. Because they did this step-by-step with different tools, the mistakes piled up like a snowball rolling down a hill. By the time they reached the end of the video, the room looked nothing like the beginning. The walls might be floating, or the furniture might have melted.
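To see why small mistakes snowball, here is a toy sketch in Python. It is not the pipeline from the paper. The function name `accumulated_error` and the numbers `step_error` and `amplification` are made up for illustration, under the assumption that every step adds a small fresh error and slightly magnifies the error it inherited from the step before.

```python
from typing import List


def accumulated_error(num_steps: int,
                      step_error: float = 0.01,
                      amplification: float = 1.05) -> List[float]:
    """Toy model of a step-by-step pipeline (all numbers are illustrative)."""
    error = 0.0
    history = []
    for _ in range(num_steps):
        # Each step starts from the previous, already imperfect output,
        # so the old error is carried over (and slightly magnified)
        # before this step's own small mistake is added on top.
        error = error * amplification + step_error
        history.append(error)
    return history


if __name__ == "__main__":
    history = accumulated_error(50)
    print(f"error after  1 step : {history[0]:.3f}")
    print(f"error after 50 steps: {history[-1]:.3f}")
```

With these made-up numbers, the error after 50 steps is roughly four times larger than 50 independent 0.01 mistakes would add up to, which is the snowball effect described above.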

The New Way (Geometry-as-Context / GaC):
The authors of this paper, "Geometry-as-Context," realized that instead of using a clumsy construction crew with separate tools, they should hire a single, super-talented artist who can do everything in one fluid motion.

Here is how their new method works, using a few analogies:

1. The "All-in-One" Artist

Instead of stopping to build a 3D model and then painting, the AI learns to do both at the same time. It looks at the current picture, imagines the 3D shape in its head, and immediately paints the next frame.

  • Analogy: Think of a magician pulling a rabbit out of a hat. In the old way, the magician would have to go backstage, build a fake rabbit, walk back out, and put it in the hat. In the new way, the rabbit just appears because the magician knows exactly how the trick works. The AI still works out the 3D shape at every step, but it does so inside the same model that paints the frame, instead of handing off to a separate, error-prone tool.
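The "one artist does both jobs" idea can be pictured as a simple loop. This is only a sketch of the overall flow, not the authors' code: `rollout`, `dummy_model`, and the string "payloads" are hypothetical stand-ins, and a real system would pass images and geometry maps through a neural network instead.

```python
from typing import Callable, List, Tuple

# (kind, step index, payload); the payload is a plain string in this sketch.
Item = Tuple[str, int, str]
Model = Callable[[List[Item], str, str], str]


def rollout(first_frame: str, camera_poses: List[str], model: Model) -> List[Item]:
    """Alternate geometry estimation and frame generation with ONE model."""
    context: List[Item] = [("frame", 0, first_frame)]
    for step, pose in enumerate(camera_poses, start=1):
        # First ask the model for the scene geometry, given everything so far.
        geometry = model(context, pose, "geometry")
        context.append(("geometry", step, geometry))
        # Then ask the same model for the next frame; it now sees the
        # geometry as part of its context rather than via a separate tool.
        frame = model(context, pose, "frame")
        context.append(("frame", step, frame))
    return context


def dummy_model(context: List[Item], pose: str, mode: str) -> str:
    """Hypothetical stand-in for the real network."""
    return f"{mode} for {pose} (saw {len(context)} earlier items)"


if __name__ == "__main__":
    for item in rollout("photo_0", ["turn_left", "step_forward"], dummy_model):
        print(item)
```

The point of the sketch is the shape of the loop: geometry and frames live in one shared context, so there is no hand-off between separate tools where errors can creep in.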

2. The "Camera Remote Control" (Camera Gated Attention)

The AI needs to know exactly where the camera is moving. If the camera turns left, the AI needs to know to show the left wall.

  • Analogy: Imagine the AI is a driver in a car. In the old systems, the driver had to look at a map, guess the road, and then steer. In this new system, the camera pose is like a GPS that talks directly to the steering wheel. The paper introduces a special "gate" (a smart switch) that tells the AI: "Hey, right now we are looking at the shape of the room," or "Now we are painting the picture." This prevents the AI from getting confused about what it's supposed to be doing at any given second.
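A "gate" in a neural network is usually just a number between 0 and 1 that decides how much of a signal gets through. The sketch below shows that general idea only. This summary does not give the paper's actual formula, so `gated_camera_update`, the gate values, and the mode names are all assumptions made up for illustration.

```python
import math
from typing import List


def sigmoid(x: float) -> float:
    """Squash any number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def gated_camera_update(hidden: List[float],
                        camera_signal: List[float],
                        gate_logit: float) -> List[float]:
    """Mix camera information into the model's features through a gate.

    gate close to 0 -> the camera signal is mostly ignored
    gate close to 1 -> the camera signal is fully added
    """
    gate = sigmoid(gate_logit)
    return [h + gate * c for h, c in zip(hidden, camera_signal)]


if __name__ == "__main__":
    hidden = [1.0, 1.0]
    camera = [2.0, -2.0]
    # Hypothetical gate settings for the two jobs the model switches between.
    for mode, logit in [("geometry", -4.0), ("frame", 4.0)]:
        print(mode, gated_camera_update(hidden, camera, logit))
```

In a trained model the gate value would be learned rather than typed in by hand; the fixed numbers here only show how a single switch can change what the model pays attention to.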

3. The "Training with a Safety Net" (Geometry Dropout)

To teach the AI, the researchers showed it a sequence of images mixed with "blueprints" (geometry data).

  • The Trick: Sometimes, they would hide the blueprints during training.
  • Why? Imagine teaching a student to drive. You let them drive with a map (blueprints) for a while. Then, you take the map away and say, "Okay, drive without it!" If they can still drive well, it means they actually learned the road, not just memorized the map.
  • The Result: This allows the AI to generate beautiful videos for users who don't want to see the blueprints, while still having learned the 3D rules from the blueprints during training.
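Hiding the blueprints can be sketched in a few lines. This is a minimal illustration of the general dropout idea, assuming the training sequence is a list of tagged items. The function name `apply_geometry_dropout` and the `drop_prob` value are hypothetical, not taken from the paper.

```python
import random
from typing import List, Tuple

Item = Tuple[str, str]  # (kind, payload), where kind is "frame" or "geometry"


def apply_geometry_dropout(sequence: List[Item],
                           drop_prob: float,
                           rng: random.Random) -> List[Item]:
    """With probability drop_prob, remove every geometry item from the sequence."""
    if rng.random() < drop_prob:
        # "Take the map away": the model sees only the frames this time.
        return [item for item in sequence if item[0] != "geometry"]
    # Otherwise keep the full interleaved sequence, blueprints included.
    return list(sequence)


if __name__ == "__main__":
    rng = random.Random(0)
    example = [("frame", "f0"), ("geometry", "g1"), ("frame", "f1")]
    for _ in range(4):
        print(apply_geometry_dropout(example, drop_prob=0.5, rng=rng))
```

Because the model sometimes trains without blueprints, it can later be run in a frames-only mode for users who just want the video.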

4. The "Time Travel" Test

The paper tested this by making the camera go forward and then immediately backward (a "forth-and-back" journey).

  • The Old Way: By the time the camera returned to the start, the room had changed. The chair might have moved, or the color might have shifted.
  • The New Way: The camera returns to the start, and the room looks the same as it did at the beginning. The AI kept track of the 3D structure, like a person remembering a room they just walked out of.
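A forth-and-back check is easy to describe in code. The sketch below builds such a camera path and compares the first and last frames with a simple average pixel difference. The helper names and the tiny lists standing in for images are made up for illustration; the summary does not say which metrics the paper actually reports.

```python
from typing import List


def forth_and_back(poses: List[int]) -> List[int]:
    """Walk the camera path forward, then retrace it back to the start."""
    return poses + poses[:-1][::-1]


def mean_abs_diff(frame_a: List[float], frame_b: List[float]) -> float:
    """Average per-pixel difference; 0.0 means the two frames are identical."""
    return sum(abs(a - b) for a, b in zip(frame_a, frame_b)) / len(frame_a)


if __name__ == "__main__":
    path = forth_and_back([0, 1, 2, 3])
    print("camera path:", path)

    # Stand-in "images": a consistent model should return to the same pixels.
    first_frame = [0.2, 0.5, 0.9]
    last_frame_consistent = [0.2, 0.5, 0.9]
    last_frame_drifted = [0.4, 0.3, 0.7]
    print("consistent:", mean_abs_diff(first_frame, last_frame_consistent))
    print("drifted   :", mean_abs_diff(first_frame, last_frame_drifted))
```

The lower the difference between the first frame and the frame generated on returning to the same camera position, the better the model has held on to the scene.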

Summary

Geometry-as-Context is like upgrading from a team of clumsy builders who keep making mistakes and piling them up, to a single director who keeps the whole 3D scene in mind. By combining the "thinking" (geometry) and the "drawing" (video) into one smooth process, the AI creates videos that stay consistent, look realistic, and don't fall apart when the camera moves around.
