Geometry-Aware Rotary Position Embedding for Consistent Video World Model

This paper introduces ViewRope, a geometry-aware rotary position embedding that injects camera-ray directions into video transformers to resolve spatial persistence issues and hallucinations in predictive world models, accompanied by a sparse attention mechanism and a new diagnostic benchmark for evaluating long-term 3D consistency.

Chendong Xiang, Jiajun Liu, Jintao Zhang, Xiao Yang, Zhengwei Fang, Shizun Wang, Zijun Wang, Yingtian Zou, Hang Su, Jun Zhu

Published 2026-02-24

Imagine you are playing a video game where you can walk around a virtual world. You turn left, look at a red brick wall, walk in a circle, and then turn right to face that same wall again.

In a perfect world, that wall should look exactly the same as it did a moment ago. But in many current AI video generators, something weird happens: when you turn back, the wall might look blurry, the bricks might have changed color, or the AI might hallucinate that a tree suddenly grew there. The AI has "forgotten" what it just saw.

This paper introduces a new system called ViewRope (short for View-Rotary Position Embedding) that fixes this problem. Here is how it works, explained simply:

1. The Problem: The "Amnesia" of Current AI

Most video AI models are like a person trying to remember a room while only looking at a single photograph at a time. They know what the current picture looks like, but they don't have a good map of how the room is built in 3D space.

  • The Old Way: The AI thinks in "screen coordinates." It remembers, "The red brick was at column 100, row 50." But if you turn your head, the red brick moves to column 200, row 10. The AI thinks, "Oh, that's a new object!" and gets confused. It loses track of the 3D reality.
  • The Result: When the camera loops back around (like turning 360 degrees), the AI's output "drifts": the world looks different than it did before, breaking the illusion of a consistent reality.

2. The Solution: ViewRope (The "Compass" System)

The authors realized that to keep a world consistent, the AI shouldn't just look at where a pixel is on the screen. It needs to know where the camera is looking in 3D space.

Think of ViewRope as giving the AI a 3D compass for every single patch of the video.

  • The Analogy: Imagine you are holding a flashlight in a dark room.
    • Old AI: Only remembers the shape of the shadow on the wall. If you move the flashlight, the shadow changes, and the AI thinks the wall changed.
    • ViewRope: Remembers the direction the flashlight is pointing. Even if the shadow moves to a different part of the wall, the AI knows, "Ah, this is the same flashlight beam hitting a different spot."
  • How it works: Instead of just saying "Pixel X is here," ViewRope tells the AI, "This pixel is being seen from this specific angle." When the camera turns back to the original spot, the AI recognizes the angle immediately and says, "I've seen this before!" and retrieves the correct memory.
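The idea of "embed the viewing angle, not the screen position" can be sketched in a few lines. The snippet below is a minimal NumPy illustration, not the paper's implementation: the function names (`view_rope`, `rope_from_angles`) and the specific angle parameterization (azimuth/elevation per patch) are assumptions chosen to make the idea concrete. It reuses the standard rotary-embedding rotation, but drives the rotation angles from each patch's camera-ray direction, so two patches that look along the same ray get the same embedding no matter when they appear in the video.

```python
import numpy as np

def rope_from_angles(x, angles):
    """Classic RoPE rotation: rotate consecutive feature pairs of x by angles.

    x: (num_patches, dim) with dim even; angles: (num_patches, dim // 2).
    """
    x1, x2 = x[:, 0::2], x[:, 1::2]
    cos, sin = np.cos(angles), np.sin(angles)
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

def view_rope(x, ray_dirs, freqs):
    """Hypothetical ViewRope-style sketch (names/parameterization assumed).

    Instead of rotating by 2D screen coordinates, derive the rotation
    angles from each patch's 3D camera-ray direction.
    ray_dirs: (num_patches, 3) unit vectors; freqs: (dim // 4,) bands.
    """
    az = np.arctan2(ray_dirs[:, 0], ray_dirs[:, 2])  # azimuth of the ray
    el = np.arcsin(np.clip(ray_dirs[:, 1], -1.0, 1.0))  # elevation of the ray
    # Each frequency band gets a scaled copy of (azimuth, elevation).
    angles = np.concatenate([az[:, None] * freqs, el[:, None] * freqs], axis=1)
    return rope_from_angles(x, angles)
```

Because the rotation depends only on the ray direction, a patch seen again from the same angle after a full camera loop is embedded identically to the first time, which is exactly the "I've seen this before!" behavior described above.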

3. The Efficiency Hack: "Geometry-Aware Sparse Attention"

There's a second problem: remembering everything is slow. If you watch a 10-minute video, the AI has to compare every new frame against every single previous frame. That's like trying to find a specific book in a library by checking every single book on every shelf, one by one. It takes forever.

The paper introduces a smart filter called Geometry-Aware Frame-Sparse Attention.

  • The Analogy: Imagine you are looking for a specific friend in a crowded stadium.
    • The Old Way (Dense Attention): You scan the entire crowd, looking at every single person's face, even those on the other side of the stadium who are definitely not your friend.
    • The New Way (Sparse Attention): You use your "Compass" (ViewRope). You know your friend is wearing a red hat and is standing in the North section. You instantly ignore everyone in the South, East, and West sections. You only look at the North section.
  • The Result: The AI ignores irrelevant history and only "pays attention" to the specific moments in the past that match the current camera angle. This makes the system much faster (saving about 25% of computing time) while actually making the memory better because it's not getting distracted by useless data.
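The "only look at the North section" idea above can be sketched as a simple pre-filter before attention. This is a toy single-query illustration under my own assumptions (the function names and the cosine-similarity threshold are not from the paper): keep only past frames whose camera viewing direction overlaps the current one, then run ordinary attention over that small subset.

```python
import numpy as np

def select_visible_frames(current_dir, past_dirs, fov_cos=0.5):
    """Keep past frames whose viewing direction overlaps the current view.

    Directions are unit vectors; fov_cos is the cosine of the overlap
    threshold (0.5 corresponds to roughly 60 degrees). Threshold assumed.
    """
    sims = past_dirs @ current_dir           # cosine similarity per frame
    return np.nonzero(sims >= fov_cos)[0]    # indices of frames to attend to

def sparse_attention(q, keys, values, past_dirs, current_dir):
    """Toy geometry-aware sparse attention for a single query vector."""
    idx = select_visible_frames(current_dir, past_dirs)
    k, v = keys[idx], values[idx]            # attend only to relevant frames
    scores = k @ q / np.sqrt(q.shape[0])
    w = np.exp(scores - scores.max())        # numerically stable softmax
    w /= w.sum()
    return w @ v
```

The cost saving comes from the filter: attention is quadratic in how many frames it compares, so discarding frames that point away from the current view shrinks the comparison set before the expensive step runs.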

4. The Proof: ViewBench

To prove this works, the authors built a new test called ViewBench.

  • The Test: They make the AI generate a video where the camera spins around a room and comes back to the exact starting point (a "loop closure").
  • The Score: They measure how much the room changed when the camera returned.
  • The Outcome: ViewRope was significantly better than previous state-of-the-art models. It kept the scene consistent, reduced "hallucinations" (fake details), and did it all while running faster.
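A loop-closure score of this kind can be illustrated with a standard image-similarity measure. The sketch below uses PSNR between the frame at the starting pose and the frame generated when the camera returns to that pose; the function name and the choice of PSNR are my assumptions for illustration, not necessarily the metric ViewBench reports.

```python
import numpy as np

def loop_closure_psnr(first_frame, return_frame, max_val=255.0):
    """Toy loop-closure check in the spirit of ViewBench (metric assumed).

    Compares the frame at the start pose with the frame generated when the
    camera returns to the same pose. Higher PSNR = more consistent scene.
    """
    diff = first_frame.astype(np.float64) - return_frame.astype(np.float64)
    mse = np.mean(diff ** 2)
    if mse == 0:
        return float("inf")  # pixel-perfect loop closure
    return 10.0 * np.log10(max_val ** 2 / mse)
```

A model with the "amnesia" problem described in section 1 would score poorly here even if every individual frame looks sharp, which is what makes a loop-closure test a useful diagnostic on top of ordinary per-frame quality metrics.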

Summary

In short, ViewRope teaches video AI to stop thinking like a 2D photographer and start thinking like a 3D explorer. By giving the AI a built-in understanding of camera angles and directions, it can remember the world consistently, even after long, complex camera movements, without getting slow or confused.

It's the difference between a camera that just takes pictures and a camera that actually understands the world it is filming.
