Geometry-Aware Rotary Position Embedding for Consistent Video World Model

This paper introduces ViewRope, a geometry-aware rotary position embedding that injects camera-ray directions into video transformers to resolve spatial persistence issues and hallucinations in predictive world models, accompanied by a sparse attention mechanism and a new diagnostic benchmark for evaluating long-term 3D consistency.

Chendong Xiang, Jiajun Liu, Jintao Zhang, Xiao Yang, Zhengwei Fang, Shizun Wang, Zijun Wang, Yingtian Zou, Hang Su, Jun Zhu

Published 2026-02-24

Imagine you are playing a video game where you can walk around a virtual world. You turn left, look at a red brick wall, walk in a circle, and then turn right to face that same wall again.

In a perfect world, that wall should look exactly the same as it did a moment ago. But in many current AI video generators, something weird happens: when you turn back, the wall might look blurry, the bricks might have changed color, or the AI might hallucinate that a tree suddenly grew there. The AI has "forgotten" what it just saw.

This paper introduces a new system called ViewRope (short for View-Rotary Position Embedding) that fixes this problem. Here is how it works, explained simply:

1. The Problem: The "Amnesia" of Current AI

Most video AI models are like a person trying to remember a room while only looking at a single photograph at a time. They know what the current picture looks like, but they don't have a good map of how the room is built in 3D space.

  • The Old Way: The AI thinks in "screen coordinates." It remembers, "The red brick was at column 100, row 50." But if you turn your head, the red brick moves to column 200, row 10. The AI thinks, "Oh, that's a new object!" and gets confused. It loses track of the 3D reality.
  • The Result: When the camera loops back around (like turning 360 degrees), the AI's output "drifts": the world looks different than it did before, breaking the illusion of a consistent reality.

2. The Solution: ViewRope (The "Compass" System)

The authors realized that to keep a world consistent, the AI shouldn't just look at where a pixel is on the screen. It needs to know where the camera is looking in 3D space.

Think of ViewRope as giving the AI a 3D compass for every single patch of the video.

  • The Analogy: Imagine you are holding a flashlight in a dark room.
    • Old AI: Only remembers the shape of the shadow on the wall. If you move the flashlight, the shadow changes, and the AI thinks the wall changed.
    • ViewRope: Remembers the direction the flashlight is pointing. Even if the shadow moves to a different part of the wall, the AI knows, "Ah, this is the same flashlight beam hitting a different spot."
  • How it works: Instead of just saying "Pixel X is here," ViewRope tells the AI, "This pixel is being seen from this specific angle." When the camera turns back to the original spot, the AI recognizes the angle immediately and says, "I've seen this before!" and retrieves the correct memory.
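The idea of "embed the viewing angle, not the screen position" can be sketched in a few lines. The snippet below is a minimal NumPy illustration, not the paper's implementation: the function names (`view_rope`, `rope_from_angles`) and the specific angle parameterization (azimuth/elevation per patch) are assumptions chosen to make the idea concrete. It reuses the standard rotary-embedding rotation, but drives the rotation angles from each patch's camera-ray direction, so two patches that look along the same ray get the same embedding no matter when they appear in the video.

```python
import numpy as np

def rope_from_angles(x, angles):
    """Classic RoPE rotation: rotate consecutive feature pairs of x by angles.

    x: (num_patches, dim) with dim even; angles: (num_patches, dim // 2).
    """
    x1, x2 = x[:, 0::2], x[:, 1::2]
    cos, sin = np.cos(angles), np.sin(angles)
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

def view_rope(x, ray_dirs, freqs):
    """Hypothetical ViewRope-style sketch (names/parameterization assumed).

    Instead of rotating by 2D screen coordinates, derive the rotation
    angles from each patch's 3D camera-ray direction.
    ray_dirs: (num_patches, 3) unit vectors; freqs: (dim // 4,) bands.
    """
    az = np.arctan2(ray_dirs[:, 0], ray_dirs[:, 2])  # azimuth of the ray
    el = np.arcsin(np.clip(ray_dirs[:, 1], -1.0, 1.0))  # elevation of the ray
    # Each frequency band gets a scaled copy of (azimuth, elevation).
    angles = np.concatenate([az[:, None] * freqs, el[:, None] * freqs], axis=1)
    return rope_from_angles(x, angles)
```

Because the rotation depends only on the ray direction, a patch seen again from the same angle after a full camera loop is embedded identically to the first time, which is exactly the "I've seen this before!" behavior described above.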

3. The Efficiency Hack: "Geometry-Aware Sparse Attention"

There's a second problem: remembering everything is slow. If you watch a 10-minute video, the AI has to compare every new frame against every single previous frame. That's like trying to find a specific book in a library by checking every single book on every shelf, one by one. It takes forever.

The paper introduces a smart filter called Geometry-Aware Frame-Sparse Attention.

  • The Analogy: Imagine you are looking for a specific friend in a crowded stadium.
    • The Old Way (Dense Attention): You scan the entire crowd, looking at every single person's face, even those on the other side of the stadium who are definitely not your friend.
    • The New Way (Sparse Attention): You use your "Compass" (ViewRope). You know your friend is wearing a red hat and is standing in the North section. You instantly ignore everyone in the South, East, and West sections. You only look at the North section.
  • The Result: The AI ignores irrelevant history and only "pays attention" to the specific moments in the past that match the current camera angle. This makes the system much faster (saving about 25% of computing time) while actually making the memory better because it's not getting distracted by useless data.
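The "only look at the North section" idea above can be sketched as a simple pre-filter before attention. This is a toy single-query illustration under my own assumptions (the function names and the cosine-similarity threshold are not from the paper): keep only past frames whose camera viewing direction overlaps the current one, then run ordinary attention over that small subset.

```python
import numpy as np

def select_visible_frames(current_dir, past_dirs, fov_cos=0.5):
    """Keep past frames whose viewing direction overlaps the current view.

    Directions are unit vectors; fov_cos is the cosine of the overlap
    threshold (0.5 corresponds to roughly 60 degrees). Threshold assumed.
    """
    sims = past_dirs @ current_dir           # cosine similarity per frame
    return np.nonzero(sims >= fov_cos)[0]    # indices of frames to attend to

def sparse_attention(q, keys, values, past_dirs, current_dir):
    """Toy geometry-aware sparse attention for a single query vector."""
    idx = select_visible_frames(current_dir, past_dirs)
    k, v = keys[idx], values[idx]            # attend only to relevant frames
    scores = k @ q / np.sqrt(q.shape[0])
    w = np.exp(scores - scores.max())        # numerically stable softmax
    w /= w.sum()
    return w @ v
```

The cost saving comes from the filter: attention is quadratic in how many frames it compares, so discarding frames that point away from the current view shrinks the comparison set before the expensive step runs.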

4. The Proof: ViewBench

To prove this works, the authors built a new test called ViewBench.

  • The Test: They make the AI generate a video where the camera spins around a room and comes back to the exact starting point (a "loop closure").
  • The Score: They measure how much the room changed when the camera returned.
  • The Outcome: ViewRope was significantly better than previous state-of-the-art models. It kept the scene consistent, reduced "hallucinations" (fake details), and did it all while running faster.
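A loop-closure score of this kind can be illustrated with a standard image-similarity measure. The sketch below uses PSNR between the frame at the starting pose and the frame generated when the camera returns to that pose; the function name and the choice of PSNR are my assumptions for illustration, not necessarily the metric ViewBench reports.

```python
import numpy as np

def loop_closure_psnr(first_frame, return_frame, max_val=255.0):
    """Toy loop-closure check in the spirit of ViewBench (metric assumed).

    Compares the frame at the start pose with the frame generated when the
    camera returns to the same pose. Higher PSNR = more consistent scene.
    """
    diff = first_frame.astype(np.float64) - return_frame.astype(np.float64)
    mse = np.mean(diff ** 2)
    if mse == 0:
        return float("inf")  # pixel-perfect loop closure
    return 10.0 * np.log10(max_val ** 2 / mse)
```

A model with the "amnesia" problem described in section 1 would score poorly here even if every individual frame looks sharp, which is what makes a loop-closure test a useful diagnostic on top of ordinary per-frame quality metrics.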

Summary

In short, ViewRope teaches video AI to stop thinking like a 2D photographer and start thinking like a 3D explorer. By giving the AI a built-in understanding of camera angles and directions, it can remember the world consistently, even after long, complex camera movements, without getting slow or confused.

It's the difference between a camera that just takes pictures and a camera that actually understands the world it is filming.
