Imagine you are trying to teach a robot to drive a car. To do this safely, the robot needs to understand not just what it sees right now, but how the world will change in the next second, the next minute, and even the next hour. It needs to predict the future.
This paper introduces RAYNOVA, a new kind of "brain" for robots that acts like a crystal ball for driving. Instead of just memorizing rules, it learns to imagine how the world evolves.
Here is how it works, broken down into simple concepts:
1. The Problem: The "Rigid Blueprint" vs. The "Flexible Dream"
Most previous AI models for driving are like architects with a rigid blueprint. They try to force the world into a strict 3D grid (like a video game map).
- The Issue: If the camera moves in a way the blueprint didn't expect, or if the car turns a sharp corner, the blueprint breaks. The AI gets confused because it relies too much on specific 3D geometry (like knowing exactly where every wall is in 3D space).
RAYNOVA is different. It's more like a dreamer. Instead of building a rigid 3D map, it learns the "flow" of the world using light rays.
- The Analogy: Imagine you are in a dark room with many flashlights (cameras). Instead of trying to build a 3D model of the furniture, RAYNOVA just tracks the beams of light. It understands that if a beam of light hits a tree, and the tree moves, the beam changes. It doesn't care where the tree is in a global map; it only cares about the relationship between the light and the object. This makes it incredibly flexible.
2. The Secret Sauce: "Ray Space" and "Relative Position"
The paper introduces a clever trick called Plücker-ray positional encoding.
- The Analogy: Think of a standard GPS. It tells you your location based on a fixed map (Latitude/Longitude). If you move to a new city, the map coordinates change, and you have to relearn everything.
- RAYNOVA's Approach: It uses relative directions. Instead of saying "The tree is at coordinates X, Y, Z," it says, "The tree is 5 degrees to the right of the light beam."
- Why it matters: Because it uses relative directions, RAYNOVA can look at a scene from a brand new camera angle it has never seen before (like a camera on a drone instead of a car) and still make sense of the scene. It's like knowing a song by its melody, regardless of which instrument is playing it.
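For the curious, the "relative" part can be made concrete. A Plücker ray is just a direction plus a "moment" (the cross product of a point on the ray with that direction), and the pair doesn't change no matter which point along the ray you pick. The sketch below shows only the raw Plücker coordinates, not how RAYNOVA actually embeds them inside the network; that part is our assumption-free placeholder.

```python
import numpy as np

def plucker_ray(origin, direction):
    """Encode a camera ray as 6-D Plücker coordinates (d, m).

    d is the unit ray direction; m = origin x d is the moment.
    The pair is independent of which point on the ray we call
    'origin', so two cameras sharing a line of sight agree.
    """
    d = np.asarray(direction, dtype=float)
    d = d / np.linalg.norm(d)
    m = np.cross(np.asarray(origin, dtype=float), d)
    return np.concatenate([d, m])

# Two different points on the same line give the same encoding:
r1 = plucker_ray([0.0, 0.0, 0.0], [1.0, 0.0, 0.0])
r2 = plucker_ray([2.0, 0.0, 0.0], [1.0, 0.0, 0.0])  # shifted along the ray
```

That invariance is the whole point: the encoding describes the beam itself, not where the camera happens to sit on it.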
3. The "Dual-Causal" Engine: Reading the Book Backwards and Forwards
Most video generators try to predict the next frame, then the next, then the next. RAYNOVA does something smarter called Dual-Causal Autoregression.
- The Analogy: Imagine you are reading a book, but you are also drawing the pictures as you go.
- Scale Causality (The Sketch): First, you draw a rough sketch of the whole page (low resolution). Then, you add details to the sketch (medium resolution). Finally, you add the fine details (high resolution). You don't try to draw the final picture all at once; you build it layer by layer.
- Time Causality (The Story): You also look at the previous pages to know what happens next.
- The Magic: RAYNOVA does both at the same time. It builds the image from "rough to detailed" while simultaneously building the story from "past to future." This allows it to generate high-quality, long videos very quickly.
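The generation order described above can be sketched as two nested loops: the outer loop walks forward in time, the inner loop refines each frame from coarse to fine. The `predict` function below is a stand-in for the learned model; the exact conditioning scheme inside RAYNOVA is not spelled out here, so treat this purely as an illustration of the ordering.

```python
def generate(num_frames, scales=("low", "mid", "high"), predict=None):
    """Toy sketch of dual-causal generation order.

    Scale causality: each frame is refined coarse-to-fine.
    Time causality: each step may condition on everything
    generated so far (the 'history' list).
    """
    if predict is None:
        # Placeholder model: just labels each (frame, scale) token.
        predict = lambda history, t, s: f"frame{t}@{s}"
    history = []
    for t in range(num_frames):      # past -> future
        for s in scales:             # rough -> detailed
            history.append(predict(history, t, s))
    return history

order = generate(2)
# Frame 0 is fully refined at every scale before frame 1 begins.
```

Because every token only looks backward (in both time and detail), the model can stream frames out one after another instead of redrawing the whole video.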
4. The "Recurrent Training" Fix: Practicing for the Long Haul
When AI generates long videos, it often starts to hallucinate or drift off course (like a student who forgets the beginning of a story by the time they get to the end).
- The Solution: The authors created a training method called Recurrent Training.
- The Analogy: Imagine a student who usually only practices on short quizzes, but now has to write a long story. To prepare, RAYNOVA is trained by being forced to generate a long video, then being told, "Okay, now pretend you made a small mistake in an earlier frame and try to continue from there." This teaches the AI to recover from its own errors, making it much more stable for long drives.
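The core idea, feeding the model its own (slightly corrupted) output back in during training instead of always handing it the ground-truth past, can be sketched in a few lines. Everything here (`model`, `update`, the scalar "frames") is a hypothetical placeholder, not the paper's actual API; it only illustrates the training loop's shape.

```python
import random

def recurrent_training_step(model, update, clip, noise=0.1):
    """Hedged sketch of recurrent training on one clip of frames.

    Instead of conditioning each prediction on the ground-truth
    previous frame ("teacher forcing"), we continue from the model's
    own output, lightly perturbed, so it learns to recover from its
    own mistakes over long rollouts.
    """
    context = clip[0]
    total_loss = 0.0
    for target in clip[1:]:
        pred = model(context)
        total_loss += update(pred, target)  # standard next-frame loss
        # Key trick: roll forward from the model's own prediction,
        # with a small injected error, not from the ground truth.
        context = pred + random.uniform(-noise, noise)
    return total_loss
```

With a toy "model" that perfectly predicts the next frame, e.g. `recurrent_training_step(lambda x: x + 1, lambda p, t: abs(p - t), [0, 1, 2, 3], noise=0.0)`, the accumulated loss is zero; with noise enabled, the model is constantly practicing recovery.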
5. What Can It Do?
Because of these tricks, RAYNOVA is a Versatile World Foundation Model:
- Zero-Shot Magic: You can show it a camera setup it has never seen (like a camera on a bicycle or a drone), and it will still generate a realistic video.
- Control: You can tell it, "Put a red car here," or "Make it rain," and it will obey.
- Speed: It generates video much faster than previous models because it builds images layer-by-layer (like a painter) rather than trying to fix noise pixel-by-pixel (like a sculptor chipping away stone).
Summary
RAYNOVA is a new AI that learns to drive by understanding the world through beams of light rather than rigid 3D maps. It builds videos like an artist sketching from rough to fine, and it practices for long journeys by learning to fix its own mistakes. This makes it a powerful, flexible, and fast tool for simulating the future of autonomous driving.