UCM: Unifying Camera Control and Memory with Time-aware Positional Encoding Warping for World Models

UCM is a novel framework that unifies long-term memory and precise camera control in world models through a time-aware positional encoding warping mechanism and an efficient dual-stream diffusion transformer, achieving superior scene consistency and controllability in high-fidelity video generation.

Tianxing Xu, Zixuan Wang, Guangyuan Wang, Li Hu, Zhongyi Zhang, Peng Zhang, Bang Zhang, Song-Hai Zhang

Published 2026-02-27

Imagine you are directing a movie, but instead of a real set and actors, you are asking an AI to dream up the entire world frame by frame. You want the camera to fly over a mountain, dive into a valley, and then circle back to the exact same spot to see a bird that was there earlier.

The problem with current AI movie-makers is that they have short memories and bad spatial awareness.

  • The Memory Problem: If the camera flies away and comes back, the AI often forgets what the mountain looked like. It might turn the mountain into a cloud or change the color of the rocks. The world "drifts" and becomes inconsistent.
  • The Camera Problem: If you tell the AI to "move left," it often just pans the image like a TV screen, rather than actually moving a 3D camera through a 3D world.

The paper introduces a new system called UCM (Unifying Camera Control and Memory) to fix this. Here is how it works, explained with everyday analogies:

1. The "Time-Aware GPS" (The Core Innovation)

Most AI models treat video frames like a stack of flat photographs. UCM treats them like a 3D hologram with a GPS tag on every single pixel.

  • The Old Way: Imagine trying to remember a room by looking at a photo of it. If you walk around the room and look at the photo again, you can't easily tell where the photo was taken relative to your new position.
  • The UCM Way: UCM uses something called "Time-Aware Positional Encoding Warping." Think of this as giving every pixel in the video a living GPS coordinate that updates in real-time.
    • When the camera moves, the system doesn't just slide the image; it "warps" the coordinates of the old pixels to match the new camera angle.
    • It's like having a magic map that instantly tells the AI: "Hey, that pixel you saw 10 seconds ago on the left wall is now directly in front of you because we turned the camera."
    • This allows the AI to remember exactly what a scene looked like from any angle, ensuring that when the camera returns, the world looks exactly the same.
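The warping idea above can be sketched as a standard unproject–transform–reproject pipeline: lift each old pixel into 3D using its depth, move it into the new camera's coordinate frame, and project it back onto the image plane to get its updated "GPS coordinate." Below is a minimal NumPy sketch under the assumption of a pinhole camera with known intrinsics and per-pixel depth; the function name and interface are illustrative, not the paper's actual API.

```python
import numpy as np

def warp_positions(uv, depth, K, T_old_to_new):
    """Warp pixel coordinates from an old frame into the current camera's view.

    uv:           (N, 2) pixel coordinates in the old frame
    depth:        (N,)   per-pixel depth in the old frame
    K:            (3, 3) camera intrinsics (pinhole model, an assumption here)
    T_old_to_new: (4, 4) relative pose from the old camera to the new one
    Returns (N, 2) warped pixel coordinates, usable as positional indices.
    """
    n = uv.shape[0]
    ones = np.ones((n, 1))
    # Unproject: pixel + depth -> 3D point in the old camera's frame.
    rays = np.linalg.inv(K) @ np.hstack([uv, ones]).T   # (3, N) unit-depth rays
    pts_old = rays * depth                               # scale each ray by depth
    # Transform the points into the new camera's frame.
    pts_h = np.vstack([pts_old, ones.T])                 # homogeneous (4, N)
    pts_new = (T_old_to_new @ pts_h)[:3]
    # Reproject onto the new image plane.
    proj = K @ pts_new
    return (proj[:2] / proj[2]).T
```

With an identity pose (camera hasn't moved), the warped coordinates come back unchanged; any real camera motion shifts them, which is exactly the "living GPS" update described above.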

2. The "Dual-Stream Chef" (Efficiency)

Usually, to make a video consistent, you have to feed the AI all the previous frames every time it generates a new one. This is like trying to cook a gourmet meal while reading a 1,000-page cookbook every time you chop an onion. It's slow and expensive.

UCM introduces a Dual-Stream Diffusion Model, which is like having two specialized chefs working in a kitchen:

  • Chef A (The Librarian): Handles the "Memory." They hold the clean, perfect reference images (the old frames) and just make sure the ingredients are ready. They don't do the heavy lifting of cooking.
  • Chef B (The Cook): Handles the "Creation." They take the noisy, messy ingredients and cook the new video frames.
  • The Magic: Chef B only looks at Chef A's notes when they need to check a specific detail. They don't read the whole book every time. This makes the process much faster and allows the AI to remember a huge amount of history without getting overwhelmed or crashing.
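The two-chef division of labor can be sketched as asymmetric attention: the clean memory frames are encoded into keys and values exactly once and cached, while the noisy generation tokens re-query that cache at every denoising step. The toy NumPy sketch below makes the cost argument concrete; the class and method names are assumptions for illustration, not the paper's code.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

class DualStreamAttention:
    """Toy sketch of dual-stream attention: the memory stream ("Chef A")
    is encoded once per clip; the generation stream ("Chef B") queries
    the cached keys/values at every diffusion step."""

    def __init__(self, dim, seed=0):
        rng = np.random.default_rng(seed)
        self.Wq = rng.normal(size=(dim, dim)) / np.sqrt(dim)
        self.Wk = rng.normal(size=(dim, dim)) / np.sqrt(dim)
        self.Wv = rng.normal(size=(dim, dim)) / np.sqrt(dim)
        self.k_cache = self.v_cache = None

    def encode_memory(self, mem_tokens):
        # Run ONCE: clean reference frames never get re-processed.
        self.k_cache = mem_tokens @ self.Wk
        self.v_cache = mem_tokens @ self.Wv

    def denoise_step(self, noisy_tokens):
        # Run EVERY diffusion step: only the noisy tokens are recomputed.
        q = noisy_tokens @ self.Wq
        scores = q @ self.k_cache.T / np.sqrt(q.shape[-1])
        return softmax(scores) @ self.v_cache
```

Because `encode_memory` runs once while `denoise_step` runs dozens of times, the per-step cost is independent of how the memory was built — that is the "don't read the whole cookbook every time you chop an onion" effect.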

3. The "Virtual Time Traveler" (Training Data)

To learn how to do this, an AI usually needs thousands of videos where a camera flies around a scene and comes back to the start. These videos are rare in the real world.

The authors solved this with a clever Data Curation Strategy:

  • Imagine you have a video of a person walking down a street (a single camera view).
  • UCM takes that video, builds a 3D point-cloud model (a digital skeleton of the street) from it, and then virtually "time travels" the camera.
  • It renders the street from a new angle that the original camera never saw, creating a fake "revisit" video.
  • It's like photographing a cake, building a 3D model of it, and then rendering the cake from an angle that was never photographed. This trick let the authors train the AI on 500,000+ synthetic "revisit" videos that simply don't exist in real footage.
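The curation pipeline can be sketched in two steps: lift a frame into a 3D point cloud using estimated depth, then synthesize an out-and-back camera path so the rendered clip is guaranteed to revisit its starting view. A minimal NumPy sketch under those assumptions (the function names, the pinhole model, and the circular path are illustrative choices, not the paper's exact recipe):

```python
import numpy as np

def lift_to_point_cloud(depth, K):
    """Lift one frame's depth map into a 3D point cloud (pinhole model;
    the depth itself would come from an off-the-shelf estimator)."""
    h, w = depth.shape
    v, u = np.mgrid[0:h, 0:w]                       # pixel row/column grids
    pix = np.stack([u.ravel(), v.ravel(), np.ones(h * w)])
    return (np.linalg.inv(K) @ pix * depth.ravel()).T   # (H*W, 3) points

def revisit_path(n_steps=60, radius=2.0):
    """Hypothetical out-and-back camera trajectory: positions leave the
    original viewpoint, loop around, and return to it, so rendering the
    point cloud along the path yields a guaranteed 'revisit' clip."""
    t = np.linspace(0.0, 2 * np.pi, n_steps)
    return np.stack([radius * np.sin(t),            # sweep sideways...
                     np.zeros_like(t),
                     radius * (1 - np.cos(t))],     # ...and forward, then back
                    axis=1)
```

The first and last camera positions coincide by construction, which is exactly what makes the rendered video a useful "come back to the same spot" training example.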

The Result: A World That "Sticks"

Because of these tricks, UCM can generate long, high-quality videos where:

  1. The Camera is obedient: You can draw a complex path (loop, spiral, dive), and the AI follows it faithfully.
  2. The World is consistent: If you fly around a house and come back to the front door, the door looks exactly the same. The trees haven't moved, and the clouds haven't changed shape.

In summary: UCM is like giving an AI a 3D memory bank and a GPS tracker for every pixel, allowing it to generate infinite, consistent, and camera-controlled worlds without getting confused or running out of memory.
