Lumos-1: On Autoregressive Video Generation with Discrete Diffusion from a Unified Model Perspective

Lumos-1 is a unified, LLM-based autoregressive video generation model. It introduces MM-RoPE for balanced spatiotemporal modeling and Autoregressive Discrete Diffusion Forcing (AR-DF) to address training inefficiencies, achieving state-of-the-art performance with significantly fewer computational resources than existing methods.

Hangjie Yuan, Weihua Chen, Jun Cen, Hu Yu, Jingyun Liang, Shuning Chang, Zhihui Lin, Tao Feng, Pengwei Liu, Jiazheng Xing, Hao Luo, Jiasheng Tang, Fan Wang, Yi Yang

Published 2026-03-17

Imagine you are trying to teach a brilliant, well-read librarian (a Large Language Model, or LLM) how to paint moving pictures instead of just writing stories. The librarian is great at words, but when you ask them to draw a video, they get confused. They try to paint the video one tiny dot at a time, in a strict line, like writing a sentence. This is slow, and the dots often don't connect well, resulting in a blurry, jerky mess.

The paper introduces Lumos-1, a new way to teach this librarian to paint videos. It's like giving the librarian a magical new set of tools that let them paint the whole picture at once, understand how time flows, and fix their mistakes as they go.

Here is how Lumos-1 works, broken down into three simple magic tricks:

1. The "3D Compass" (MM-RoPE)

The Problem:
Standard LLMs use a "1D compass" (called Rotary Position Embedding, or RoPE) to know where words are in a sentence. It's like a ruler that only measures length. But a video isn't just a line; it's a 3D object with Time (how long it lasts), Height (up and down), and Width (left and right).
If you try to use a flat ruler to measure a 3D cube, the measurements get messy. The librarian gets confused about whether a bird is moving up or forward in time.

The Solution:
The authors invented MM-RoPE, a "3D Compass."

  • The Analogy: Imagine the librarian's brain is a giant orchestra. In the old system, the instruments for "Time," "Height," and "Width" were all crowded into the same small section of the room, playing at the same speed, causing a cacophony.
  • The Fix: MM-RoPE rearranges the orchestra. It gives the "Time" instruments their own spacious section, and it splits the "Height" and "Width" instruments into alternating seats so they can play together without stepping on each other's toes. This lets the model track exactly where an object sits in space and how it moves through time.
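The reallocation idea can be sketched in a few lines: each rotary channel pair is assigned to one of the three axes instead of all pairs measuring a single 1D position. This is a minimal illustration, not the paper's exact layout; the head dimension, the half-for-time split, and the strict height/width alternation are assumptions made for the example.

```python
import numpy as np

def mmrope_angles(t, h, w, head_dim=64, base=10000.0):
    """Sketch of a 3D rotary embedding: every rotary channel pair gets
    one axis (time, height, or width). The split below — half the pairs
    for time, the rest alternating height/width — is illustrative."""
    n_pairs = head_dim // 2
    # one inverse frequency per rotary pair, as in standard 1D RoPE
    inv_freq = 1.0 / (base ** (np.arange(n_pairs) / n_pairs))
    angles = np.empty(n_pairs)
    for i in range(n_pairs):
        if i < n_pairs // 2:               # dedicated section for time
            pos = t
        elif (i - n_pairs // 2) % 2 == 0:  # alternating height ...
            pos = h
        else:                              # ... and width channels
            pos = w
        angles[i] = pos * inv_freq[i]
    return angles  # later applied as cos/sin rotations to q and k

# rotation angles for a token at frame 2, row 5, column 7:
a = mmrope_angles(2, 5, 7)
```

Because each axis owns its own channels, moving a token in time changes a different set of rotation angles than moving it in height or width, so the attention mechanism can tell those motions apart.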

2. The "Group Painting" Trick (Autoregressive Discrete Diffusion Forcing)

The Problem:
Old video models try to paint a video like a human writing a letter: they write one word (for video, one tiny image token), then the next, then the next. This is called "next-token prediction." For even a short video, that's thousands of tiny steps in strict order. It's incredibly slow, and if you make a mistake on an early word, the whole sentence gets ruined.

The Solution:
Lumos-1 uses a technique called Discrete Diffusion, which is more like a "Group Painting" party.

  • The Analogy: Imagine you have a canvas covered in blank white squares. Instead of painting one square at a time, you cover the entire canvas with blank squares first. Then, you reveal a few squares at a time, let the model guess what goes there, and then reveal more. You repeat this until the whole picture is clear.
  • The Twist: The authors realized that if you just reveal random squares, the model gets lazy. It looks at the previous frame and just copies it, because that's the easiest way to fill in the blanks.
  • The Fix (AR-DF): They introduced a rule called Temporal Tube Masking. Imagine a tube going straight through the video from start to finish. If you hide a spot in the first frame, you must hide that exact same spot in every single frame that follows.
    • Why this works: The model can no longer just "copy-paste" from the previous frame. It has to actually predict how that specific spot changes over time. It forces the model to learn the physics of motion, not just the pattern of pixels.
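The tube-masking rule above can be sketched directly: sample one spatial mask and repeat it along the time axis, so a hidden position stays hidden in every frame. The tensor shapes and the Bernoulli sampler here are illustrative assumptions, not the paper's exact recipe.

```python
import numpy as np

def temporal_tube_mask(frames, height, width, mask_ratio, rng=None):
    """Sketch of temporal tube masking: sample ONE 2D spatial mask and
    broadcast it down the whole clip, forming a "tube" of hidden
    positions. True means the token is masked."""
    rng = rng or np.random.default_rng(0)
    spatial = rng.random((height, width)) < mask_ratio
    # the same spatial pattern repeats across every frame
    return np.broadcast_to(spatial, (frames, height, width)).copy()

mask = temporal_tube_mask(frames=8, height=4, width=4, mask_ratio=0.5)
# every frame hides exactly the same spatial positions:
assert all((mask[k] == mask[0]).all() for k in range(8))
```

Because a masked spot is masked in every frame, the model can never recover it by copying the previous frame; it has to infer how that spot evolves over time.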

3. The "Practice Run" (Inference Strategy)

The Problem:
Even with the Group Painting trick, if you try to generate a video without any "practice," the model gets confused. It sees a perfect first frame, but then tries to guess the second frame with no hints, leading to a jumpy, broken video.

The Solution:
The authors realized that during training, the model always saw some of the previous frames. So, during the actual creation (inference), they do the same thing.

  • The Analogy: Imagine you always practiced for a test with part of the answer sheet covered, learning to reason around the gaps. If the real test suddenly hands you the full sheet, the strategy you practiced no longer matches the situation, and your answers fall apart.
  • The Fix: When generating the video, Lumos-1 intentionally "blinds" itself to a portion of the frame it just created. It pretends it doesn't know everything, forcing itself to use the "Group Painting" logic to fill in the gaps. This keeps the motion smooth and the video consistent, just like it did during practice.
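This matched inference trick amounts to re-hiding a fraction of a just-generated frame before it is used as context for the next one, so conditions at generation time mirror the tube-masked training. The `mask_id` token and the uniform sampler below are assumptions made for the sketch.

```python
import numpy as np

def remask_context(tokens, mask_ratio, mask_id, rng):
    """Sketch of inference-time re-masking: before conditioning on a
    frame the model just produced, hide a fraction of its tokens so the
    context looks like it did during tube-masked training."""
    hide = rng.random(tokens.shape) < mask_ratio
    out = tokens.copy()
    out[hide] = mask_id  # replace hidden positions with a mask token
    return out

rng = np.random.default_rng(0)
frame = np.arange(16).reshape(4, 4)  # stand-in for a generated token grid
ctx = remask_context(frame, mask_ratio=0.5, mask_id=-1, rng=rng)
```

In a full generator, `ctx` rather than `frame` would be fed back as conditioning, and the masked positions for the next frame would follow the same tube pattern used in training.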

The Result?

By combining these three tricks, Lumos-1 can generate high-quality videos (Text-to-Video, Image-to-Video) using a standard LLM architecture.

  • Efficiency: It's much faster than previous methods because it doesn't paint pixel-by-pixel.
  • Quality: It understands motion and space better because of the 3D Compass.
  • Simplicity: It doesn't need a massive, separate brain to understand text; it uses the same brain for both reading and painting.

In short, Lumos-1 takes a text expert, gives them a 3D map, teaches them to paint in groups instead of lines, and forces them to practice with the lights dimmed. The result is a model that can turn a simple sentence into a beautiful, moving movie.
