How Long Can Unified Multimodal Models Generate Images Reliably? Taming Long-Horizon Interleaved Image Generation via Context Curation

This paper introduces UniLongGen, a training-free inference strategy for long-horizon interleaved image generation. By dynamically curating the context to discard accumulated visual noise, it overcomes the reliability collapse caused by dense visual-token interference in unified multimodal models.

Haoyu Chen, Qing Liu, Yuqian Zhou, He Zhang, Zhaowen Wang, Mengwei Ren, Jingjing Ren, Xiang Wang, Zhe Lin, Lei Zhu

Published 2026-03-10

Imagine you are a master storyteller who can also draw. You want to tell a long, epic story where you write a paragraph, draw a picture, write another paragraph, draw another picture, and keep going for 40 or 50 turns.

This is what Unified Multimodal Models (AI that handles both text and images) are trying to do. But there's a big problem: the longer the story gets, the worse the drawings become.

By the time the AI reaches the 20th picture, the characters start looking like melted wax, the style changes randomly, and the story makes no sense. It's like a painter who starts with a masterpiece but, after 20 paintings, can't remember what the main character looks like anymore.

This paper, UniLongGen, figures out why this happens and gives the AI a simple trick to fix it.

The Problem: The "Cluttered Desk" Effect

Usually, when AI fails at long tasks, we think it's because it's "forgetting" things (like running out of memory). The authors say: No, that's not it.

Think of the AI's memory like a desk where it keeps all the reference photos and notes for the story.

  • Text is like notes: If you have 1,000 pages of notes, the AI might get a little confused, but it can still find the right one.
  • Images are like giant posters: Every time the AI draws a picture, it adds a massive, high-resolution poster to the desk.

The Real Issue:
When you have 20 posters on your desk, they start fighting for your attention. The AI looks at the current task ("Draw a cat") and gets distracted by a random poster from 15 turns ago that happens to have a blurry shape that looks slightly like a cat.

Because the AI is so good at finding patterns, it latches onto these "accidental matches." It's like trying to listen to a friend in a crowded room, but suddenly 20 people start shouting random words that sound kind of like what your friend said. The AI gets hijacked by these "noise" signals, and the drawing goes off the rails.

The authors call this "Active Pollution." The old images aren't just fading away; they are actively corrupting the new ones.
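The "crowded room" effect above has a simple mathematical face. Here is an illustrative sketch (not the paper's code, and with made-up numbers): in softmax attention, one truly relevant token can be out-shouted by thousands of stale image tokens that each match the query only weakly, because their small scores add up in the denominator.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a vector of attention scores."""
    e = np.exp(x - x.max())
    return e / e.sum()

relevant_score = 3.0   # one key that genuinely matches the current query
noise_score = 1.0      # each stale image token matches only weakly

# A single image can contribute thousands of visual tokens, so 20 old
# images easily means tens of thousands of weak "noise" keys.
for n_noise in (10, 1000, 10000):
    scores = np.array([relevant_score] + [noise_score] * n_noise)
    share = softmax(scores)[0]   # attention mass left for the relevant token
    print(f"{n_noise:>6} noise tokens -> relevant share {share:.4f}")
```

Even though no single noise token comes close to the relevant one, the relevant token's attention share collapses as the noise count grows, which is the quantitative version of "Active Pollution."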

The Solution: The "Smart Curator"

The paper proposes a method called UniLongGen. Instead of trying to remember everything (which causes the clutter), UniLongGen curates the AI's memory at inference time, with no retraining required.

Imagine the AI has a Smart Curator (a helpful assistant) who stands by the desk. Here is how the Curator works:

  1. The "One-Shot" Check: Before drawing the next picture, the Curator quickly glances at the entire history of the story (all the text and all the old pictures).
  2. The "Relevance" Test: The Curator asks the AI: "Which of these old pictures actually helps me draw the NEXT one?"
    • It ignores pictures that are just "there."
    • It ignores pictures that are too old to matter.
    • It picks only the top 4 or 5 most important images that define the character's face or the story's style.
  3. The "Trash Bin" (Crucial Step): This is the most important part. The Curator doesn't just hide the other pictures; it throws them away (removes them from the AI's immediate memory).
    • Why? If you just hide them, the AI might still peek at them and get distracted. If you throw them away, the AI cannot be distracted by them.

The "Two-Layer" Trick

The paper also discovered that the AI brain works in layers, like a factory assembly line:

  • Early Layers: These are good at reading the text instructions ("Draw a cat wearing a hat").
  • Late Layers: These are good at the actual drawing (the pixels and colors).

UniLongGen uses a clever split:

  • For the text instructions, it keeps the relevant text history.
  • For the drawing, it keeps only the relevant image history.

It doesn't mix them up: it gives the "text brain" the text it needs and the "drawing brain" the pictures it needs, and nothing else.
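In code, the split can be pictured as a routing rule per transformer layer. This is an assumed structure for illustration only (the real layer boundary and routing come from the paper's analysis, not from this sketch): early layers see the curated text history, late layers see the curated image history.

```python
from typing import List

def build_layer_context(layer_idx: int, n_layers: int,
                        text_ctx: List[str], image_ctx: List[str],
                        split: float = 0.5) -> List[str]:
    """Return the context a given layer is allowed to attend to.
    `split` (an assumption here) marks where instruction-reading
    layers end and drawing layers begin."""
    if layer_idx < int(n_layers * split):
        return text_ctx    # early layers: follow the instructions
    return image_ctx       # late layers: match faces, style, pixels

n_layers = 4
text_ctx = ["Draw a cat wearing a hat"]   # curated text history
image_ctx = ["img_19", "img_20"]          # curated image history
for i in range(n_layers):
    print(f"layer {i}: {build_layer_context(i, n_layers, text_ctx, image_ctx)}")
```

The point of the routing is isolation: no layer ever has to arbitrate between text and image history at once, so neither can pollute the other.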

The Result

By using this "Smart Curator" approach:

  • Quality stays high: The AI can draw 40+ pictures in a row, and the 40th picture looks just as good as the 1st.
  • Characters stay consistent: The main character doesn't turn into a monster halfway through.
  • It's faster: Because the AI isn't sifting through a desk full of old posters, it works much faster and uses less computer power.

In a Nutshell

Old Way: "Remember everything! Keep every single note and picture we ever made!" -> Result: The desk gets so messy the AI can't work.

UniLongGen Way: "Only keep the 5 most important pictures and the 5 most important notes. Throw the rest in the trash so we don't get distracted." -> Result: The AI stays focused, creative, and consistent for as long as the story goes on.

It's not about having a bigger memory; it's about having a cleaner workspace.