Narrative Weaver: Towards Controllable Long-Range Visual Consistency with Multi-Modal Conditioning

The paper introduces "Narrative Weaver," a novel framework that achieves controllable, long-range visual consistency in generative AI by integrating multimodal narrative planning with a dynamic memory bank, validated through extensive experiments and a newly released e-commerce advertising dataset.

Zhengjian Yao, Yongzhi Li, Xinyuan Gao, Quan Chen, Peng Jiang, Yanye Lu

Published Tue, 10 Ma

Imagine you are a director trying to make a movie. You have a script, a cast, and a vision. But there's a problem: every time the camera cuts to a new scene, the actors' faces change, their clothes change color, and the background suddenly looks like a different planet. This is the current state of most AI video generators: they are great at making short, beautiful clips, but they lose the thread when asked to tell a long story.

Enter Narrative Weaver. Think of it as a "Super-Director" AI built to tackle this problem. It doesn't just generate images; it weaves a consistent, long-term story where characters and settings stay true to themselves from the first frame to the last.

Here is how it works, broken down into simple concepts:

1. The Problem: The "Drifting" Actor

Current AI models are like actors who forget their lines and costumes after every take. If you ask them to show a woman in a red dress walking through a park, then sitting on a bench, then walking again, the AI might suddenly make her wear a blue shirt, change her hair color, or teleport her to a beach. This is called "visual drift."
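
To make "visual drift" a little more concrete, here is a minimal sketch of one way it could be scored. This is an illustration, not the paper's metric: it assumes each frame has already been turned into an appearance vector by some encoder, and the function names (`cosine_similarity`, `drift_scores`) and the toy 3-number vectors are invented for the example.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def drift_scores(reference, frames):
    """Return 1 - cosine similarity to the reference for each frame.

    0.0 means the frame looks like the reference; larger means more drift.
    """
    return [1.0 - cosine_similarity(reference, frame) for frame in frames]


# Toy "appearance" vectors, invented purely for illustration.
reference = [1.0, 0.0, 0.0]
frames = [
    [1.0, 0.0, 0.0],  # same look as the reference
    [0.8, 0.6, 0.0],  # partially changed look
    [0.0, 1.0, 0.0],  # completely different look
]

scores = drift_scores(reference, frames)
for index, score in enumerate(scores):
    print(f"frame {index}: drift = {score:.2f}")
```

In this toy setup the score climbs as the frames move away from the reference, which is the "actor forgetting the costume" effect described above.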

2. The Solution: The "Three-Person Crew"

Narrative Weaver acts like a highly organized film crew with three distinct roles working together:

  • The Screenwriter (The MLLM): This is a multimodal large language model (MLLM) that acts as the planner. Before drawing a single picture, it reads your idea and writes a detailed storyboard. It plans the plot: "First, the woman stands. Then she sits. Then she smiles." It keeps the story logic tight.
  • The Memory Bank (The Continuity Supervisor): This is the secret sauce. Imagine a sticky note on the director's desk that says, "Remember: The woman has curly brown hair and a red dress." Every time the AI draws a new frame, it checks this "Memory Bank." It looks at the previous pictures and the original reference to ensure the character hasn't accidentally turned into a different person. It prevents the "drift."
  • The Artist (The Diffusion Model): This is the actual painter. It takes the Screenwriter's plan and the Continuity Supervisor's notes to create the actual image. Because it has the plan and the memory, it knows exactly what to paint to keep the story consistent.
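
The three roles above can be sketched as a small loop. This is a toy outline under stated assumptions, not the authors' implementation: `MemoryBank`, `plan_shots`, `render_frame`, and `generate_story` are hypothetical names, the planner and the artist are plain string stand-ins for the real models, and the choice to keep the reference plus the last two frames is an assumption made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class MemoryBank:
    """Keeps the original reference plus the most recent frames (hypothetical design)."""
    reference: str
    max_recent: int = 2
    recent: list = field(default_factory=list)

    def context(self) -> list:
        # The reference always stays; only older generated frames are evicted.
        return [self.reference] + self.recent

    def update(self, frame: str) -> None:
        self.recent.append(frame)
        self.recent = self.recent[-self.max_recent:]


def plan_shots(idea: str, actions: list) -> list:
    """Stand-in for the Screenwriter: one shot description per action."""
    return [f"{idea}, {action}" for action in actions]


def render_frame(shot: str, context: list) -> dict:
    """Stand-in for the Artist: records what the frame was conditioned on."""
    return {"shot": shot, "conditioned_on": list(context)}


def generate_story(idea: str, actions: list, reference: str) -> list:
    memory = MemoryBank(reference=reference)
    frames = []
    for index, shot in enumerate(plan_shots(idea, actions)):
        frames.append(render_frame(shot, memory.context()))
        memory.update(f"frame_{index}")  # remember the frame just produced
    return frames


story = generate_story(
    "woman in a red dress",
    ["standing", "sitting on a bench", "smiling", "walking"],
    reference="reference_image",
)
for frame in story:
    print(frame["shot"], "<-", frame["conditioned_on"])
```

The point of the sketch is the data flow: every frame sees the plan for its own shot plus whatever the memory currently holds, so the original reference is never lost even when older frames are dropped.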

3. The Training: Learning in Three Steps

You can't just throw a beginner into a Hollywood production. The authors trained Narrative Weaver in three specific stages, like a student actor:

  1. Stage 1 (The Screenwriter): They taught the AI how to write a good story and plan the shots. It learned to say, "Okay, scene 1 is a close-up, scene 2 is a wide shot."
  2. Stage 2 (The Translator): They taught the AI how to translate that text plan into visual concepts. It learned to say, "When the script says 'sunset,' I need to paint orange skies."
  3. Stage 3 (The Detail Artist): Finally, they taught it the hard part: Consistency. They showed it thousands of examples of how a character should look in different poses so it could learn to keep the face and clothes the same, no matter what the character is doing.
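
The staged schedule above can be pictured as a simple curriculum runner. This is a sketch under assumptions: the stage names, the `Stage` and `run_curriculum` helpers, and the fake training function are all hypothetical, and the summary does not say which parts of the model are updated in each stage, so the code only captures the ordering and the hand-off from one stage to the next.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    goal: str


# Goals paraphrased from the three stages described above.
CURRICULUM = (
    Stage("planning", "write a shot-by-shot storyboard from the idea"),
    Stage("translation", "turn the text plan into visual concepts"),
    Stage("consistency", "keep faces and clothes the same across poses"),
)


def run_curriculum(train_stage, stages=CURRICULUM):
    """Run the stages strictly in order, handing each checkpoint to the next stage."""
    checkpoint = None
    history = []
    for stage in stages:
        checkpoint = train_stage(stage, checkpoint)
        history.append(checkpoint)
    return history


def fake_train_stage(stage, previous):
    """Placeholder for real training: just records which skills have been learned."""
    learned = [] if previous is None else list(previous["learned"])
    learned.append(stage.name)
    return {"stage": stage.name, "learned": learned}


history = run_curriculum(fake_train_stage)
for checkpoint in history:
    print(checkpoint["stage"], "->", checkpoint["learned"])
```

Each stage starts from the checkpoint the previous one left behind, which is the "student actor" idea: consistency is only taught once planning and translation are already in place.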

4. The New "Script Library" (The Dataset)

One of the biggest hurdles was that no one had a good library of "long stories" to teach the AI. Most existing data was just short, random clips. So, the team built a new library called EAVSD (E-commerce Advertising Video Storyboard Dataset).

  • The Analogy: Imagine trying to teach a child to write a novel, but you only gave them post-it notes with single sentences. It's impossible. Narrative Weaver needed a whole library of storyboards, so the authors created 330,000 high-quality "storyboards" (sequences of images) specifically for advertising, where a product must look exactly the same in every single shot. This became the textbook for the AI.
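
To picture what one entry in such a library might look like, here is a hypothetical record layout. The actual EAVSD schema is not described in this summary, so every field name below (`storyboard_id`, `product_id`, `image_path`, `caption`) and the example file paths are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Shot:
    image_path: str   # hypothetical path to the shot image
    caption: str      # what happens in this shot
    product_id: str   # which product appears in the shot


@dataclass
class Storyboard:
    storyboard_id: str
    product_id: str
    shots: list

    def is_product_consistent(self) -> bool:
        """True if every shot shows the product this storyboard advertises."""
        return all(shot.product_id == self.product_id for shot in self.shots)


jacket_ad = Storyboard(
    storyboard_id="sb_0001",
    product_id="jacket_42",
    shots=[
        Shot("shots/park.png", "model walks through a park", "jacket_42"),
        Shot("shots/cafe.png", "model orders coffee", "jacket_42"),
        Shot("shots/street.png", "model crosses the street", "jacket_42"),
    ],
)

mixed_ad = Storyboard(
    storyboard_id="sb_0002",
    product_id="jacket_42",
    shots=[
        Shot("shots/park.png", "model walks through a park", "jacket_42"),
        Shot("shots/beach.png", "model on a beach", "jacket_99"),
    ],
)

print(jacket_ad.is_product_consistent())
print(mixed_ad.is_product_consistent())
```

The consistency check mirrors the requirement in the text: a storyboard is only useful as a teaching example if the same product appears in every shot.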

5. Why It Matters: From Posters to Movies

Why do we care?

  • For E-commerce: Imagine an ad where a model tries on a jacket. The AI can generate 20 different scenes of that model wearing the exact same jacket in a park, a cafe, and a street, without the jacket changing color or the model's face morphing.
  • For Filmmaking: It opens the door to AI-assisted storytelling where you can generate entire short films or comic books with a consistent style and characters, rather than just random, disconnected images.

The Bottom Line

Narrative Weaver is built on the idea that a story is more than just a collection of pretty pictures. It's a chain of events where the characters must remain the same. By combining a planner (who knows the story), a memory (who remembers the details), and an artist (who draws the pictures), it creates long, coherent visual narratives that actually make sense.

It's the difference between a child drawing random pictures on a page and a professional animator creating a cohesive cartoon.