AnimeAgent: Is the Multi-Agent via Image-to-Video models a Good Disney Storytelling Artist?

The paper introduces AnimeAgent, a novel Image-to-Video-based multi-agent framework that overcomes the limitations of static diffusion models in custom storyboard generation by leveraging implicit motion priors and a mixed subjective-objective reviewer to achieve state-of-the-art consistency, prompt fidelity, and stylization.

Hailong Yan, Shice Liu, Tao Wang, Xiangtao Zhang, Yijie Zhong, Jinwei Chen, Le Zhang, Bo Li

Published 2026-02-25

Imagine you want to tell a story using pictures, like a comic book or a storyboard for a Disney movie. You have a script (the story) and a few reference photos of your characters. Your goal is to generate a sequence of images where the characters look exactly the same in every shot, the story makes sense, and the poses are dynamic and expressive.

This is the challenge of Custom Storyboard Generation (CSG).

The paper introduces AnimeAgent, a new AI system designed to solve the problems current AI tools face when trying to do this. Here is a simple breakdown of how it works, using some everyday analogies.

The Problem: The "Copy-Paste" Robot vs. The "Disney" Artist

Current AI tools for making storyboards are like robots that only know how to copy and paste.

  • The Static Trap: Most AI generates one picture at a time. If you ask it for "Snow White walking," it might draw her perfectly. But if you ask for the next picture of her walking, the robot forgets what she looked like in the first picture. Her hair changes color, her dress changes style, or she suddenly has three arms. It's like a robot trying to draw a movie by drawing a new character from scratch for every single frame.
  • The "One-Shot" Mistake: If the robot gets the first picture wrong (e.g., Snow White is holding a sword instead of an apple), it can't fix it. It just moves on to the next picture with the mistake, making the whole story confusing.
  • The Bad Judge: When these systems try to check their own work, they use "judges" (algorithms) that are easily fooled. They might think a picture is good just because it looks colorful, even if the character is holding a banana instead of a sword, or if the character's face is distorted.

The Solution: AnimeAgent (The "Disney Studio" Team)

The authors of this paper realized that to make a good story, you don't need a robot; you need a team of artists working like a real Disney studio. They built AnimeAgent, which uses three specialized AI "agents" (digital workers) to mimic the human animation process.

1. The Director (The Screenwriter)

  • Role: Before any drawing happens, the Director reads your messy, simple prompt (e.g., "Snow White walks in the forest") and turns it into a super-detailed script called a "Textual Dope Sheet."
  • Analogy: Think of this like a human director telling the crew exactly what to do. Instead of just saying "Walk," the Director specifies: "Snow White, wearing her blue dress with a red bow, walks slowly through a dense forest, looking sad, with the dwarfs' house visible in the distance."
  • Why it helps: It removes the guesswork. The AI knows exactly who the characters are and what the scene looks like before it starts drawing.
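The Director stage can be sketched as a function that merges a terse prompt with a locked "character bible" to produce one fully specified shot. This is a minimal illustrative sketch, not the paper's published API; every name, field, and helper below is an assumption.

```python
from dataclasses import dataclass

# Hypothetical sketch of the Director stage: expand a vague prompt into
# a structured "Textual Dope Sheet" entry by merging in known character
# and scene details. Names and fields are illustrative assumptions.

@dataclass
class DopeSheetEntry:
    shot: int
    characters: list   # who appears in the shot
    action: str        # what happens
    appearance: dict   # per-character look, locked across shots
    setting: str       # scene description

def write_dope_sheet(prompt: str, character_bible: dict, setting: str) -> DopeSheetEntry:
    """Turn 'Snow White walks in the forest' into a fully specified shot,
    so the Artist never has to guess who or where."""
    names = [n for n in character_bible if n.lower() in prompt.lower()]
    return DopeSheetEntry(
        shot=1,
        characters=names,
        action=prompt,
        appearance={n: character_bible[n] for n in names},
        setting=setting,
    )

bible = {"Snow White": "blue dress with a red bow, black hair, red headband"}
entry = write_dope_sheet("Snow White walks in the forest", bible,
                         "dense forest, dwarfs' house in the distance")
print(entry.appearance["Snow White"])  # the locked look travels with every shot
```

The key design point is that the character's appearance is fixed once, up front, and then attached to every shot, rather than re-described (or re-guessed) frame by frame.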

2. The Artist (The Animator)

  • Role: This agent uses a special Image-to-Video (I2V) model. Instead of drawing one static picture, it generates a short video clip of the action.
  • Analogy: Imagine asking a human animator to draw a character walking. They don't draw the start and end separately; they draw the movement. The AI does the same. It creates a smooth "motion trajectory."
  • The Magic Trick: Because it's generating a video, the AI "remembers" the character's face and clothes from the first frame as it moves through the scene. It's like a puppeteer moving a marionette; the puppet stays the same, but the movement is fluid. This solves the "copy-paste" problem.
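The data flow behind that "magic trick" can be made visible with a stub: a real I2V model is a diffusion network, but the sketch below, where all names are assumptions, shows why conditioning on a single reference frame keeps identity constant while only the pose changes.

```python
# Illustrative sketch of why generating a video beats generating
# independent images: every frame inherits the identity of the first
# frame instead of being redrawn from scratch. The I2V model is faked
# with a stub; function and field names are assumptions.

def i2v_generate(first_frame: dict, action: str, num_frames: int = 8) -> list:
    """Each generated frame carries the first frame's identity;
    only the pose advances along the motion trajectory."""
    frames = []
    for t in range(num_frames):
        frames.append({
            "identity": first_frame["identity"],  # carried, never redrawn
            "pose": f"{action}, step {t}/{num_frames - 1}",
        })
    return frames

clip = i2v_generate({"identity": "Snow White (blue dress, red bow)"},
                    "walking through the forest")
# Identity is constant across the clip; only the pose varies.
print(all(f["identity"] == clip[0]["identity"] for f in clip))
```

Contrast this with the "copy-paste robot" from earlier: an image-by-image generator would call something like `draw("Snow White walking")` eight separate times, with nothing forcing the eight results to depict the same character.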

3. The Reviewer (The Critic)

  • Role: This is the quality control team. It doesn't just look at the final picture; it watches the whole video and picks the best moments (the "Extremes").
  • Analogy: In animation, the most important frames are the "key poses" (like the moment a character jumps or the moment they cry). The Reviewer watches the video, finds these peak moments, and checks:
    • Did the character look like the reference? (Consistency)
    • Did the story make sense? (Logic)
    • Is it beautiful? (Aesthetics)
  • The Loop: If the Reviewer sees a mistake (e.g., "Wait, Snow White is wearing a hat in this shot, but she shouldn't be"), it tells the Director to fix the script, and the Artist redraws the scene. This happens in a loop until it's perfect.
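The generate-review-revise loop above can be sketched in a few lines. The scoring functions here are stubs (in the paper, the Reviewer mixes subjective and objective checks); all names, thresholds, and the revision mechanism are illustrative assumptions, not the authors' implementation.

```python
# Hedged sketch of the Reviewer's feedback loop: generate a clip, pick
# the best "extreme" frame, score it on the three axes, and if any axis
# fails, patch the script and try again. All names are assumptions.

def review(frame: dict, reference: str) -> dict:
    return {
        "consistency": 1.0 if frame["identity"] == reference else 0.0,
        "logic": 1.0,       # stub: does the shot follow the script?
        "aesthetics": 0.9,  # stub: is the key pose well composed?
    }

def storyboard_loop(script, reference, generate, max_rounds=3, threshold=0.8):
    for round_ in range(max_rounds):
        clip = generate(script)
        # Pick the "extreme": the frame with the best overall score.
        best = max(clip, key=lambda f: sum(review(f, reference).values()))
        scores = review(best, reference)
        if min(scores.values()) >= threshold:
            return best, round_ + 1  # accepted key frame
        # Reviewer tells the Director which axis failed; script is patched.
        worst_axis = min(scores, key=scores.get)
        script = script + f" [revise: fix {worst_axis}]"
    return best, max_rounds

def fake_generate(script):
    # Stub Artist: first round draws the wrong identity; revision fixes it.
    ident = "Snow White" if "[revise" in script else "Snow White with a hat"
    return [{"identity": ident, "pose": p} for p in ("walk", "turn", "cry")]

best, rounds = storyboard_loop("Snow White walks", "Snow White", fake_generate)
print(rounds)  # one rejection, then an accepted revision
```

The loop terminates either when all three checks pass the threshold or after a fixed number of rounds, which matches the "redraw until it's right, but not forever" behavior described above.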

The Secret Sauce: "Straight Ahead" vs. "Pose to Pose"

The paper mentions a classic Disney technique called "Straight Ahead and Pose to Pose."

  • Pose to Pose: Drawing the start and end, then filling in the middle. (Good for structure, but can feel stiff).
  • Straight Ahead: Drawing frame-by-frame from start to finish. (Good for fluid motion, but hard to control).

AnimeAgent combines them. It uses the Director to set the structure (Pose to Pose) and the Artist to generate fluid motion (Straight Ahead). This gives you a story that is both logically sound and full of life.

The Results: Why is this a big deal?

The researchers tested AnimeAgent against other AI tools and even commercial platforms (like those from big tech companies).

  • Better Characters: The characters stayed consistent (no weird face swaps).
  • Better Stories: The AI actually followed the script instead of making up random things.
  • Better Art: The images looked more like professional animation and less like a glitchy video game.

They even created a new "test" (a dataset with human-annotated ground truth) to prove that their system is actually better at telling stories than the old ways.

Summary

AnimeAgent is like upgrading from a photocopier (which just copies static images and loses details) to a live animation studio (where a director plans the scene, an animator draws the movement, and a critic ensures the story makes sense). It's the first AI system designed specifically to tell high-quality, consistent animated stories, just like the Disney legends did.
