SPIRAL: A Closed-Loop Framework for Self-Improving Action World Models via Reflective Planning Agents

SPIRAL is a closed-loop framework that enhances controllable long-horizon video generation by integrating a reflective planning process with iterative action world modeling, enabling self-improvement through explicit planning, object-centric decomposition, and feedback-driven refinement.

Yu Yang, Yue Liao, Jianbiao Mei, Baisen Wang, Xuemeng Yang, Licheng Wen, Jiangning Zhang, Xiangtai Li, Hanlin Chen, Botian Shi, Yong Liu, Shuicheng Yan, Gim Hee Lee

Published 2026-03-10

Imagine you are trying to teach a very talented, but slightly scatterbrained, artist how to paint a complex scene based on a verbal description.

If you just say, "Paint a soccer player dribbling past a defender and scoring a goal," the artist might:

  1. Paint the player dribbling, then suddenly stop halfway.
  2. Make the player jump over the defender like a cartoon character (which is not how soccer works).
  3. Forget the goal entirely and just paint a random field.

This is exactly the problem current AI video generators face. They are great at making short, pretty clips, but when you ask them to do a long, multi-step action, they get lost, hallucinate physics, or give up halfway through.

Enter SPIRAL. Think of SPIRAL not as a single artist, but as a highly organized film production crew working together to fix these mistakes.

Here is how the "crew" works, broken down into simple roles:

1. The Director (The PlanAgent)

Instead of letting the AI just "guess" the whole movie at once, SPIRAL introduces a Director.

  • What they do: Before a single frame is drawn, the Director reads your request ("Dribble, cross over, shoot") and breaks it down into a strict script.
  • The Analogy: Imagine a chef reading a recipe. They don't just throw ingredients in a pot; they say, "First, chop the onions. Then sauté them. Then add the tomatoes." The Director creates this step-by-step checklist for the AI to follow.

2. The Camera Crew (The World Model)

This is the actual AI that generates the video pixels.

  • What they do: They follow the Director's script. But instead of trying to film the whole movie in one take, they film it scene by scene.
  • The Analogy: They are like a camera operator who only focuses on the current step. "Okay, Director says 'chop onions,' so I will film chopping onions." Once that's done, they pause and wait for the next instruction.

3. The Script Supervisor (The CriticAgent)

This is the most important new addition. In a typical video generator, the camera crew films, and you only see the final movie. If there's a mistake, it's too late. In SPIRAL, a Script Supervisor watches the monitor and reviews each scene as soon as it is filmed.

  • What they do: As soon as the Camera Crew finishes a scene (e.g., the "chopping onions" part), the Supervisor checks it.
    • Did they actually chop the onions? (Action Completeness)
    • Did the knife hit the cutting board, or did it float in the air? (Physical Fidelity)
    • Did they skip a step?
  • The Feedback Loop: If the Supervisor sees a mistake (e.g., "The knife didn't touch the board!"), they don't let the movie move on. They yell, "Cut! Do it again, but make sure the knife hits the board!" The Camera Crew then re-films that specific scene until it passes the check.
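
For readers who think in code, the film-check-refilm loop described above can be sketched roughly like this. It is a toy Python sketch, not SPIRAL's implementation: the names Verdict, film_with_retries, toy_generate, and toy_critique are made up for illustration, clips are plain strings rather than video, and the max_retries cap is an assumption (this summary does not say how many retakes are allowed).

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Verdict:
    """Hypothetical critic output: a pass/fail flag plus corrective feedback."""
    passed: bool
    feedback: str = ""


def film_with_retries(
    steps: List[str],
    generate: Callable[[str, Optional[str]], str],
    critique: Callable[[str, str], Verdict],
    max_retries: int = 3,  # assumed cap, not taken from the paper
) -> List[str]:
    """Generate one clip per step, regenerating while the critic rejects it."""
    accepted = []
    for step in steps:
        feedback = None
        for _ in range(max_retries + 1):
            clip = generate(step, feedback)
            verdict = critique(step, clip)
            if verdict.passed:
                break
            feedback = verdict.feedback  # hand the critic's note to the next take
        accepted.append(clip)  # keep the last take even if retries ran out
    return accepted


# Toy stand-ins so the sketch runs end to end.
def toy_generate(step: str, feedback: Optional[str]) -> str:
    return f"{step} [fixed: {feedback}]" if feedback else f"{step} [draft]"


def toy_critique(step: str, clip: str) -> Verdict:
    if "[draft]" in clip:
        return Verdict(False, "knife must touch the board")
    return Verdict(True)


clips = film_with_retries(["chop onions", "saute onions"], toy_generate, toy_critique)
print(clips)
```

In this toy run every first take is rejected and every second take is accepted, so each scene is filmed exactly twice before the crew moves on.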

4. The Memory Bank (World Memory)

Long movies have a problem: characters forget who they are or what they were doing five minutes ago.

  • The Analogy: SPIRAL keeps a photo album of everything that has happened so far. When filming the next scene, the Camera Crew looks at the album to remember, "Oh right, the player is wearing a red shirt and is currently on the left side of the field." This prevents the video from drifting into nonsense.
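
One way to picture the photo album in code is a small store of facts about each object that is consulted before the next scene is filmed. The sketch below is purely illustrative: the WorldMemory class, its fields, and the text-based facts are assumptions made for this example, and this summary does not describe how SPIRAL actually represents its memory.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class WorldMemory:
    """Hypothetical album of per-object facts gathered from finished scenes."""
    facts: Dict[str, Dict[str, str]] = field(default_factory=dict)
    history: List[str] = field(default_factory=list)

    def update(self, step: str, observed: Dict[str, Dict[str, str]]) -> None:
        """Record a finished step and overwrite any facts that changed."""
        self.history.append(step)
        for obj, attrs in observed.items():
            self.facts.setdefault(obj, {}).update(attrs)

    def context(self) -> str:
        """Summarize the current state as text to condition the next scene on."""
        lines = []
        for obj in sorted(self.facts):
            attrs = ", ".join(f"{k}={v}" for k, v in sorted(self.facts[obj].items()))
            lines.append(f"{obj}: {attrs}")
        return "; ".join(lines)


memory = WorldMemory()
memory.update("dribble", {"player": {"shirt": "red", "position": "left"}})
memory.update("cross over", {"player": {"position": "center"}})
print(memory.context())  # player: position=center, shirt=red
```

The shirt color persists across scenes while the position is updated, which is the kind of consistency the album is meant to protect.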

5. The "Self-Improving" Coach (GRPO Training)

This is the secret sauce that makes the system get smarter over time.

  • The Analogy: Imagine the Director, Camera Crew, and Supervisor practice this movie 100 times. Every time they make a mistake, the Supervisor gives them a score.
  • The Result: The AI learns from these scores. It starts to "internalize" the rules. Eventually, the Camera Crew doesn't need the Supervisor to yell "Cut!" as often because they have learned to get it right the first time. They evolve from novices into professionals.
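
The scoring idea behind GRPO (Group Relative Policy Optimization) can be shown with a little arithmetic: several attempts at the same task are scored, and each attempt is judged relative to the group's average rather than on an absolute scale. The sketch below shows only that normalization step, with made-up scores. The function name is hypothetical, the use of the population standard deviation is a choice made here, and how SPIRAL computes its rewards is not covered in this summary.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize each reward against the mean and spread of its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; eps avoids division by zero
    return [(r - mu) / (sigma + eps) for r in rewards]


# Made-up Supervisor scores for four takes of the same scene.
scores = [0.2, 0.5, 0.8, 0.5]
advantages = group_relative_advantages(scores)
print([round(a, 2) for a in advantages])
```

Takes that beat the group average get a positive value and are encouraged in training, while below-average takes get a negative value and are discouraged.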

Why is this a big deal?

Previous AI video tools were like one-shot photographers: they took one picture based on a prompt. If the prompt was complex, the picture was often wrong.

SPIRAL is like a movie studio with a safety net.

  • It thinks first: It plans the steps.
  • It acts: It generates the video.
  • It reflects: It checks for errors and fixes them immediately.

This allows the AI to generate long, complex videos (like a full cooking tutorial or a sports play) where the physics make sense, the steps are completed, and the story doesn't fall apart halfway through. It turns video generation from "guessing" into "planning and executing."