SPATIALALIGN: Aligning Dynamic Spatial Relationships in Video Generation

This paper introduces SPATIALALIGN, a self-improvement framework for text-to-video generation. It fine-tunes models with a zeroth-order regularized Direct Preference Optimization method and a novel geometry-based DSR-SCORE metric, so that generated videos better follow the dynamic spatial relationships specified in text prompts.

Fengming Liu, Tat-Jen Cham, Chuanxia Zheng

Published 2026-03-02

Imagine you are a director giving instructions to a very talented, but slightly confused, movie actor. You say, "Start standing on the left of the tree, then walk to the right of the tree."

A human actor understands this immediately. But if you ask a current AI video generator (like the ones making movies from text), it often gets the script wrong. It might start on the right, or it might just stand in the middle and never move, or it might teleport. It sees the words "left" and "right," but it doesn't truly understand the geometry of the movement.

This paper, SPATIALALIGN, is like a specialized coach that teaches these AI actors how to follow spatial directions perfectly. Here is how they did it, broken down into simple concepts:

1. The Problem: The AI is "Spatially Clueless"

Current AI video generators are great at making things look pretty (good lighting, nice colors), but they are terrible at logic. If you tell them, "A cat jumps from the top of a box to the bottom," the AI might make the cat jump up instead, or make the box disappear. They prioritize "looking cool" over "making sense."

2. The Solution: A New "Ruler" (DSR-SCORE)

To teach the AI, you first need a way to grade its homework.

  • The Old Way: Researchers used other AI models (Vision-Language Models) to watch the video and say, "Yes, that looks right" or "No, that looks wrong." The authors found this unreliable. It's like asking a colorblind person to judge if a painting is red. They might guess, but they aren't precise.
  • The New Way (DSR-SCORE): The authors built a mathematical ruler. Instead of asking an AI to "guess" if the video is right, they use a computer program to literally measure the distance between the animal and the object in every single frame.
    • Analogy: Imagine the video is a graph. The "ruler" checks: "Did the animal start on the left side of the graph? Did it end on the right side? Did it move smoothly in between?" It gives a score from 0 to 1 based on pure geometry, not guesswork.
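
The "ruler" described above can be sketched in a few lines. This is a hypothetical toy version, not the paper's actual DSR-SCORE formula: it assumes we already have per-frame horizontal positions of the subject and the reference object (say, from an object detector) and checks the three conditions from the analogy.

```python
def dsr_score(subject_x, object_x):
    """Toy score in [0, 1] for "start left of X, end right of X":
    1.0 if the subject starts left of the object, ends right of it,
    and moves rightward smoothly in between."""
    n = len(subject_x)
    starts_left = 1.0 if subject_x[0] < object_x[0] else 0.0
    ends_right = 1.0 if subject_x[-1] > object_x[-1] else 0.0
    # Smoothness proxy: fraction of frame-to-frame steps moving rightward.
    steps = [subject_x[i + 1] - subject_x[i] for i in range(n - 1)]
    monotone = sum(1 for s in steps if s >= 0) / max(len(steps), 1)
    return (starts_left + ends_right + monotone) / 3.0

good = dsr_score([0.1, 0.3, 0.5, 0.7, 0.9], [0.5] * 5)  # walks left -> right
bad = dsr_score([0.5] * 5, [0.5] * 5)                    # never moves
```

The key point is that every term is a direct geometric measurement, so the score cannot "hallucinate" the way a Vision-Language Model judge can.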

3. The Training: The "Taste Test" (DPO)

Now that they have a ruler to measure the videos, they need to teach the AI to make better ones.

  • The Old Way (Supervised Fine-Tuning): This is like showing the AI a thousand perfect videos and saying, "Copy this." The problem is the AI might just memorize the pictures without learning the rule. It's like a student memorizing the answer key instead of learning math.
  • The New Way (Direct Preference Optimization - DPO): This is more like a taste test.
    1. The AI generates 10 different videos for the same prompt.
    2. The "Ruler" (DSR-SCORE) grades them.
    3. The AI is shown the "Winner" (the video that moved correctly) and the "Loser" (the video that failed).
    4. The AI is told: "You like the Winner more than the Loser. Adjust your brain so you make more Winners."
    • Analogy: It's like training a dog. You don't just show the dog a picture of a "sit." You wait for it to sit, give it a treat (the "Winner" signal), and ignore it when it stands up (the "Loser" signal). Over time, the dog learns the behavior.
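
The winner/loser "taste test" corresponds to the standard DPO objective. Below is a minimal sketch of that textbook loss, not the paper's exact regularized variant; the log-probability inputs are placeholders for whatever probabilities the video model assigns to the winning and losing videos.

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss: push the model to prefer the winner over the
    loser *relative to a frozen reference model*. logp_* are the current
    model's log-probs of the winning/losing videos; ref_logp_* are the
    reference model's."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log(sigmoid(margin)): shrinks as the model favors the winner more.
    return math.log(1.0 + math.exp(-margin))

# The model favoring the winner more than the reference does -> lower loss.
closer = dpo_loss(logp_w=-10.0, logp_l=-14.0, ref_logp_w=-12.0, ref_logp_l=-12.0)
neutral = dpo_loss(logp_w=-12.0, logp_l=-12.0, ref_logp_w=-12.0, ref_logp_l=-12.0)
```

Note the reference model in the denominator of the comparison: the AI is only rewarded for preferring the winner more strongly than it did before training, which is what anchors it to its starting point.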

4. The Secret Sauce: "Zeroth-Order Regularization"

There was a catch. When the AI learned from these "Winners" and "Losers," training went off the rails. It started making videos that followed the directions perfectly but looked strange (oversaturated colors, animals dissolving into blobs). It was gaming the score rather than learning the rule.

The authors added a safety net (Zeroth-Order Regularization).

  • Analogy: Imagine the AI is a student taking a test. The "Winner/Loser" method tells them which answers are right. But the safety net says, "Don't change your handwriting or your style just to get the right answer; keep your natural look." It keeps the video looking natural while still teaching it the spatial rules.
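
This summary doesn't spell out the exact form of the paper's regularizer, but "zeroth-order" has a standard meaning: estimating a gradient purely from function evaluations, which is useful when the safety-net penalty (e.g. "stay close to the reference model's natural look") can be scored but not differentiated. A generic two-point zeroth-order estimator, with `R` and `theta` as illustrative placeholders, looks like:

```python
import random

def zo_grad(R, theta, eps=1e-3, n_dirs=16):
    """Estimate grad R(theta) from function values alone, via central
    finite differences along random Gaussian directions.
    theta is a flat list of parameters; R maps such a list to a scalar."""
    d = len(theta)
    grad = [0.0] * d
    for _ in range(n_dirs):
        u = [random.gauss(0.0, 1.0) for _ in range(d)]
        plus = [t + eps * ui for t, ui in zip(theta, u)]
        minus = [t - eps * ui for t, ui in zip(theta, u)]
        # Directional slope of R along u, averaged over directions.
        scale = (R(plus) - R(minus)) / (2.0 * eps * n_dirs)
        for i in range(d):
            grad[i] += scale * u[i]
    return grad
```

In a DPO setup, an estimate like this could steer the model away from "weird but high-scoring" videos using only black-box evaluations of the penalty, no backpropagation through it required.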

The Result

After this training, the AI became a master of direction.

  • Before: "A fox is on the left of a stump, then moves to the right." -> The AI made the fox stay in the middle or move the wrong way.
  • After: The AI generates a video where the fox clearly starts on the left and walks smoothly to the right, exactly as requested.

Why This Matters

This isn't just about foxes and stumps. It's about teaching AI to understand physics and logic in the real world. If we want AI to help robots navigate a room, or simulate how objects interact, the AI needs to understand that "left" is different from "right," and that things move in space. SPATIALALIGN gives AI a brain for spatial reasoning, not just a pretty face.
