SPATIALALIGN: Aligning Dynamic Spatial Relationships in Video Generation

This paper introduces SPATIALALIGN, a self-improvement framework for text-to-video generation. It fine-tunes models with a zeroth-order regularized Direct Preference Optimization method and a novel geometry-based DSR-SCORE metric, so that generated videos better follow the dynamic spatial relationships specified in text prompts.

Fengming Liu, Tat-Jen Cham, Chuanxia Zheng

Published 2026-03-02

Imagine you are a director giving instructions to a very talented, but slightly confused, movie actor. You say, "Start standing on the left of the tree, then walk to the right of the tree."

A human actor understands this immediately. But if you ask a current AI video generator (like the ones making movies from text), it often gets the script wrong. It might start on the right, or it might just stand in the middle and never move, or it might teleport. It sees the words "left" and "right," but it doesn't truly understand the geometry of the movement.

This paper, SPATIALALIGN, is like a specialized coach that teaches these AI actors how to follow spatial directions perfectly. Here is how they did it, broken down into simple concepts:

1. The Problem: The AI is "Spatially Clueless"

Current AI video generators are great at making things look pretty (good lighting, nice colors), but they are terrible at logic. If you tell them, "A cat jumps from the top of a box to the bottom," the AI might make the cat jump up instead, or make the box disappear. They prioritize "looking cool" over "making sense."

2. The Solution: A New "Ruler" (DSR-SCORE)

To teach the AI, you first need a way to grade its homework.

  • The Old Way: Researchers used other AI models (Vision-Language Models) to watch the video and say, "Yes, that looks right" or "No, that looks wrong." The authors found this unreliable. It's like asking a colorblind person to judge if a painting is red. They might guess, but they aren't precise.
  • The New Way (DSR-SCORE): The authors built a mathematical ruler. Instead of asking an AI to "guess" if the video is right, they use a computer program to literally measure the distance between the animal and the object in every single frame.
    • Analogy: Imagine the video is a graph. The "ruler" checks: "Did the animal start on the left side of the graph? Did it end on the right side? Did it move smoothly in between?" It gives a score from 0 to 1 based on pure geometry, not guesswork.
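
The "ruler" described above can be sketched in a few lines. This is a hypothetical toy version, not the paper's actual DSR-SCORE formula: it assumes we already have per-frame horizontal positions of the subject and the reference object (say, from an object detector) and checks the three conditions from the analogy.

```python
def dsr_score(subject_x, object_x):
    """Toy score in [0, 1] for "start left of X, end right of X":
    1.0 if the subject starts left of the object, ends right of it,
    and moves rightward smoothly in between."""
    n = len(subject_x)
    starts_left = 1.0 if subject_x[0] < object_x[0] else 0.0
    ends_right = 1.0 if subject_x[-1] > object_x[-1] else 0.0
    # Smoothness proxy: fraction of frame-to-frame steps moving rightward.
    steps = [subject_x[i + 1] - subject_x[i] for i in range(n - 1)]
    monotone = sum(1 for s in steps if s >= 0) / max(len(steps), 1)
    return (starts_left + ends_right + monotone) / 3.0

good = dsr_score([0.1, 0.3, 0.5, 0.7, 0.9], [0.5] * 5)  # walks left -> right
bad = dsr_score([0.5] * 5, [0.5] * 5)                    # never moves
```

The key point is that every term is a direct geometric measurement, so the score cannot "hallucinate" the way a Vision-Language Model judge can.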

3. The Training: The "Taste Test" (DPO)

Now that they have a ruler to measure the videos, they need to teach the AI to make better ones.

  • The Old Way (Supervised Fine-Tuning): This is like showing the AI a thousand perfect videos and saying, "Copy this." The problem is the AI might just memorize the pictures without learning the rule. It's like a student memorizing the answer key instead of learning math.
  • The New Way (Direct Preference Optimization - DPO): This is more like a taste test.
    1. The AI generates 10 different videos for the same prompt.
    2. The "Ruler" (DSR-SCORE) grades them.
    3. The AI is shown the "Winner" (the video that moved correctly) and the "Loser" (the video that failed).
    4. The AI is told: "You like the Winner more than the Loser. Adjust your brain so you make more Winners."
    • Analogy: It's like training a dog. You don't just show the dog a picture of a "sit." You wait for it to sit, give it a treat (the "Winner" signal), and ignore it when it stands up (the "Loser" signal). Over time, the dog learns the behavior.
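
The winner/loser "taste test" corresponds to the standard DPO objective. Below is a minimal sketch of that textbook loss, not the paper's exact regularized variant; the log-probability inputs are placeholders for whatever probabilities the video model assigns to the winning and losing videos.

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss: push the model to prefer the winner over the
    loser *relative to a frozen reference model*. logp_* are the current
    model's log-probs of the winning/losing videos; ref_logp_* are the
    reference model's."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log(sigmoid(margin)): shrinks as the model favors the winner more.
    return math.log(1.0 + math.exp(-margin))

# The model favoring the winner more than the reference does -> lower loss.
closer = dpo_loss(logp_w=-10.0, logp_l=-14.0, ref_logp_w=-12.0, ref_logp_l=-12.0)
neutral = dpo_loss(logp_w=-12.0, logp_l=-12.0, ref_logp_w=-12.0, ref_logp_l=-12.0)
```

Note the reference model in the denominator of the comparison: the AI is only rewarded for preferring the winner more strongly than it did before training, which is what anchors it to its starting point.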

4. The Secret Sauce: "Zeroth-Order Regularization"

There was a catch. When the AI learned from these "Winners" and "Losers," training went off the rails. It started making videos that followed the directions perfectly but looked strange (oversaturated colors, animals dissolving into blobs). It was gaming the score rather than learning the rule.

The authors added a safety net (Zeroth-Order Regularization).

  • Analogy: Imagine the AI is a student taking a test. The "Winner/Loser" method tells them which answers are right. But the safety net says, "Don't change your handwriting or your style just to get the right answer; keep your natural look." It keeps the video looking natural while still teaching it the spatial rules.
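
This summary doesn't spell out the exact form of the paper's regularizer, but "zeroth-order" has a standard meaning: estimating a gradient purely from function evaluations, which is useful when the safety-net penalty (e.g. "stay close to the reference model's natural look") can be scored but not differentiated. A generic two-point zeroth-order estimator, with `R` and `theta` as illustrative placeholders, looks like:

```python
import random

def zo_grad(R, theta, eps=1e-3, n_dirs=16):
    """Estimate grad R(theta) from function values alone, via central
    finite differences along random Gaussian directions.
    theta is a flat list of parameters; R maps such a list to a scalar."""
    d = len(theta)
    grad = [0.0] * d
    for _ in range(n_dirs):
        u = [random.gauss(0.0, 1.0) for _ in range(d)]
        plus = [t + eps * ui for t, ui in zip(theta, u)]
        minus = [t - eps * ui for t, ui in zip(theta, u)]
        # Directional slope of R along u, averaged over directions.
        scale = (R(plus) - R(minus)) / (2.0 * eps * n_dirs)
        for i in range(d):
            grad[i] += scale * u[i]
    return grad
```

In a DPO setup, an estimate like this could steer the model away from "weird but high-scoring" videos using only black-box evaluations of the penalty, no backpropagation through it required.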

The Result

After this training, the AI became a master of direction.

  • Before: "A fox is on the left of a stump, then moves to the right." -> The AI made the fox stay in the middle or move the wrong way.
  • After: The AI generates a video where the fox clearly starts on the left and walks smoothly to the right, exactly as requested.

Why This Matters

This isn't just about foxes and stumps. It's about teaching AI to understand physics and logic in the real world. If we want AI to help robots navigate a room, or simulate how objects interact, the AI needs to understand that "left" is different from "right," and that things move in space. SPATIALALIGN gives AI a brain for spatial reasoning, not just a pretty face.
