EmboAlign: Aligning Video Generation with Compositional Constraints for Zero-Shot Manipulation

EmboAlign is a data-free framework for zero-shot robotic manipulation. It uses vision-language models to extract compositional constraints, then applies those constraints to select physically plausible video generation rollouts and to refine robot trajectories, significantly improving task success rates without any task-specific training data.

Gehao Zhang, Zhenyang Ni, Payal Mohapatra, Han Liu, Ruohan Zhang, Qi Zhu

Published 2026-03-09

Imagine you want to teach a robot to perform a delicate task, like stacking blocks or pouring water, but you don't want to spend months training it on that specific job. You just want to give it a verbal command and let it figure it out. This is called Zero-Shot Manipulation.

The paper introduces a new system called EmboAlign that solves a major problem in this field. Here is the simple breakdown using everyday analogies.

The Problem: The "Dreamer" vs. The "Realist"

To get a robot to move, researchers have been using two different types of AI tools:

  1. The Dreamer (Video Generative Model - VGM):
    Think of this AI as a Hollywood director who has watched millions of movies. If you tell it, "Stack the green block on the red one," it can instantly generate a beautiful, smooth video of exactly how that should look.

    • The Catch: Because it learned from movies, it sometimes "hallucinates." It might make the block float, pass through the table, or disappear mid-air. It looks great on screen, but if you tried to do that in real life, physics would break.
  2. The Realist (Vision-Language Model - VLM):
    Think of this AI as a strict physics teacher or a safety inspector. It doesn't generate videos; instead, it reads the instructions and understands the rules. It knows things like: "The red block must stay still," "The green block must come from above," and "Nothing can melt or vanish."

    • The Catch: The Realist is great at knowing the rules, but it's bad at imagining the actual movement. If you ask it to plan the motion from scratch, it often gets stuck or comes up with a clumsy, inefficient path.

The Old Way: Researchers tried to take the Dreamer's video and force the robot to copy it. But because the video had "movie magic" (physics errors) and the robot's sensors aren't perfect, the robot would crash or fail.

The Solution: EmboAlign (The "Editor" and "Coach")

EmboAlign is a new framework that acts as a bridge between the Dreamer and the Realist. It uses the Realist (the VLM) to check the Dreamer's work in two specific stages.

Stage 1: The Script Editor (Rollout Selection)

Imagine the Dreamer (VGM) is an actor who improvises 10 different takes of the scene.

  • Without EmboAlign: You might pick the most dramatic take, even if the actor walks through a wall.
  • With EmboAlign: The Realist (VLM) acts as a Script Editor. It looks at all 10 takes and says:
    • "Take #4 is bad; the block disappeared."
    • "Take #7 is bad; the block went through the table."
    • "Take #2 is perfect; it follows all the physics rules."
    • Result: The system discards the bad videos and only keeps the one that actually makes sense in the real world.
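The selection step above can be sketched in a few lines: score each candidate rollout by how many constraints it satisfies and keep the best one. This is a minimal illustration, not the paper's actual interface; `check_constraint` is a hypothetical stand-in for a VLM query, and the rollouts here are toy dictionaries rather than real videos.

```python
def check_constraint(rollout, constraint):
    """Hypothetical stand-in for a VLM query: does this rollout
    satisfy one rule (e.g., 'the red block stays still')?"""
    return constraint in rollout["satisfied"]

def select_rollout(rollouts, constraints):
    """Keep the rollout that satisfies the most constraints."""
    def score(rollout):
        return sum(check_constraint(rollout, c) for c in constraints)
    return max(rollouts, key=score)

# Toy "takes" mirroring the analogy above.
constraints = ["target stays still", "nothing vanishes", "approach from above"]
takes = [
    {"name": "take_4", "satisfied": ["target stays still"]},   # block disappeared
    {"name": "take_7", "satisfied": ["nothing vanishes"]},     # went through table
    {"name": "take_2", "satisfied": list(constraints)},        # follows all rules
]
best = select_rollout(takes, constraints)
print(best["name"])  # -> take_2
```

In practice the "score" would come from asking the VLM to verify each constraint against the generated video frames; the max-score selection logic stays the same.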

Stage 2: The Coach (Trajectory Optimization)

Now that you have the "perfect" video (Take #2), you need to translate it into robot arm movements. This is like translating a dance video into instructions for a clumsy robot.

  • The Problem: Even the best video has tiny errors when you try to copy it. Maybe the depth looks slightly off, or the angle is wrong. If the robot tries to copy it exactly, it might miss the block.
  • The Solution: EmboAlign acts as a Coach. It takes the robot's initial attempt (based on the video) and runs a "correction drill." It uses the Realist's rules (e.g., "Stay 5cm above the table," "Don't hit the bottle") to nudge the robot's movements.
    • It says, "You're 2cm too low; move up."
    • "You're about to hit the bottle; turn left."
    • Result: The robot's final movement is a refined, safe version of the original video idea.
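The "correction drill" can be sketched as projecting each waypoint of the initial trajectory back into a safe region. The two rules below, a height floor and a keep-out radius around an obstacle, are illustrative stand-ins for the VLM-derived constraints; the paper's actual optimization is not reproduced here.

```python
import math

def refine(waypoints, min_z=0.05, obstacle=(0.3, 0.3), clearance=0.1):
    """Nudge each (x, y, z) waypoint to satisfy two toy constraints:
    stay above a height floor, and stay outside a disc around an obstacle."""
    refined = []
    for x, y, z in waypoints:
        # "You're too low; move up." -> enforce the height floor.
        z = max(z, min_z)
        # "You're about to hit the bottle; move away." -> push the
        # waypoint radially out to the clearance boundary.
        ox, oy = obstacle
        dx, dy = x - ox, y - oy
        dist = math.hypot(dx, dy)
        if dist < clearance:
            scale = clearance / max(dist, 1e-9)
            x, y = ox + dx * scale, oy + dy * scale
        refined.append((x, y, z))
    return refined

# First waypoint is 2cm below the floor; second grazes the obstacle.
traj = [(0.0, 0.0, 0.03), (0.3, 0.32, 0.2)]
safe = refine(traj)
print(safe[0])  # -> (0.0, 0.0, 0.05): lifted to the height floor
```

Each pass through `refine` is one "nudge"; a real system would iterate corrections like these (or solve them jointly) until every constraint holds along the whole path.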

Why This Matters

The paper tested this on a real robot with six different tasks, like stacking blocks, pressing a stapler, and pouring water.

  • The Result: The system improved the robot's success rate by 43% compared to the best previous methods.
  • The Magic: It did this without needing to retrain the robot on any new data. It just used the "Dreamer" to imagine the motion and the "Realist" to ensure it was safe and physically possible.

Summary Analogy

Imagine you are trying to teach a child to ride a bike.

  • The VGM is a video of an Olympic cyclist doing a perfect trick. It's inspiring, but if the child tries to copy it exactly, they might crash because the video ignores the child's balance.
  • The VLM is a coach who knows the rules of balance and safety.
  • EmboAlign is the coach watching the Olympic video, picking the safest version of the trick, and then guiding the child step-by-step to ensure they don't fall, correcting their balance in real-time.

By combining the creativity of video generation with the logic of physical constraints, EmboAlign allows robots to perform complex, precise tasks instantly, just by listening to a human instruction.