Open-World Task and Motion Planning via Vision-Language Model Generated Constraints

The paper introduces OWL-TAMP, a framework that integrates Vision-Language Models (VLMs) into Task and Motion Planning (TAMP) systems. The VLM generates language-parameterized discrete and continuous constraints, letting robots solve complex, long-horizon manipulation tasks specified in natural language in open-world environments.

Nishanth Kumar, William Shen, Fabio Ramos, Dieter Fox, Tomás Lozano-Pérez, Leslie Pack Kaelbling, Caelan Reed Garrett

Published Wed, 11 Ma

Imagine you have a very smart, well-read robot assistant. This robot has two distinct personalities, and the paper you're reading is about how to get them to work together perfectly to solve complex chores.

The Two Personalities

  1. The "Big Picture" Thinker (The Vision-Language Model):
    Think of this as a highly educated human who has read every book and seen every movie. If you say, "Put the banana near the other fruit," this thinker instantly understands the vibe. They know what "near" means, they know bananas are slippery, and they understand the concept of "fruit."

    • The Flaw: They are terrible at math and physics. If you ask them to actually move the banana, they might say, "Just grab it!" without realizing the banana is stuck behind a heavy milk carton, or that the robot's arm can't twist that way without breaking. They dream in concepts, not coordinates.
  2. The "Strict Engineer" (The TAMP System):
    Think of this as a rigid, by-the-book construction foreman. They are amazing at math, geometry, and avoiding collisions. They know exactly how to move a robot arm so it doesn't hit a wall.

    • The Flaw: They are incredibly literal and lack common sense. They only understand a fixed list of commands like "Move Left" or "Place On Top." If you tell them to "Put the banana near the fruit," they get confused because "near" isn't in their dictionary. They can't imagine new scenarios; they can only do what they were explicitly programmed to do.

The Problem: The "Banana Dilemma"

In the past, if you asked a robot to "Put the banana near the other fruit," it would fail.

  • If you used only the Thinker, the robot would try to grab the banana, but its arm would crash into the milk carton because the Thinker didn't calculate the geometry.
  • If you used only the Engineer, the robot would say, "I don't know what 'near' means. I can only place things 'on top' or 'underneath'." It would refuse to do the task.

The Solution: OWL-TAMP (The Great Translator)

The authors created a system called OWL-TAMP (Open-World Language-based Task and Motion Planning). Think of it as a translator and a contract writer that sits between the Thinker and the Engineer.

Here is how it works, using a creative analogy:

1. The Sketch (The Thinker's Job)

You tell the system: "Put the banana near the other fruit."
The Thinker (VLM) steps in and draws a rough sketch of the plan. It says:

"Okay, first, we need to move the milk carton out of the way. Then, pick up the banana. Finally, place it down... but not just anywhere. It needs to be near the apple and pear."

The Thinker doesn't know exactly where "near" is in inches, but it knows the order of operations and the intent.

2. The Contract (The Translator's Job)

This is the magic part. The system takes the Thinker's vague idea ("near the apple") and turns it into a strict legal contract written in code.

  • The Thinker says: "Place it near the apple."
  • The System translates this into a Python function: "The banana's final position must be within 5 centimeters of the apple's position."

Now, the Thinker's vague idea has become a hard, mathematical rule that the Engineer can understand.
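To make this concrete, here is a minimal sketch of what such a "contract" could look like as a Python predicate. The function name, the 5-centimeter threshold, and the `(x, y, z)` pose representation are illustrative assumptions, not the paper's exact generated code.

```python
import math

def near_constraint(banana_pose, apple_pose, threshold=0.05):
    """Hypothetical VLM-generated continuous constraint: the banana's
    final position must lie within `threshold` meters of the apple's.
    Poses are assumed to be (x, y, z) tuples in meters."""
    return math.dist(banana_pose[:3], apple_pose[:3]) <= threshold

# The TAMP solver can now test candidate placements against this rule:
print(near_constraint((0.50, 0.20, 0.0), (0.53, 0.22, 0.0)))  # ~3.6 cm away -> True
print(near_constraint((0.50, 0.20, 0.0), (0.70, 0.20, 0.0)))  # 20 cm away -> False
```

The key move is that the fuzzy word "near" now exists as executable code the Engineer can evaluate millions of times while searching for a plan.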

3. The Execution (The Engineer's Job)

The Engineer (TAMP) takes this new contract. It looks at the robot's arm, the milk carton, and the banana.

  • It sees the milk carton blocking the path.
  • It calculates: "I cannot grab the banana yet. I must move the milk carton first."
  • It runs a complex simulation to find the perfect angle to grab the banana, move it, and place it exactly 5 centimeters from the apple, ensuring it doesn't crash into anything.
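One simple way to picture the Engineer's search is rejection sampling: propose candidate placements and keep only those that satisfy both the VLM-generated "near" rule and a collision check. Everything here is a toy assumption for illustration (the circle-overlap collision test, the sampling box, the function names); real TAMP systems use full geometric and kinematic solvers.

```python
import math
import random

def satisfies_near(pos, apple_pos, threshold=0.05):
    # VLM-generated constraint: within 5 cm of the apple (illustrative).
    return math.dist(pos, apple_pos) <= threshold

def collision_free(pos, obstacles, clearance=0.04):
    # Toy collision check: stay `clearance` meters from each obstacle center.
    return all(math.dist(pos, obs) > clearance for obs in obstacles)

def sample_placement(apple_pos, obstacles, tries=1000, seed=0):
    """Rejection-sample a banana placement meeting both constraints."""
    rng = random.Random(seed)
    for _ in range(tries):
        # Propose a point in a small box around the apple.
        pos = (apple_pos[0] + rng.uniform(-0.05, 0.05),
               apple_pos[1] + rng.uniform(-0.05, 0.05))
        if satisfies_near(pos, apple_pos) and collision_free(pos, obstacles):
            return pos
    return None  # planner would backtrack, e.g. move an obstacle first

apple = (0.50, 0.20)
milk_carton = [(0.52, 0.21)]  # an obstacle sitting right next to the apple
print(sample_placement(apple, milk_carton))  # a point near the apple, clear of the carton
```

Note the `return None` branch: when no placement works, a real planner backtracks and reconsiders the discrete plan, which is exactly how "move the milk carton first" emerges.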

Why is this a big deal?

Before this paper:
Robots were like a dog that only knows three commands: "Sit," "Stay," and "Fetch." If you asked it to "Go get the mail," it would stare at you blankly because "mail" wasn't in its vocabulary.

With OWL-TAMP:
The robot is like a smart intern.

  1. You give it a natural language goal ("Clean the table, but leave the red cup alone").
  2. The "Thinker" part figures out the strategy and writes a checklist with specific rules.
  3. The "Engineer" part executes the checklist with perfect precision, moving obstacles and calculating angles.

The Real-World Test

The authors didn't just test this in a computer simulation. They put it on a real robot arm in a lab. They gave it 19 different weird, complex tasks, like:

  • "Put the shortest object in the bin."
  • "Stack the blocks by color."
  • "Fry the spam and serve it on the plate."

The robot successfully figured out that it needed to move other objects out of the way, figured out which block was the shortest, and even figured out how to orient a cup so it wouldn't spill. It did all this without being specifically programmed for those exact tasks.

The Bottom Line

This paper solves the "Common Sense vs. Precision" problem. It teaches robots how to listen to human language (which is fuzzy and full of "near" and "straight") and translate it into strict mathematical rules (which are precise and safe), allowing them to tackle open-ended, real-world chores for the first time.