IMPACT: Intelligent Motion Planning with Acceptable Contact Trajectories via Vision-Language Models

Imagine you are trying to grab a jar of spices from the very back of a messy kitchen cabinet. The shelf is packed tight with a delicate wine glass, a heavy glass vase, a stack of fragile bowls, and a soft, squishy teddy bear.

The Old Way (Traditional Robots):
A traditional robot is like a nervous, rule-abiding librarian. Its only rule is: "Do not touch anything."
To get the spice jar, the librarian-robot would try to find a path that weaves perfectly between the objects without brushing against a single one. If the shelf is too crowded, the robot gives up, saying, "Impossible! I can't get there without breaking something." It might try to lift its arm high over the clutter, but in a tight cabinet, there's no room to go up.

The New Way (IMPACT):
The paper introduces IMPACT, a robot that acts more like a clever, experienced human moving through a crowded room. It knows that sometimes, to get what you need, you have to nudge things out of the way. But it also knows the difference between a soft pillow and a crystal vase.

Here is how IMPACT works, broken down into simple steps:

1. The "Common Sense" Brain (Vision-Language Models)

First, IMPACT looks at the messy shelf and asks a super-smart AI (called a Vision-Language Model, or VLM) for advice. Think of this AI as a wise grandparent who has seen thousands of objects.

The AI looks at the wine glass and says, "That's fragile! Give it a high 'danger score'."
It looks at the teddy bear and says, "That's soft and squishy. Give it a low 'danger score'."
It looks at the spice jar (the goal) and says, "That's the target! Give it a negative score (a reward)."

2. The "Push Map" (Anisotropic Cost Map)

This is the clever part. Just knowing an object is "safe" isn't enough; you need to know how to push it.
Take the teddy bear. If you push it from one side, it might slide nicely into a corner. If you push it from the other side, it might bump into the wine glass and knock it over.
IMPACT creates a special map that doesn't just say "Teddy Bear = Safe." It says:

"Pushing the bear from the left is safe."
"Pushing the bear from the right is risky."
"Pushing the vase in any direction is a disaster."

This map is called "anisotropic," which is a fancy way of saying the safety depends on the direction you are coming from.

3. The "Smart Navigator" (Contact-Aware A*)

Now, the robot uses a GPS-like planner to find a route.

Traditional robots try to draw a path that never touches anything.
IMPACT draws a path that says: "I will gently nudge the teddy bear to the left (because the map says that's safe), slide past the wine glass without touching it, and grab the spice jar."

It calculates the "cost" of every move. Pushing the bear costs very little. Hitting the vase costs a million points. The robot finds the path with the lowest total cost, even if that path involves a little bit of contact.
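
To make the "lowest total cost" idea concrete, here is a toy sketch. The numbers and the two candidate routes are invented for illustration (only the million-point vase penalty echoes the text above), and the real planner searches over many more options than two:

```python
# Toy contact costs, invented for illustration; they are not values from the paper.
CONTACT_COST = {"free space": 0, "teddy bear": 1, "vase": 1_000_000}


def total_cost(path):
    """Add up the cost of everything a route passes through or touches."""
    return sum(CONTACT_COST[step] for step in path)


# Two hypothetical routes to the spice jar.
candidate_paths = {
    "nudge the bear": ["free space", "teddy bear", "free space"],
    "clip the vase": ["free space", "vase"],
}

# The planner keeps whichever route has the smallest total.
best = min(candidate_paths, key=lambda name: total_cost(candidate_paths[name]))
print(best, total_cost(candidate_paths[best]))
```

Even though the "clip the vase" route has fewer steps, the single vase contact makes it far more expensive than the route that nudges the bear.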

Why This Matters

In the real world, things are rarely perfectly organized.

Old robots get stuck in cluttered rooms because they are too afraid to touch anything.
IMPACT is like a person who can shuffle a pile of laundry to get to the shirt underneath, or slide a couch slightly to walk past it, without knocking over the lamp.

The Results

The researchers tested this in computer simulations and with a real robot arm in a lab.

Success Rate: IMPACT successfully grabbed the target objects much more often than the "no-touch" robots.
Human Preference: When humans watched videos of the robots, they preferred IMPACT. They felt the robot was being "smart" and "gentle" rather than clumsy or overly cautious.
Safety: It successfully avoided breaking fragile items (like the wine glass) while moving the soft ones (like the teddy bear).

In a Nutshell

IMPACT teaches robots to stop being afraid of touching things and start using common sense. It understands that not all collisions are bad; some are just a necessary part of getting the job done, as long as you know what to touch and how to push it. It turns a messy, impossible task into a manageable one by knowing the difference between a "soft nudge" and a "hard crash."

Technical Summary: "IMPACT: Intelligent Motion Planning with Acceptable Contact Trajectories via Vision-Language Models"

1. Problem Statement

Traditional robot motion planning operates under a strict collision-free constraint. While effective in open spaces, this approach often fails in densely cluttered environments, where a collision-free path to the target either does not exist or is impractical to execute (e.g., it requires an overly long, parabolic trajectory).

The paper addresses the challenge of contact-rich manipulation, where a robot must intentionally make contact with non-target "distractor" objects to reach a goal. The core difficulty lies in defining "acceptable contact":

Semantic Variability: Some objects (e.g., a soft pillow) can be pushed safely, while others (e.g., a glass vase) are fragile and dangerous to touch.
Directional Sensitivity: The safety of contact often depends on the direction and location of the push. Pushing a box from the side might be safe, while pushing it near its top edge could topple it.
Lack of Explicit Instructions: Existing methods often require explicit human language commands (e.g., "you may push the cup") to define acceptable contacts, which is not scalable.

Goal: Develop a motion planning framework that autonomously infers which objects can be touched, determines safe directions for contact, and generates trajectories that reach a target while minimizing damage to fragile obstacles.

2. Methodology: IMPACT Framework

The proposed IMPACT framework consists of two primary stages: Semantic Cost Inference and Directional Contact-Aware Planning.

A. Semantic Cost Inference via Vision-Language Models (VLMs)

Instead of hard-coding object properties, IMPACT leverages the commonsense knowledge of VLMs (specifically GPT-4o) to estimate object tolerance.

Input: An annotated RGB-D image of the scene (segmented using SAM2) and a text prompt describing the objects and the concept of "contact tolerance."
Process: The VLM assigns an integer safety cost (0–10) to each object.
- Low Cost (e.g., 0–3): Robust objects (e.g., toy bear, foam) that can be pushed.
- High Cost (e.g., 8–10): Fragile objects (e.g., wine glass, stacked bowls).
- Target: Assigned a cost of -1 to encourage the planner to reach it.
Output: A dictionary of object costs used to initialize the planning map.
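
As a minimal sketch of the output step, assume the VLM is prompted to reply with a JSON object mapping object names to integer costs. The reply format, the example values, and the helper name `parse_object_costs` are assumptions made for illustration, not details from the paper. Under those assumptions, the cost dictionary could be validated like this:

```python
import json

TARGET_COST = -1           # cost the summary says is assigned to the target
MIN_COST, MAX_COST = 0, 10  # integer range the summary gives for other objects


def parse_object_costs(vlm_reply: str, target: str) -> dict[str, int]:
    """Turn a JSON reply such as '{"toy bear": 2}' into a validated cost dictionary."""
    raw = json.loads(vlm_reply)
    costs: dict[str, int] = {}
    for name, value in raw.items():
        if name == target:
            continue  # the target's cost is fixed below, whatever the reply says
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, int):
            raise ValueError(f"cost for {name!r} is not an integer: {value!r}")
        if not MIN_COST <= value <= MAX_COST:
            raise ValueError(f"cost for {name!r} is outside 0-10: {value}")
        costs[name] = value
    costs[target] = TARGET_COST
    return costs


# Hypothetical reply text; a real reply would come from the VLM.
example_reply = '{"toy bear": 2, "wine glass": 9, "stacked bowls": 8, "spice jar": 5}'
object_costs = parse_object_costs(example_reply, target="spice jar")
print(object_costs)
```

Rejecting out-of-range or non-integer values up front keeps a malformed reply from silently skewing the planning map.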

B. Directional Contact-Aware Motion Planning

The framework transforms static object costs into a dynamic, anisotropic cost map to guide an A* planner.

3D Voxel Grid & 2D Projection:
- A 3D voxel grid is created where each voxel holds the object's cost.
- This is projected into a 2D top-down grid ( $M$ ) where $M[x,y]$ represents the maximum cost at that location.
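
The projection described above amounts to taking a maximum along the vertical axis. A small NumPy sketch, with a hypothetical 4 x 4 x 3 grid and made-up object placements (the axis order x, y, z is also an assumption):

```python
import numpy as np

# Hypothetical voxel grid indexed as (x, y, z); 0 marks free space.
voxels = np.zeros((4, 4, 3), dtype=int)
voxels[1, 1, 0:2] = 2   # a robust object (cost 2) filling the two lowest voxels
voxels[1, 1, 2] = 9     # a fragile object (cost 9) resting on top of it
voxels[2, 3, 0] = 8     # another fragile object (cost 8) on the shelf surface

# M[x, y] is the maximum cost found anywhere in the column above (x, y).
M = voxels.max(axis=2)
print(M)
```

Taking the maximum is the conservative choice: a column containing both a robust and a fragile object is treated as fragile.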
Anisotropic Cost Map Generation ( $M'$ ):
- Standard cost maps are isotropic (direction-agnostic). IMPACT introduces directional safety.
- For low-cost objects, the system samples multiple push outcomes (varying distance and angle) relative to the object's surface normal.
- It simulates whether a push in a specific direction causes a cascade of collisions with high-cost objects.
- A safety score ( $f_s$ ) is calculated based on the probability of a "safe" push outcome.
- The final anisotropic map $M'$ is a weighted combination of the original object cost and the directional safety score:
  $M'[x, y] = \alpha M[x, y] + (1 - \alpha)[10 - 10f_s(x, y)]$
- Result: The cost of an obstacle varies depending on the direction the robot approaches it.
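
The following sketch illustrates how a directional safety score could feed the formula for $M'$. It is not the paper's push simulation: a simple clearance test in the plane stands in for it, and the sampled distances, angular spread, clearance radius, and $\alpha = 0.5$ are all assumed values.

```python
import math

ALPHA = 0.5  # assumed weighting; the summary does not state the value used


def safety_score(obj_xy, push_angle, fragile_xy, clearance=1.2):
    """Fraction of sampled push outcomes that stay clear of every fragile object."""
    distances = (0.5, 1.0, 1.5)                  # assumed push distances
    offsets = (-math.pi / 8, 0.0, math.pi / 8)   # assumed spread around the push angle
    safe = total = 0
    for d in distances:
        for off in offsets:
            angle = push_angle + off
            # Where the pushed object would end up for this sample.
            end = (obj_xy[0] + d * math.cos(angle), obj_xy[1] + d * math.sin(angle))
            total += 1
            if all(math.dist(end, f) >= clearance for f in fragile_xy):
                safe += 1
    return safe / total


def anisotropic_cost(base_cost, f_s, alpha=ALPHA):
    """M' = alpha * M + (1 - alpha) * (10 - 10 * f_s), as in the formula above."""
    return alpha * base_cost + (1 - alpha) * (10 - 10 * f_s)


bear = (0.0, 0.0)                # hypothetical robust object with base cost 2
glass = [(2.0, 0.0)]             # hypothetical fragile neighbour
toward = safety_score(bear, 0.0, glass)       # pushing toward the glass
away = safety_score(bear, math.pi, glass)     # pushing away from the glass
print(anisotropic_cost(2, toward), anisotropic_cost(2, away))
```

With these toy numbers the same object is cheap to push away from the glass and noticeably more expensive to push toward it, which is the direction dependence the map is meant to capture.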
Contact-Aware A* Planner:
- The planner searches over states $s = (p, r, D)$, where $p$ is the end-effector position, $r$ is its orientation, and $D$ tracks the cumulative displacement of low-cost objects.
- Motion Primitives:
  - Move: Standard translation.
  - Rotate: Orientation change.
  - Push: Translates the end-effector while contacting an object, updating the world state ( $D$ ) based on the push direction and object properties.
- Cost Function: The path cost $g(s)$ includes the action cost (derived from $M'$ ) and a penalty for gripper placement near high-cost objects.
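
A heavily simplified sketch of the search is shown below. It is an illustration under stated assumptions rather than the paper's planner: the world is a small 2D grid, orientation $r$ and the Rotate primitive are omitted, $D$ is reduced to the set of cells already pushed clear, the proximity penalty is left out, and the cost values are made up.

```python
import heapq
import itertools


def contact_aware_astar(cost_map, start, goal):
    """A* over states (cell, displaced); `displaced` stands in for the world state D."""
    rows, cols = len(cost_map), len(cost_map[0])
    tie = itertools.count()  # tie-breaker so the heap never compares states directly

    def heuristic(cell):
        # Manhattan distance; admissible because every step costs at least 1.
        return abs(cell[0] - goal[0]) + abs(cell[1] - goal[1])

    start_state = (start, frozenset())
    frontier = [(heuristic(start), next(tie), 0.0, start_state, [start])]
    best_g = {start_state: 0.0}
    while frontier:
        _, _, g, (cell, displaced), path = heapq.heappop(frontier)
        if cell == goal:
            return g, path
        if g > best_g.get((cell, displaced), float("inf")):
            continue  # stale queue entry
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nxt = (cell[0] + dr, cell[1] + dc)
            if not (0 <= nxt[0] < rows and 0 <= nxt[1] < cols):
                continue
            contact = cost_map[nxt[0]][nxt[1]]
            if contact > 0 and nxt not in displaced:
                # Push: pay the contact cost once and record the displacement.
                step, new_displaced = 1.0 + contact, displaced | {nxt}
            else:
                # Move: free cell, or a cell that was already pushed clear.
                step, new_displaced = 1.0, displaced
            state, new_g = (nxt, new_displaced), g + step
            if new_g < best_g.get(state, float("inf")):
                best_g[state] = new_g
                heapq.heappush(
                    frontier,
                    (new_g + heuristic(nxt), next(tie), new_g, state, path + [nxt]),
                )
    return None


# Hypothetical map: the middle column is blocked by two fragile objects (9)
# and one robust object (1), so no contact-free route exists.
cost_map = [
    [0, 9, 0],
    [0, 1, 0],
    [0, 9, 0],
]
cost, path = contact_aware_astar(cost_map, start=(1, 0), goal=(1, 2))
print(cost, path)
```

A collision-free planner would report failure on this map, whereas the contact-aware search returns the route that pushes only the low-cost object.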

3. Key Contributions

IMPACT Framework: A novel system that formalizes "acceptable contact" by converting VLM-inferred semantic costs into a dense, anisotropic cost map. This allows the robot to reason about how and where to push objects safely.
Contact-Aware A* Planner: An extension of the A* algorithm that incorporates world state changes (object displacement) and directional safety scores to execute minimal-impact contact trajectories.
Zero-Shot Generalization: Unlike prior methods requiring fine-tuning or explicit language instructions for every new object, IMPACT uses VLMs in a zero-shot setting to infer contact tolerances for novel objects.
Comprehensive Evaluation: Validation across 3,200 simulation trials and 200 real-world trials, including a human subject study to assess subjective "acceptability."

4. Experimental Results

The authors evaluated IMPACT against baselines including Collision-Free planners (RRT, RRT*, A*), a method using VLM costs without directional analysis, and LAPP (Language-Conditioned Path Planning).

Simulation Performance:
- Success Rate: IMPACT achieved a 78.00% success rate, significantly outperforming Collision-Free baselines (~20–28%) and LAPP (50%).
- Safety: IMPACT resulted in lower path costs, shorter contact durations, and significantly less displacement of "unsafe" objects compared to baselines.
- Ablation: Removing VLM costs (setting all costs to 0) reduced success rates, indicating that semantic differentiation between objects is necessary.
Real-World Performance:
- Tested on a Franka Panda arm with 10 real-world scenes.
- Success Rate: IMPACT achieved 61% success, outperforming LAPP (49% on seen objects, 40% on unseen).
- Generalization: IMPACT maintained high performance on novel objects without fine-tuning, whereas LAPP struggled with unseen objects.
Human Evaluation:
- In a user study with 25 participants, humans preferred IMPACT trajectories over alternatives in the majority of comparisons.
- Users favored trajectories that utilized "gentle" contact with robust objects over those that either failed to reach the target or caused chaotic collisions.

5. Significance and Impact

Bridging Semantics and Control: IMPACT successfully bridges high-level semantic understanding (via VLMs) with low-level motion planning, enabling robots to operate in environments previously considered "too cluttered" for automation.
Redefining Safety: It shifts the paradigm from "avoid all contact" to "manage contact intelligently," allowing robots to navigate dense domestic or industrial settings (e.g., cluttered cabinets, shelves) more efficiently.
Scalability: By leveraging the generalization capabilities of VLMs, the approach reduces the need for extensive manual labeling or retraining for new environments and objects.
Future Directions: The paper identifies limitations such as open-loop execution (lack of real-time reaction to disturbances) and reliance on complete RGB-D observations, pointing toward future work in closed-loop perception and active sensing.

In summary, IMPACT demonstrates that robots can be taught to "push through" clutter intelligently, using AI to distinguish between a toy bear that can be moved and a wine glass that must be avoided, thereby expanding the operational envelope of robotic manipulation.