Direct Contact-Tolerant Motion Planning With Vision Language Models

Imagine you are trying to walk through a crowded, messy room to get to the kitchen. In the middle of the floor, there are a few things: a heavy, solid bookshelf, a pile of empty cardboard boxes, and a curtain hanging from the ceiling.

The Old Way (Traditional Robots):
Most robots today are like people with a very strict rule: "Never touch anything!" They treat every object as if it were a solid wall.

If they see a box, they try to go around it.
If the path is blocked by a box and a curtain, they get stuck because they can't find a "collision-free" path.
They rely on a pre-drawn map, which is like trying to navigate a messy room using a blueprint from 10 years ago. The blueprint doesn't show the boxes, and it can't tell you that the curtain is soft.

The New Way (This Paper's Solution - DCT):
This paper introduces a robot that is smarter and more adaptable. It's like giving the robot a super-intelligent assistant (a Vision-Language Model, or VLM) and a fast reflex system.

Here is how it works, broken down into simple steps:

1. The "Smart Eye" (VLM Point Cloud Partitioner)

Imagine the robot has a camera and a brain that can talk to you.

The Question: The robot sees a box and asks its "brain": "Hey, is this box heavy? Can I push it?"
The Answer: The brain looks at the image and says, "That's a small, empty box. It's light. You can push it. The curtain is soft, so you can brush through it. But that bookshelf is heavy and solid, so avoid it."
The Memory Trick: Since the "brain" is slow to think, the robot doesn't ask it about every single pixel every second. Instead, it asks once, gets the answer, and remembers it. As the robot moves, it projects that memory onto the new view, like a ghostly outline showing which parts of the floor are "safe to touch" and which are "dangerous."

2. The "Fast Reflexes" (VGN Navigation)

Once the robot knows what it can touch, it needs to move fast.

The Problem: Working out how to steer around thousands of individual points (like every point on the surface of a box) by solving the math from scratch each time is too slow to do in real time.
The Solution: The robot uses a trained "muscle memory" network (a Deep Neural Network). Think of this like a professional driver who doesn't calculate the physics of every turn; they just know how to steer. This network was trained to instantly figure out the best path, allowing the robot to move smoothly and quickly without getting stuck in math problems.

3. The "Oops, I Pushed Too Hard" Safety Net

What if the robot pushes a box, and it turns out the box was actually heavy?

The Fix: The robot has a "correction mode." If it tries to push something and gets stuck (or the object doesn't move), it immediately realizes, "Okay, this isn't pushable after all!" It updates its memory, marks that object as a "wall," and quickly backs up to a safe spot to try a different path.

The Real-World Result

The authors tested this on a real robot and in a high-tech simulation:

Scenario A: A robot faced a curtain. The old robots would stop. This robot realized the curtain was light, pushed through it, and kept going.
Scenario B: A robot faced a small box blocking a narrow hallway. Instead of taking a long, winding route around it, the robot gently nudged the box aside and drove straight through.
Scenario C: A robot faced a heavy shelf. It recognized it couldn't move, so it carefully navigated around it.

Why This Matters

This technology changes the game from "Avoid everything at all costs" to "Know what you can touch and what you can't."

It's the difference between a person who refuses to walk through a crowd because they don't want to bump into anyone, versus a person who knows how to gently squeeze past a few people to get to their destination faster. This makes robots much more efficient and useful in our messy, real-world homes and offices.

Here is a detailed technical summary of the paper "Direct Contact-Tolerant Motion Planning With Vision Language Models":

1. Problem Statement

Autonomous robots operating in cluttered environments often face scenarios where strict collision avoidance is impossible or inefficient. Traditional navigation algorithms treat all obstacles as rigid bodies that must be avoided, leading to failure in blocked paths. However, many real-world objects (e.g., curtains, empty boxes) are movable or deformable and can be safely contacted.

The core challenge is Contact-Tolerant Motion Planning (CTMP):

Reasoning: Determining which obstacles can be safely pushed or contacted based on robot capabilities, obstacle properties, and context.
Planning: Generating efficient paths that utilize controlled contact with movable objects while strictly avoiding fixed obstacles.
Limitations of Existing Methods: Current CTMP approaches rely on indirect spatial representations (e.g., prebuilt maps, convex obstacle sets). These introduce inaccuracies, lack adaptability to environmental changes, and struggle with the complex reasoning required to distinguish between movable and fixed objects in real-time.

2. Methodology: The DCT Framework

The authors propose DCT (Direct Contact-Tolerant), a system that integrates Vision-Language Models (VLMs) for direct point-perception and navigation. The system consists of two primary modules:

A. VLM Point Cloud Partitioner (VPP)

The VPP module is responsible for identifying movable obstacles and partitioning the raw LiDAR point cloud into contact-tolerant ($P_{mov}$) and contact-intolerant ($P_{fix}$) sets.
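
As a rough illustration of this partitioning step, the sketch below projects each 3D point into the image with a pinhole camera model and labels it movable only if it lands inside a movable-object mask. The function name `partition_points`, the intrinsics `fx`, `fy`, `cx`, `cy`, and the assumption that the points are already expressed in the camera frame are illustrative choices, not details taken from the paper.

```python
from typing import List, Sequence, Tuple

Point3D = Tuple[float, float, float]  # (x, y, z) in the camera frame, z pointing forward


def partition_points(
    points: Sequence[Point3D],
    movable_mask: Sequence[Sequence[bool]],  # movable_mask[v][u], True = movable pixel
    fx: float, fy: float, cx: float, cy: float,  # assumed pinhole intrinsics
) -> Tuple[List[Point3D], List[Point3D]]:
    """Split points into (p_mov, p_fix) using a 2D movable-object mask."""
    height = len(movable_mask)
    width = len(movable_mask[0]) if height else 0
    p_mov: List[Point3D] = []
    p_fix: List[Point3D] = []
    for x, y, z in points:
        movable = False
        if z > 0.0:  # only points in front of the camera can be projected
            u = int(round(fx * x / z + cx))
            v = int(round(fy * y / z + cy))
            if 0 <= u < width and 0 <= v < height:
                movable = movable_mask[v][u]
        # Anything not positively identified as movable is treated as fixed.
        (p_mov if movable else p_fix).append((x, y, z))
    return p_mov, p_fix


# Toy example: a 4x4 image in which only pixel (u=2, v=1) is marked movable.
mask = [[False] * 4 for _ in range(4)]
mask[1][2] = True
cloud = [(2.0, 1.0, 1.0), (0.0, 0.0, 1.0), (2.0, 1.0, -1.0)]
p_mov, p_fix = partition_points(cloud, mask, fx=1.0, fy=1.0, cx=0.0, cy=0.0)
```

Defaulting every unprojectable or unmasked point to the fixed set is a conservative choice made for this sketch: a point is only ever contacted if something explicitly said it was safe to do so.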

VLM-Driven Filtering: Uses an open-set grounding model to detect objects in RGB images based on language prompts. A task-conditioned VLM then filters these candidates to determine movability (e.g., "Is this box pushable?").
Memory-Driven Mask Propagation: Since VLM inference is too slow for every LiDAR frame, VPP caches the VLM's output (masks, captions, robot pose) in a temporal memory list.
- Viewpoint Warping: When the robot moves, masks are propagated to the current frame using planar homography based on odometry.
- Reconciliation: New detections are matched with propagated masks using Intersection-over-Union (IoU). Unmatched detections update the mask; unmatched propagated masks are discarded for safety.
- 3D Refinement: The projected masks are applied to the LiDAR scan. 3D Euclidean clustering (DBSCAN) is used to remove isolated noise (outlier suppression) and fill gaps in the object (cluster completion), ensuring spatial coherence.
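
The warping and reconciliation steps above can be sketched in a few lines. This is a minimal illustration under several assumptions that are not from the paper: masks are stored as sets of `(u, v)` pixels, the homography `H` is a plain 3x3 nested list, matching is greedy, and the IoU threshold of 0.5 is arbitrary. The names `warp_mask`, `iou`, and `reconcile` are hypothetical.

```python
from typing import List, Sequence, Set, Tuple

Pixel = Tuple[int, int]                 # (u, v) image coordinates
Homography = Sequence[Sequence[float]]  # 3x3 matrix, previous view -> current view


def warp_mask(H: Homography, mask: Set[Pixel]) -> Set[Pixel]:
    """Map every mask pixel through H and round to the nearest pixel."""
    warped: Set[Pixel] = set()
    for u, v in mask:
        x = H[0][0] * u + H[0][1] * v + H[0][2]
        y = H[1][0] * u + H[1][1] * v + H[1][2]
        w = H[2][0] * u + H[2][1] * v + H[2][2]
        if abs(w) > 1e-9:  # skip pixels that map to infinity
            warped.add((round(x / w), round(y / w)))
    return warped


def iou(a: Set[Pixel], b: Set[Pixel]) -> float:
    """Intersection-over-Union of two pixel sets."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def reconcile(
    propagated: List[Set[Pixel]],
    detections: List[Set[Pixel]],
    threshold: float = 0.5,  # assumed value
) -> Tuple[List[Tuple[int, int]], List[int], List[int]]:
    """Greedy one-to-one matching by IoU.

    Returns (matches, unmatched_propagated, unmatched_detections) as indices.
    """
    candidates = sorted(
        ((iou(p, d), i, j)
         for i, p in enumerate(propagated)
         for j, d in enumerate(detections)),
        reverse=True,
    )
    matches: List[Tuple[int, int]] = []
    used_p: Set[int] = set()
    used_d: Set[int] = set()
    for score, i, j in candidates:
        if score < threshold:
            break  # remaining pairs overlap too little to count as a match
        if i in used_p or j in used_d:
            continue
        matches.append((i, j))
        used_p.add(i)
        used_d.add(j)
    unmatched_p = [i for i in range(len(propagated)) if i not in used_p]
    unmatched_d = [j for j in range(len(detections)) if j not in used_d]
    return matches, unmatched_p, unmatched_d


# Toy example: the view shifts one pixel to the right between frames.
H_shift = [[1.0, 0.0, 1.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
old_box = {(0, 0), (1, 0), (0, 1), (1, 1)}
warped_box = warp_mask(H_shift, old_box)
stale = warp_mask(H_shift, {(5, 5)})
new_box = {(1, 0), (2, 0), (1, 1), (2, 1)}
unrelated = {(10, 10)}
matches, unmatched_p, unmatched_d = reconcile([warped_box, stale], [unrelated, new_box])
```

In this toy run the warped box lines up with the new detection, while the stale mask has no partner and would be the one discarded for safety. Forward-warping individual pixels leaves holes when the view scales or rotates, so a real pipeline would more likely warp the whole mask image with an image-processing library.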

B. VPP Guided Navigation (VGN)

The VGN module performs motion planning directly on the partitioned point cloud.

Direct Point Constraints: Unlike methods that approximate obstacles as sets, VGN formulates the problem as a Large-Scale Model Predictive Control (LMPC) problem with thousands of direct distance constraints against $P_{fix}$.
Learned Solver (DNN): Solving LMPC with thousands of constraints in real-time is computationally prohibitive. The authors train a specialized Deep Neural Network (DNN) to imitate the optimization process.
- The DNN takes the robot state, shape, and point cloud as input and predicts the optimal dual variables for the distance constraints in microseconds.
- This converts iterative optimization into real-time feed-forward inference.
Correction Mechanism: If a push fails (e.g., the robot gets stuck or the object doesn't move), the system triggers a "correcting mode." It re-labels the failed obstacle points as non-movable ($P_{fix}$), reverses the robot to a safe state, and replans.
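
To make the relationship between the distance constraints and the correction step concrete, here is a small sketch in which only fixed points count toward the robot's clearance, and a failed push moves an object's points from the movable set to the fixed set. Everything in it is an assumption for illustration: the `PartitionedCloud` class, the per-object grouping of points, the point-robot clearance measure, and the `push_failed` stall test with its `min_ratio` threshold. It does not cover reversing to a safe state or replanning, and it is not the learned solver described above.

```python
import math
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

Point2D = Tuple[float, float]


@dataclass
class PartitionedCloud:
    """Points grouped by object id; only `fixed` objects constrain the planner."""
    movable: Dict[str, List[Point2D]] = field(default_factory=dict)
    fixed: Dict[str, List[Point2D]] = field(default_factory=dict)

    def clearance(self, robot_xy: Point2D) -> float:
        """Distance from the robot centre to the nearest fixed point."""
        dists = [math.dist(robot_xy, p) for pts in self.fixed.values() for p in pts]
        return min(dists, default=math.inf)

    def mark_fixed(self, object_id: str) -> None:
        """Correction step: relabel a failed-push object as fixed."""
        if object_id in self.movable:
            self.fixed.setdefault(object_id, []).extend(self.movable.pop(object_id))


def push_failed(commanded_speed: float, travelled: float, dt: float,
                min_ratio: float = 0.2) -> bool:
    """Hypothetical stall test: the robot covered far less ground than commanded."""
    expected = abs(commanded_speed) * dt
    return expected > 0.0 and travelled < min_ratio * expected


# A box believed to be movable sits 1 m ahead; a shelf is 3 m to the side.
cloud = PartitionedCloud(movable={"box": [(1.0, 0.0)]}, fixed={"shelf": [(0.0, 3.0)]})
before = cloud.clearance((0.0, 0.0))  # the box is ignored, so only the shelf counts

# The robot was commanded 0.5 m/s for 1 s but moved only 1 cm: treat it as stuck.
if push_failed(commanded_speed=0.5, travelled=0.01, dt=1.0):
    cloud.mark_fixed("box")
after = cloud.clearance((0.0, 0.0))   # the box now constrains the robot
```

After the relabel, the same clearance query returns the distance to the box instead of the shelf, which is what forces the next plan to go around it.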

3. Key Contributions

VPP (Real-time Partitioner): A novel module that leverages VLMs for contact-tolerance reasoning and uses memory-based mask propagation to achieve high-frequency, accurate partitioning of point clouds into movable and fixed sets.
VGN (Fast Learned Planner): A control framework that operates directly on contact-partitioned point clouds. It utilizes a DNN to solve complex, large-scale optimization problems in real-time, bypassing the latency of traditional solvers.
Robust Real-World Implementation: The system was implemented and tested in both high-fidelity simulation (Isaac Sim) and on a real car-like robot, demonstrating superior performance in diverse cluttered scenarios.

4. Experimental Results

The authors evaluated DCT against NeuPAN (state-of-the-art direct point navigation) and Ellis22 (a hybrid CTMP approach) across various scenarios:

VLM Performance: An evaluation of different VLMs (GPT-5, Gemini 2.5, etc.) showed that GPT-5 offered the best balance of precision and recall, reaching 100% precision when identifying pushable obstacles.
Navigation Efficiency:
- Movable Obstacles: In scenarios with narrow paths requiring contact, DCT navigated successfully while NeuPAN failed because it treats every obstacle as a hard constraint. DCT was also faster than Ellis22 (e.g., 4.22 s vs. 4.92 s, roughly 14% less time).
- Fixed Obstacles: When obstacles could not be pushed, DCT adapted by avoiding them efficiently, outperforming Ellis22, which relied on conservative map inflation and took much longer (5.72 s vs. 15.42 s).
Mixed Environments: In environments with varying ratios of fixed and movable obstacles, DCT achieved a 100% success rate when at least two movable obstacles were present, with the shortest navigation times and path lengths.
Real-World Deployment: Tests on a physical robot successfully demonstrated navigating through a curtain (deformable) and pushing a small box while avoiding fixed obstacles like chair legs, validating the system's ability to handle arbitrary shapes and dynamic interactions.
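
To put the two timing comparisons quoted above on the same scale, the short calculation below converts each pair into a speedup factor and a percentage of time saved. The helper name `compare` is just for this example; the four timings are the ones reported in the summary.

```python
from typing import Tuple


def compare(dct_s: float, baseline_s: float) -> Tuple[float, float]:
    """Return (speedup factor, percent time saved) of DCT relative to a baseline."""
    return baseline_s / dct_s, 100.0 * (baseline_s - dct_s) / baseline_s


movable = compare(dct_s=4.22, baseline_s=4.92)   # DCT vs. Ellis22, movable obstacles
fixed = compare(dct_s=5.72, baseline_s=15.42)    # DCT vs. Ellis22, fixed obstacles

print(f"movable obstacles: {movable[0]:.2f}x faster, {movable[1]:.1f}% less time")
print(f"fixed obstacles:   {fixed[0]:.2f}x faster, {fixed[1]:.1f}% less time")
# movable obstacles: 1.17x faster, 14.2% less time
# fixed obstacles:   2.70x faster, 62.9% less time
```

On these numbers the gap is modest in the movable-obstacle case and much larger in the fixed-obstacle case, where the baseline's conservative map inflation costs it the most.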

5. Significance

This paper addresses a critical gap in mobile robotics: the inability of current planners to efficiently reason about and interact with movable objects in unstructured environments.

Paradigm Shift: It moves away from indirect, map-based representations toward direct point-perception, significantly improving adaptability to environmental uncertainties.
Safety & Efficiency: By leveraging VLMs for semantic reasoning and DNNs for fast control, the system achieves a high degree of safety (strictly avoiding fixed obstacles) while maximizing efficiency (pushing movable ones).
Generalizability: The approach is robust across different obstacle types (rigid, deformable) and environments, offering a scalable solution for complex real-world navigation tasks.
