TiPToP: A Modular Open-Vocabulary Planning System for Robotic Manipulation

Imagine you want to teach a robot to make a sandwich, but you don't want to spend years teaching it exactly how to hold a knife or where the bread is. You just want to say, "Make me a sandwich," and have it figure it out.

For a long time, robotics researchers have tried two main ways to do this:

The "Giant Brain" Approach (Vision-Language-Action models, or VLAs): Train a massive AI on thousands of hours of video of robots making sandwiches. It learns largely by imitating the patterns in that data. It's great at what it's seen, but if you ask it to make a sandwich with a weird new ingredient or in a messy kitchen, it might get confused because it's guessing based on those patterns.
The "Strict Architect" Approach (Task and Motion Planning, or TAMP): Give the robot a set of rigid rules and a map. It's very logical but needs you to tell it exactly where every object is and how big it is. If you move a chair, the robot's plan falls apart because its map is now wrong.

TiPToP is a new system that tries to get the best of both worlds. Think of it as a Robot Project Manager who hires a team of specialized experts to get the job done.

The TiPToP Team: A Three-Part Crew

Instead of one giant brain trying to do everything, TiPToP breaks the job down into three distinct roles, working together like a well-oiled machine:

1. The Eyes and Translator (Perception Module)

What it does: The robot looks at the scene with its cameras. It doesn't just see "blobs"; it uses a super-smart AI (called a Vision Foundation Model) to say, "That's a banana," "That's a red block," and "That's a soda can blocking the banana."
The Analogy: Imagine a detective walking into a messy room. They don't just see a pile of stuff; they identify every item, draw a 3D map of where everything is, and figure out how to grab them without knocking anything over. They also listen to your instruction ("Put the banana in the box") and translate it into a clear to-do list.

2. The Strategist (Planning Module)

What it does: Once the detective has the map and the to-do list, the Strategist (using a tool called cuTAMP) figures out the steps. It asks: "If I grab the banana, will I hit the soda can? Oh, I need to move the soda can first." It evaluates thousands of candidate moves in parallel on a GPU to find a collision-free path that gets the job done.
The Analogy: This is the Chess Grandmaster. Before making a move, they think ten steps ahead. They don't just grab the banana; they realize, "Wait, the path is blocked. I need to move the can, then grab the banana, then put it in the box." It calculates the perfect route so the robot doesn't bump into things.

3. The Muscle (Execution Module)

What it does: This part takes the perfect plan and tells the robot's arms exactly how to move. It's like a dance instructor telling the robot's joints exactly where to go, how fast, and when to squeeze the gripper.
The Analogy: This is the Olympic Gymnast. They have the choreography (the plan) and they execute it with precision. They follow the steps exactly as the Strategist designed them.

Why is TiPToP Special?

1. It Needs No "Schooling" (Zero Training Data)
Most modern robot AIs need to be "fed" thousands of hours of video to learn how to move. TiPToP is different. It comes "out of the box" ready to work. It uses pre-trained AI models (like the ones that power your phone's photo app) to see and understand the world immediately. You don't need to show it 350 hours of videos of robots packing boxes; it just figures it out on the spot.

2. It's Modular (Like Lego)
If the "Eyes" get better next year, you can swap them out without breaking the rest of the robot. If the "Strategist" gets an upgrade, you just plug that in. This makes it easy to fix. If the robot fails, you know exactly which part of the team messed up (e.g., "The Eyes misidentified the object," or "The Strategist couldn't find a path").

3. It's Fast and Logical
In tests, TiPToP was often faster than the "Giant Brain" models. Why? Because the Giant Brain often tries, fails, tries again, and gets stuck in a loop. TiPToP plans the whole thing first, then executes it in one smooth motion. It's like the difference between someone guessing their way through a maze versus someone who has already drawn the map and walks straight to the exit.

The Results: How Did It Do?

The researchers tested TiPToP against a state-of-the-art AI (called $\pi_{0.5}$) that had been trained on 350 hours of robot videos.

Simple Tasks: They were about equal.
Hard Tasks (Messy rooms, tricky instructions): TiPToP won. It was much better at ignoring distractions (like a toy in the way) and understanding complex instructions (like "put the red block on the pile of the same color").
The Weakness: TiPToP is a bit rigid. If the robot drops an item or the world changes while it's moving, it doesn't react as well as the "Giant Brain" models, which can adjust on the fly. But for most planned tasks, TiPToP is incredibly reliable.

The Bottom Line

TiPToP is like giving a robot a brain, a map, and a pair of hands that work as a team. It shows that you don't need to train a robot on every single task in the world. Instead, if you give it the right tools to see, think, and plan, it can handle a wide range of requests right from the start, as long as the scene doesn't change while it's moving.

It's the difference between teaching a parrot to repeat a phrase versus teaching a human to understand the meaning and solve the problem. TiPToP is teaching the robot to think about what it's doing.

Technical Summary: "TiPToP: A Modular Open-Vocabulary Planning System for Robotic Manipulation"

1. Problem Statement

The paper addresses the challenge of building robotic manipulation systems that can execute complex, multi-step tasks specified by natural language instructions and RGB camera images without requiring task-specific training data or extensive robot demonstrations.

Current approaches face a dichotomy:

Vision-Language-Action (VLA) models (e.g., $\pi_0$, OpenVLA) offer end-to-end learning from pixels to actions but require massive amounts of embodiment-specific training data, lack interpretability, and struggle with long-horizon reasoning or complex geometric constraints.
Traditional Task and Motion Planning (TAMP) offers structured reasoning over discrete actions and continuous geometry, but it typically relies on hand-crafted world models and known object geometries, and it is difficult to adapt to new robots or open-vocabulary objects.

The goal is to create a system that combines the generalization of foundation models with the structured reasoning of planning, achieving "out-of-the-box" deployment on arbitrary robots with minimal setup.

2. Methodology: TiPToP Architecture

TiPToP (TiPToP is a Planner That just works on Pixels) is a modular, open-vocabulary system that operates in three distinct stages. It takes a stereo RGB image pair and a natural language instruction as input and outputs a timed robot trajectory.

A. Perception Module

This module constructs an object-centric 3D scene representation from a single stereo image pair captured at $t=0$. It runs two parallel branches:

3D Vision Branch:
- Depth Estimation: Uses FoundationStereo to predict dense depth maps, handling transparent and specular surfaces better than proprietary stereo matching.
- 3D Reconstruction: Unprojects depth to a point cloud and transforms it to the world frame using forward kinematics.
- Grasp Generation: Uses M2T2 (a foundation model) to predict 6-DoF grasp poses for the entire scene point cloud.
Semantic Branch:
- Object Detection & Grounding: Queries Gemini Robotics-ER 1.5 (a Vision-Language Model) to identify objects, generate bounding boxes, and translate the natural language instruction into a symbolic goal $G$ expressed as logical predicates (e.g., On(cracker, tray)).
- Segmentation: Uses SAM-2 to generate pixel-level masks for detected objects.
Integration:
- Combines geometry and semantics to create per-object meshes (using convex hulls of segmented point clouds) and assigns grasp candidates to specific objects.
- Detects the table surface via RANSAC.
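
The geometric steps above (unprojecting depth, moving points into the world frame, and finding the table by RANSAC) can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names, the pinhole intrinsics (fx, fy, cx, cy), the 1 cm inlier threshold, and the synthetic scene are all assumptions chosen for illustration, and the camera-to-world transform that TiPToP obtains from forward kinematics is simply passed in as a 4x4 matrix.

```python
import numpy as np

def unproject_depth(depth, fx, fy, cx, cy):
    """Back-project an H x W depth map (metres) into an (H*W, 3) camera-frame cloud."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # pixel column / row indices
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return np.stack([x, y, depth], axis=-1).reshape(-1, 3)

def to_world(points_cam, T_world_cam):
    """Apply a 4x4 homogeneous camera-to-world transform to N x 3 points."""
    return points_cam @ T_world_cam[:3, :3].T + T_world_cam[:3, 3]

def ransac_plane(points, n_iters=200, threshold=0.01, seed=0):
    """Fit a plane n.p + d = 0 by RANSAC; returns (unit normal, d, inlier mask)."""
    rng = np.random.default_rng(seed)
    best_mask = np.zeros(len(points), dtype=bool)
    best_plane = (np.array([0.0, 0.0, 1.0]), 0.0)
    for _ in range(n_iters):
        p0, p1, p2 = points[rng.choice(len(points), size=3, replace=False)]
        normal = np.cross(p1 - p0, p2 - p0)
        norm = np.linalg.norm(normal)
        if norm < 1e-9:  # collinear sample, no plane defined
            continue
        normal = normal / norm
        d = -normal @ p0
        mask = np.abs(points @ normal + d) < threshold
        if mask.sum() > best_mask.sum():  # keep the plane with the most inliers
            best_mask, best_plane = mask, (normal, d)
    return best_plane[0], best_plane[1], best_mask

# Synthetic scene (hypothetical): a flat surface 1.0 m from the camera
# with a box-shaped patch that is 10 cm closer.
depth = np.ones((40, 40))
depth[10:20, 10:20] = 0.9
cloud = to_world(unproject_depth(depth, 50.0, 50.0, 20.0, 20.0), np.eye(4))
normal, offset, table_mask = ransac_plane(cloud)
object_points = cloud[~table_mask]  # everything that is not on the table plane
```

In this toy scene the points left over after removing the plane inliers are exactly the box patch, which is the kind of per-object cloud a segmentation mask would then carve up further.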

B. Planning Module

The core planner uses cuTAMP, a GPU-parallelized Task and Motion Planning algorithm.

Skeleton Enumeration: Given the symbolic goal $G$, a PDDL-style planner enumerates candidate "plan skeletons" (sequences of symbolic actions like Pick, Place, Move). Crucially, it generates skeletons that include obstacle removal (e.g., moving a soda can out of the way to reach a cracker).
Particle Initialization & Optimization: For each skeleton, cuTAMP samples continuous parameters (grasp poses, placement poses, robot configurations). It then performs differentiable optimization on a large batch of "particles" simultaneously to satisfy collision avoidance, stability, and kinematic constraints.
Motion Planning: Once a feasible skeleton and parameters are found, cuRobo (GPU-accelerated motion planner) generates the final time-parameterized, collision-free trajectory.
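
To make the first two planning stages concrete, here is a toy sketch of skeleton enumeration followed by batched particle optimization. It is only an illustration under stated assumptions, not cuTAMP: it runs on the CPU with NumPy and hand-written gradients, the helper names (enumerate_skeletons, optimize_placements) are hypothetical, and the only continuous parameter is a 2D placement position that must stay on a square region while keeping clear of one obstacle.

```python
from itertools import permutations
import numpy as np

def enumerate_skeletons(target, destination, obstacles, max_removed=1):
    """Yield plan skeletons, shortest first: clear 0..max_removed obstacles, then move the target."""
    goal_steps = [("Pick", target), ("Place", target, destination)]
    for k in range(max_removed + 1):
        for removed in permutations(obstacles, k):
            prefix = []
            for obj in removed:
                prefix += [("Pick", obj), ("Place", obj, "table")]
            yield prefix + goal_steps

def optimize_placements(obstacle_xy, clearance, half_extent,
                        n_particles=256, steps=100, lr=0.2, seed=0):
    """Batched gradient descent over candidate (x, y) placements.

    Cost = squared penetration into the obstacle's clearance disk
         + squared overshoot beyond a square region centred at the origin.
    """
    rng = np.random.default_rng(seed)
    p = rng.uniform(-1.5 * half_extent, 1.5 * half_extent, size=(n_particles, 2))

    def terms(p):
        diff = p - obstacle_xy
        dist = np.linalg.norm(diff, axis=1, keepdims=True) + 1e-9
        penetration = np.maximum(0.0, clearance - dist)       # > 0 when too close
        overshoot = np.maximum(0.0, np.abs(p) - half_extent)  # > 0 when off the region
        return diff, dist, penetration, overshoot

    for _ in range(steps):
        diff, dist, penetration, overshoot = terms(p)
        # Analytic gradient of the cost, evaluated for every particle at once.
        grad = -2.0 * penetration * diff / dist + 2.0 * overshoot * np.sign(p)
        p = p - lr * grad

    _, _, penetration, overshoot = terms(p)
    cost = penetration[:, 0] ** 2 + (overshoot ** 2).sum(axis=1)
    return p, cost

# Goal On(cracker, tray) with a soda can in the way (names are illustrative).
skeletons = list(enumerate_skeletons("cracker", "tray", ["soda_can"]))
placements, cost = optimize_placements(np.array([0.0, 0.0]), clearance=0.05, half_extent=0.15)
best = placements[np.argmin(cost)]  # one satisfying placement for the Place step
```

Optimizing many particles in one array operation is the design point being illustrated: a batch that starts from random samples gives many chances to land in a feasible region, and the same update maps naturally onto a GPU.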

C. Execution Module

The system executes the planned trajectory open-loop (no visual feedback during execution).
It uses a custom joint-space impedance controller (implemented for Franka arms) to track the trajectory with high precision.
Note: The system does not replan based on execution-time observations, relying on the accuracy of the initial plan and trajectory tracking.
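
As a rough illustration of what tracking a timed trajectory with a joint-space impedance law involves, the sketch below applies a spring-damper torque toward each waypoint. It is a minimal sketch under strong assumptions, not the authors' controller: the robot is modelled as seven decoupled unit-inertia joints with no gravity or Coriolis terms, and the gains, time step, and trajectory are made up for the example.

```python
import numpy as np

def impedance_torque(q, qd, q_des, qd_des, stiffness, damping):
    """Spring-damper law pulling joint position and velocity toward the reference."""
    return stiffness * (q_des - q) + damping * (qd_des - qd)

def track_open_loop(waypoints, dt, stiffness, damping, inertia=1.0):
    """Replay a timed joint trajectory on a toy model of decoupled joints.

    The waypoints are fixed before the loop starts; no camera feedback is used.
    """
    q = waypoints[0].copy()
    qd = np.zeros_like(q)
    qd_des_all = np.gradient(waypoints, dt, axis=0)  # finite-difference reference velocity
    for q_des, qd_des in zip(waypoints, qd_des_all):
        tau = impedance_torque(q, qd, q_des, qd_des, stiffness, damping)
        qd = qd + (tau / inertia) * dt  # semi-implicit Euler integration
        q = q + qd * dt
    return q

# Hypothetical 2-second, 7-joint trajectory with a smooth start and stop.
n_steps = 2000
dt = 2.0 / n_steps
s = np.linspace(0.0, 1.0, n_steps + 1)[:, None]
blend = 3 * s**2 - 2 * s**3
q_start = np.zeros(7)
q_goal = np.array([0.5, -0.3, 0.2, -1.0, 0.1, 0.8, 0.0])
waypoints = q_start + blend * (q_goal - q_start)

q_final = track_open_loop(waypoints, dt, stiffness=400.0, damping=40.0)
tracking_error = np.max(np.abs(q_final - q_goal))
```

Because the loop never looks at the scene, anything that moves the object after planning (a slip, a bump) goes unnoticed, which is the limitation the note above describes.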

3. Key Contributions

Zero-Data Modular System: TiPToP requires zero robot training data. It leverages pretrained foundation models (Vision, Language, Grasp) and a GPU-accelerated planner, allowing deployment on new robots in under an hour with only camera calibration and URDF updates.
Open-Vocabulary & Semantic Reasoning: By integrating a VLM (Gemini), the system can interpret complex, open-vocabulary instructions (e.g., "serve the peanut butter crackers," "pick the largest toy") and ground them to specific objects in the scene, outperforming VLAs on semantic tasks.
Long-Horizon & Geometric Reasoning: The TAMP component explicitly reasons about geometric constraints and multi-step sequences (e.g., clearing obstacles), enabling success in cluttered environments where end-to-end models often fail.
Extensibility: The modular architecture allows components to be swapped or upgraded independently. The authors demonstrated this by extending the system to support a wiping primitive (cleaning a whiteboard) in under a day without modifying perception or execution infrastructure.
Open-Source Release: The code is released to facilitate research into modular manipulation and the integration of learning with planning.

4. Experimental Results

The authors evaluated TiPToP against $\pi_{0.5}$-DROID (a state-of-the-art VLA fine-tuned on 350 hours of robot data) across 28 scenes (165 trials) in simulation and on real hardware (DROID, UR5e, WidowX).

Performance Comparison:
- Overall Success Rate: TiPToP achieved 74.6% success compared to 52.4% for $\pi_{0.5}$-DROID.
- Task Categories:
  - Simple Tasks: Comparable performance.
  - Distractor Tasks: TiPToP significantly outperformed (60% vs. 26.7%), correctly identifying relevant objects amidst clutter.
  - Semantic Tasks: TiPToP excelled (71.3% vs. 46.8%), successfully interpreting complex referring expressions (e.g., "largest toy," "red A").
  - Multi-step Tasks: TiPToP showed a large advantage (75.2% vs. 52.2%), particularly in tasks requiring obstacle removal or constrained packing.
Efficiency: TiPToP was generally faster (e.g., ~15s vs. ~32s for simple tasks) because it plans a single optimal trajectory upfront, whereas the reactive VLA often idles or retries grasps.
Failure Analysis:
- TiPToP Failures: Primarily due to grasping failures (31/55) caused by imperfect mesh approximations (convex hulls) or small objects, and scene completion errors. The lack of closed-loop reactivity (open-loop execution) prevents recovery from slips.
- $\pi_{0.5}$ Failures: Primarily due to inability to reason about multi-step structures, distractor rejection, and complex semantic grounding.
Cross-Embodiment: Successfully deployed on UR5e and WidowX AI with minimal adaptation time (2–3 hours).
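
For readers who want the gaps spelled out, the short script below does the arithmetic on the figures quoted above: the difference in success rate per category in percentage points, and the share of TiPToP failures attributed to grasping. It uses only numbers stated in this summary; the dictionary layout and variable names are just one convenient way to organize them.

```python
# Success rates in percent, as quoted in this summary: (TiPToP, pi_0.5-DROID).
success = {
    "overall": (74.6, 52.4),
    "distractor": (60.0, 26.7),
    "semantic": (71.3, 46.8),
    "multi_step": (75.2, 52.2),
}

# Gap in percentage points for each category.
gap_points = {name: round(ours - theirs, 1) for name, (ours, theirs) in success.items()}

# Share of TiPToP failures attributed to grasping (31 of 55).
grasp_share = round(100 * 31 / 55, 1)

for name, gap in gap_points.items():
    print(f"{name}: +{gap} percentage points")
print(f"grasping failures: {grasp_share}% of TiPToP failures")
```

By this calculation the largest gap is on distractor tasks, and a little over half of TiPToP's failures trace back to grasping.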

5. Significance and Future Directions

Paradigm Shift: The paper demonstrates that a modular system combining off-the-shelf foundation models with classical planning can outperform massive, end-to-end trained VLA models on complex, generalizable manipulation tasks without any robot-specific training.
Debuggability: The modular nature allows for precise failure tracing (e.g., distinguishing between a perception error and a planning error), which is difficult in black-box VLA models.
Complementarity: The results suggest a future direction where planning provides the high-level structure and semantic grounding, while learned policies (VLAs) serve as reactive skill primitives to handle low-level uncertainties (e.g., grasp recovery, cable manipulation).
Limitations: The current system is limited by open-loop execution, which rules out recovery from slips, and by single-viewpoint perception, which leaves object geometry incomplete and forces coarse convex-hull approximations. Future work aims to integrate belief-space planning and learned reactive skills to address these gaps.

In summary, TiPToP represents a significant step toward generalist robotic manipulation, showing that structured planning combined with modern foundation models offers a robust, data-efficient, and highly adaptable alternative to purely end-to-end learning approaches.