TiPToP: A Modular Open-Vocabulary Planning System for Robotic Manipulation
TiPToP is a modular, open-vocabulary planning system for robotic manipulation. It integrates pretrained vision foundation models with a Task and Motion Planner to solve multi-step manipulation tasks from RGB images and natural-language instructions, without requiring any robot-specific training data. It achieves performance comparable to or better than fine-tuned vision-language-action models, and it also enables detailed failure mode analysis.
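
To illustrate how a modular pipeline can make failures attributable to a specific component, here is a minimal Python sketch. It is not TiPToP's code: the stage names (`perception`, `goal_grounding`, `planning`), the `StageError` and `Outcome` types, and the toy stage functions are all hypothetical stand-ins chosen for illustration, and the real system's modules and interfaces may differ.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Outcome:
    success: bool
    failed_stage: Optional[str] = None  # name of the stage that failed, if any
    detail: str = ""


class StageError(Exception):
    """Raised by a stage that cannot produce an output (hypothetical type)."""


# A stage is a (name, function) pair; each function maps the previous
# stage's output to the next stage's input.
Stage = Tuple[str, Callable[[object], object]]


def run_pipeline(stages: List[Stage], inputs: object) -> Outcome:
    """Run the stages in order and record which one failed, if any."""
    data = inputs
    for name, fn in stages:
        try:
            data = fn(data)
        except StageError as err:
            return Outcome(False, failed_stage=name, detail=str(err))
    return Outcome(True)


# Toy stand-ins for the real modules (assumptions, not the actual models).
def perceive(inputs):
    image, instruction = inputs
    # Pretend detector: the "image" is just a list of object names.
    return {"objects": set(image), "instruction": instruction}


def ground_goal(scene):
    # Pretend language grounding: the target is the instruction's last word.
    target = scene["instruction"].split()[-1]
    if target not in scene["objects"]:
        raise StageError(f"no detected object matches '{target}'")
    return {"scene": scene, "target": target}


def plan(problem):
    # Pretend planner: a fixed two-step plan for the grounded target.
    return [("move_to", problem["target"]), ("grasp", problem["target"])]


STAGES = [
    ("perception", perceive),
    ("goal_grounding", ground_goal),
    ("planning", plan),
]

ok = run_pipeline(STAGES, (["mug", "banana"], "pick up the mug"))
bad = run_pipeline(STAGES, (["mug", "banana"], "pick up the spoon"))
print(ok)
print(bad)
```

Because each stage has a named boundary, a failed run reports where it stopped (here, `goal_grounding` for the second call), which is the kind of per-module attribution that an end-to-end policy does not expose directly.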