ViPlan: A Benchmark for Visual Planning with Symbolic Predicates and Vision-Language Models

The paper introduces ViPlan, the first open-source benchmark comparing VLM-grounded symbolic planning against direct VLM planning across Blocksworld and household robotics domains. It finds that the symbolic approach excels when tasks are easy to ground visually, while direct planning wins in scenarios requiring linguistic knowledge, and that Chain-of-Thought prompting offers no consistent benefit for either.

Matteo Merler, Nicola Dainese, Minttu Alakuijala, Giovanni Bonetta, Pietro Ferrazzi, Yu Tian, Bernardo Magnini, Pekka Marttinen

Published 2026-03-04

Imagine you are trying to teach a robot to clean your house or stack blocks. You have two main ways to give it instructions, and a new study called ViPlan just put them head-to-head to see which one actually works better.

Here is the breakdown of the paper in simple terms, using some everyday analogies.

The Two Approaches: The "Intuitive Chef" vs. The "Strict Accountant"

The researchers compared two different ways robots (powered by AI) can plan their actions:

  1. VLM-as-Planner (The Intuitive Chef):

    • How it works: You show the robot a picture of the room and say, "Clean the dishes." The robot looks at the picture and immediately guesses, "Okay, I'll grab the fork, then the plate, then the cup." It makes a plan on the fly, relying on its general knowledge of how the world works.
    • The Analogy: This is like a Chef who has cooked a million meals. They don't need a recipe book; they just look at the ingredients and intuitively know what to do next. They are fast and creative but might miss a tiny detail if they aren't looking closely.
  2. VLM-as-Grounder (The Strict Accountant):

    • How it works: The robot doesn't guess the whole plan. Instead, it acts like a translator. It looks at the picture and answers very specific "Yes/No" questions posed by a strict logic engine (a symbolic planner).
      • Question: "Is the bowl on the table?" -> Answer: "Yes."
      • Question: "Is the bowl reachable?" -> Answer: "Yes."
      • Once it confirms all the facts, the "Accountant" (the logic engine) calculates the perfect, step-by-step plan.
    • The Analogy: This is like a Strict Accountant who refuses to move a single dollar until they have verified every single receipt. They are incredibly precise and won't make logical errors, but they are slow and can get confused if the receipts are blurry or if they can't see everything.
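The two loops above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: `vlm` is a stubbed-out stand-in for a vision-language model call, and `symbolic_plan` stands in for a classical (e.g. PDDL-based) planner.

```python
def vlm(prompt, image):
    """Stub VLM: answers predicate queries, or proposes a plan (mocked here)."""
    answers = {
        "Is the bowl on the table?": "yes",
        "Is the bowl reachable?": "yes",
    }
    if prompt in answers:
        return answers[prompt]
    return "pick up bowl; wash bowl; place bowl in rack"

def plan_with_vlm(goal, image):
    """VLM-as-Planner (the Chef): the model guesses the whole action sequence."""
    raw = vlm(f"Plan to achieve: {goal}", image)
    return [step.strip() for step in raw.split(";")]

def symbolic_plan(facts, goal):
    """Stand-in for a classical planner: it only acts on verified facts."""
    if all(facts.values()):
        return ["pick up bowl", "wash bowl", "place bowl in rack"]
    return []  # an unverified or false fact means no valid plan

def plan_with_grounder(goal, image, predicates):
    """VLM-as-Grounder (the Accountant): the model only answers yes/no
    predicate queries; the symbolic planner builds the plan from the facts."""
    facts = {p: vlm(p, image) == "yes" for p in predicates}
    return symbolic_plan(facts, goal)

preds = ["Is the bowl on the table?", "Is the bowl reachable?"]
print(plan_with_vlm("clean the dishes", image=None))
print(plan_with_grounder("clean the dishes", image=None, predicates=preds))
```

Note the division of labor: the grounder's VLM never decides *what to do*, only *what is true*, which is exactly why it is precise in clean worlds and brittle when a fact cannot be checked.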

The Test: Two Different Kitchens

The researchers created two different "kitchens" (environments) to test these robots:

  1. The Block Tower (ViPlan-Blocksworld):

    • The Setup: A simple world with colored blocks stacked in columns. It's very clear, organized, and everything is visible.
    • The Winner: The Strict Accountant (VLM-as-Grounder) won easily.
    • Why? In this clean world, the "Yes/No" questions are easy to answer. The Accountant's precision shines because there are no hidden tricks. The Intuitive Chef got confused because they tried to guess too much without checking the facts first.
    • The Score: The Accountant solved 46% of the tasks, while the Chef only managed 9%.
  2. The Messy House (ViPlan-Household):

    • The Setup: A simulated robot arm in a messy house. There are drawers, cabinets, and objects hidden behind others. You can't see everything at once.
    • The Winner: The Intuitive Chef (VLM-as-Planner) won by a landslide.
    • Why? In a messy house, you can't see everything. The Accountant gets stuck because it can't verify every single fact (e.g., "Is the spoon inside the closed drawer?"). It keeps asking questions it can't answer and freezes. The Chef, however, uses its "common sense" (linguistic knowledge) to guess, "Okay, the spoon is probably in the drawer, so I'll open it." It fills in the gaps with plausible guesses.
    • The Score: The Chef solved 34% of the tasks, while the Accountant only managed 5%.
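The partial-observability failure mode is easy to see if we let the stub VLM return a third answer, "unknown", for occluded objects. Again a hypothetical sketch, not ViPlan's code: the question strings, the three-valued answers, and both planning functions are illustrative.

```python
def vlm_answer(question):
    """Stub VLM: it cannot verify facts about closed containers."""
    answers = {
        "Is the spoon on the counter?": "no",
        "Is the spoon inside the closed drawer?": "unknown",  # occluded!
    }
    return answers.get(question, "unknown")

def grounder_plan(predicates):
    """Strict grounding (the Accountant): abort if any fact is unverifiable."""
    facts = {}
    for question in predicates:
        answer = vlm_answer(question)
        if answer == "unknown":
            return None  # the symbolic planner freezes without verified facts
        facts[question] = (answer == "yes")
    return ["plan built from verified facts"]

def intuitive_plan():
    """VLM-as-Planner (the Chef): fills the gap with linguistic common sense."""
    return ["open drawer", "take spoon"]  # a plausible guess, not a proof

preds = ["Is the spoon on the counter?", "Is the spoon inside the closed drawer?"]
print(grounder_plan(preds))   # the Accountant gives up
print(intuitive_plan())       # the Chef acts on a guess
```

The asymmetry in the household scores comes down to this one branch: the Accountant returns nothing rather than risk being wrong, while the Chef acts on a guess that is usually right.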

The "Thinking Aloud" Experiment (Chain-of-Thought)

The researchers also tried a popular trick called Chain-of-Thought (CoT), where they asked the AI to "think out loud" before answering.

  • The Result: It didn't help much. In fact, for the "Intuitive Chef," it often made things worse.
  • The Analogy: It's like asking a nervous student to explain every single step of their math homework before writing the answer. Instead of solving the problem, they get stuck in a loop of overthinking, run out of time (or "token budget"), and fail to finish. The study found that current AI models aren't great at "thinking" their way through complex visual puzzles; they just get confused by their own words.

The Big Takeaway

This paper teaches us that there is no single "best" robot brain.

  • If you are in a clean, predictable, and fully visible world (like a factory assembly line), you want the Strict Accountant. You need someone who checks every fact to ensure nothing goes wrong.
  • If you are in a messy, unpredictable, and partially hidden world (like a real home), you want the Intuitive Chef. You need someone who can use common sense to guess what's happening even when they can't see it all.

The paper concludes that we need to stop trying to force one method to do everything. Instead, we need to build systems that know when to be a precise accountant and when to be a creative chef. Until then, our robots will keep struggling to clean up our messy houses!
