Evaluating VLMs' Spatial Reasoning Over Robot Motion: A Step Towards Robot Planning with Motion Preferences

This paper evaluates how well four state-of-the-art Vision-Language Models interpret robot motion preferences from rendered path images. Qwen2.5-VL achieves high accuracy in both zero-shot and fine-tuned settings when enforcing constraints such as object proximity and path style, highlighting the potential of integrating VLMs into robot planning pipelines.

Wenxi Wu, Jingjing Zhang, Martim Brandão

Published 2026-03-16

Imagine you are teaching a robot butler to clean your living room. You don't just want it to pick up a cup; you want it to do so in a specific way. Maybe you want it to take a long, winding path to avoid a fragile vase, or perhaps you want it to sneak close to the sofa to grab a remote without knocking over the lamp.

For a long time, robots have been great at the "how" (moving from point A to point B) but terrible at the "how I want it" (the style, the mood, the specific constraints). They are like a very obedient dog that follows commands but doesn't understand the nuance of "don't step on the cat."

This paper asks a big question: Can modern AI "eyes and brains" (Vision-Language Models) understand these subtle human preferences just by looking at a picture of a robot's path?

Here is the breakdown of their experiment, explained with some everyday analogies.

1. The Setup: The "Robot Dance-Off"

The researchers didn't just ask the AI to guess. They set up a Robot Dance-Off.

  • The Choreography: First, they used a standard robot planner to generate 50 different ways a robot could move from the kitchen to the living room. Some paths were straight, some were zig-zags, some went close to the window, and some stayed far away.
  • The Visuals: They turned these 50 paths into a single image, like a map where every possible route is drawn in a different colored line (red, blue, green, etc.).
  • The Judge: They showed this image to four different "AI Judges" (advanced Vision-Language Models like Qwen2.5-VL and GPT-4o) and gave them a text instruction, like: "Pick the path that goes around the table but stays far away from the painting."

The goal was to see if the AI could look at the map, read the instruction, and point to the correct colored line.
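The single-image setup above can be sketched in a few lines. This is an illustrative toy, not the authors' code: the planner stand-in, color list, and prompt wording are all assumptions. It shows the core loop of the experiment, which is to overlay candidate paths with one color each, ask the model to name a color, and map the free-text answer back to a path index.

```python
import random

# Toy stand-in for a motion planner: each "path" is a list of (x, y) waypoints.
# The real paper samples 50 diverse paths with a standard robot planner;
# this function and its names are illustrative assumptions.
def sample_paths(n_paths=5, n_waypoints=8, seed=0):
    rng = random.Random(seed)
    return [[(i / (n_waypoints - 1), rng.uniform(0.0, 1.0))
             for i in range(n_waypoints)]
            for _ in range(n_paths)]

COLORS = ["red", "blue", "green", "orange", "purple"]

def build_prompt(paths, instruction):
    # Legend mapping each drawn color to a candidate path, mirroring the
    # single-image setup: all paths overlaid on one map, one color each.
    legend = ", ".join(f"{COLORS[i]} = path {i}" for i in range(len(paths)))
    return (f"The image shows {len(paths)} candidate robot paths ({legend}). "
            f"Instruction: {instruction} "
            "Answer with the color of the best path.")

def parse_answer(reply, n_paths):
    # Map the model's free-text reply (e.g. "I would pick the blue path")
    # back to a path index; None if no known color is mentioned.
    reply = reply.lower()
    for i, color in enumerate(COLORS[:n_paths]):
        if color in reply:
            return i
    return None

paths = sample_paths()
prompt = build_prompt(paths, "stay far away from the painting")
print(parse_answer("I would choose the blue path.", len(paths)))  # → 1
```

In the actual experiment the prompt is sent together with the rendered image to a VLM such as Qwen2.5-VL or GPT-4o; the parsing step is what lets the chosen color act as a selection over the planner's candidates.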

2. The Four Ways to Ask the AI

The researchers tried four different ways to present the paths to the AI, similar to how you might show someone a menu:

  1. The "One Big Map" (Single-Image): Show all 50 paths on one picture at once. The AI has to compare them all simultaneously.
  2. The "One-by-One" (Multi-Image): Show the AI one path, ask "Is this good?", then show the next, and so on.
  3. The "Descriptive Guide" (Visual Context): First, ask the AI to write a detailed description of the room and objects, then ask it to pick the path based on that description.
  4. The "Video Gallery" (Screenshot Gallery): Show a strip of images, like a comic book, showing the robot moving along each path step-by-step.
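The four strategies above imply very different query budgets. As a rough sketch, here is how the number of model calls and images scales with N candidate paths; the exact counts (and the K screenshots per path) are illustrative assumptions, not figures from the paper.

```python
# Rough per-query budget for each prompting strategy, for n_paths candidates.
# Counts are illustrative assumptions, not measurements from the paper.
def query_budget(strategy, n_paths, shots_per_path=4):
    if strategy == "single_image":        # one overlaid map, one call
        return {"calls": 1, "images": 1}
    if strategy == "multi_image":         # one image and verdict per path
        return {"calls": n_paths, "images": n_paths}
    if strategy == "visual_context":      # describe the scene, then choose
        return {"calls": 2, "images": 2}
    if strategy == "screenshot_gallery":  # a strip of frames per path
        return {"calls": 1, "images": n_paths * shots_per_path}
    raise ValueError(f"unknown strategy: {strategy}")

for s in ["single_image", "multi_image", "visual_context", "screenshot_gallery"]:
    print(s, query_budget(s, n_paths=50))
```

Even this crude accounting hints at why the single-image approach is attractive: it is the only strategy whose image count stays constant as the number of candidate paths grows.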

The Result: The "One Big Map" approach was the clear winner.

  • Why? Think of it like shopping for shoes. If you look at one shoe, then take it off and look at another, you might forget exactly how the first one felt. But if you lay all the shoes out on the floor, you can instantly compare them and pick the best one. The AI needs to see all the options side-by-side to make a fair comparison.

3. The Winners and Losers

  • The Champion: Qwen2.5-VL (a specific AI model) was the best judge. It got about 71% to 75% of the answers right. It was particularly good at understanding spatial relationships, like "stay away from the lamp."
  • The Runner-Up: GPT-4o (a very famous model) actually did worse than Qwen in this specific task. It's like having a very smart generalist who is great at many things but missed the specific details of this robot dance-off.
  • The Struggle: The AI models were great at understanding proximity (e.g., "stay close to the wall") but struggled more with style (e.g., "make a zig-zag path"). It's easier for an AI to see where something is than to judge the shape of a line.

4. The "Fine-Tuning" Magic Trick

The researchers also tried a little magic trick called Fine-Tuning.
Imagine you have a smart student who knows a lot of general facts but hasn't studied for this specific test. The researchers gave the AI a tiny "crash course" (just 98 examples) on how to pick robot paths.

  • The Result: The AI's performance jumped significantly! One model improved its accuracy by over 60% after just a tiny bit of training.
  • The Metaphor: It's like taking a brilliant chef who knows how to cook everything, giving them a recipe card for "Spicy Tacos," and suddenly they become a world-class taco chef.
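To make the "crash course" concrete, here is a plausible shape for those ~98 training examples: each pairs the overlaid path image and a preference instruction with the correct path color. The field names follow a common chat-style fine-tuning layout and are assumptions, not the authors' exact schema.

```python
import json

# Hypothetical fine-tuning examples: image + instruction -> correct color.
# File names, fields, and the <image:...> placeholder are all assumptions.
examples = [
    {"image": "scene_000_paths.png",
     "instruction": "Pick the path that stays far from the lamp.",
     "answer": "green"},
    {"image": "scene_001_paths.png",
     "instruction": "Pick the zig-zag path around the table.",
     "answer": "red"},
]

def to_chat_record(ex):
    # One user turn (image reference + instruction) and one assistant turn
    # (the target color), the usual shape for supervised chat fine-tuning.
    return {
        "messages": [
            {"role": "user",
             "content": f"<image:{ex['image']}> {ex['instruction']} "
                        "Answer with a single color."},
            {"role": "assistant", "content": ex["answer"]},
        ]
    }

with open("finetune.jsonl", "w") as f:
    for ex in examples:
        f.write(json.dumps(to_chat_record(ex)) + "\n")
```

Because each example is a single short question-answer pair, even a few dozen of them can teach the model the task format, which is consistent with the large jump the paper reports from only 98 examples.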

5. The Cost of Thinking

The paper also looked at how much "brain power" (computing cost) this took.

  • Showing all paths on one image was the most efficient. It used the least amount of data (tokens) but gave the best results.
  • Trying to show the AI many separate images or video strips was like asking it to read a whole library to find one sentence: expensive, and it didn't work as well.

The Bottom Line

This paper shows that we are getting closer to a future where you can talk to your robot like a human. You won't just say "Go to the kitchen." You could say, "Go to the kitchen, but take a slow, winding path so you don't scare the cat, and keep a safe distance from the coffee table."

While the AI isn't perfect yet (it sometimes hallucinates or picks the wrong color), the fact that it can understand these complex spatial preferences just by looking at a map is a huge step forward. It suggests that in the near future, robots won't just be tools that follow orders; they will be assistants that understand our style.
