LRR-Bench: Left, Right or Rotate? Vision-Language models Still Struggle With Spatial Understanding Tasks

This paper introduces LRR-Bench, a synthetic benchmark for evaluating Vision-Language Models on absolute and 3D spatial understanding tasks, revealing that current models significantly lag behind human performance and often fail completely on complex spatial reasoning challenges.

Fei Kong, Jinhao Duan, Kaidi Xu, Zhenhua Guo, Xiaofeng Zhu, Xiaoshuang Shi

Published 2026-02-24

Imagine you have a very smart robot friend who can read books, write poems, and describe pictures in beautiful detail. You might think, "If this robot can describe a sunset so well, it must know exactly where everything is in that picture!"

This paper, titled LRR-Bench, is like a reality check for those robots. The authors built a special test to see if these "Vision-Language Models" (VLMs) can actually understand space—like knowing what's on the left, what's on the right, or if something is moving toward you.

Here is the breakdown of their findings using some everyday analogies:

1. The Test: A "Spatial Gym" for Robots

The researchers created a gym with different types of exercises (tasks) to test the robots' spatial muscles. They split the tests into two main categories:

  • The "Static" Room (Absolute Position): Imagine looking at a photo of a living room. The robot is asked, "Is the cat on the left or the right?" or "Is the lamp in the top corner?"
    • The Twist: They made it harder by putting three different photos into one big puzzle and asking the robot to track where objects are in each tiny piece.
  • The "Moving" Room (3D & Motion): Imagine watching a video clip. The robot is asked, "Is the camera spinning?" or "Is that sheep walking forward, or is it just standing still while the camera moves?"
    • The Twist: They used a video game (Minecraft) to create these moving scenes because it's cheap and perfect for controlling exactly how things move.

Why use a video game? It's like a "clean room" for experiments. In the real world, photos can be messy. In the game, the researchers know exactly where every object is, so they can be 100% sure if the robot is right or wrong. This also means the robots couldn't have "cheated" by memorizing the answers from their training data.
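This "clean room" idea can be made concrete: because the scene is generated programmatically, the correct answer falls out of the object's known coordinates, with no human labeling. Here is a minimal sketch of that pattern (the function names and details are illustrative, not taken from the paper):

```python
import random

def make_position_question(image_width=640, rng=None):
    """Place an object at a known x-coordinate and derive the
    ground-truth answer from that coordinate -- no ambiguity."""
    rng = rng or random.Random(0)
    x = rng.randint(0, image_width - 1)
    answer = "left" if x < image_width / 2 else "right"
    question = "Is the object on the left or the right side of the image?"
    return {"object_x": x, "question": question, "answer": answer}

def score(model_answer, item):
    """Exact-match scoring is trivial because the ground truth is synthetic."""
    return model_answer.strip().lower() == item["answer"]

item = make_position_question()
print(item["question"], "->", item["answer"])
```

Because the answer is computed rather than annotated, the benchmark can scale to thousands of items with zero labeling noise.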

2. The Results: The "Smart" Robot is Lost in Space

The results were shocking. While humans scored near-perfectly (like getting an A+ on every test), the most advanced AI models struggled badly.

  • The "Simple" Wins: The robots were okay at the easiest tasks, like saying "The ball is in the center." It's like a toddler pointing at a cookie.
  • The "Complex" Failures: As soon as the test involved movement or rotation, the robots got confused.
    • Analogy: Imagine you are in a car turning left. You know the scenery is moving right. But if you asked the robot, "Is the car moving left or is the world moving right?" the robot often got it wrong. It couldn't tell the difference between the camera moving and the object moving.
    • The "Zero" Score: On some of the hardest 3D tasks, the best robots scored almost zero. They were essentially guessing, like a student who didn't study and is just picking "C" on a multiple-choice test.

3. The "Thinking" Trap

The researchers tried a common trick: they told the robots, "Don't just guess; think step-by-step before you answer" (this is called Chain-of-Thought).

  • The Surprise: For simple tasks, thinking helped a little. But for the hard spatial tasks, thinking made it worse!
  • Analogy: It's like asking a confused person to explain why they are confused. Instead of solving the problem, they start making up stories (hallucinations) to justify their wrong answer. The more the robot tried to "reason" about the movement, the more it invented fake physics.
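In practice, Chain-of-Thought is just a change to the prompt wording. Here is a minimal sketch of the two prompting styles being compared; the exact phrasing is illustrative, not the paper's actual prompts:

```python
def direct_prompt(question):
    """Baseline: ask for the answer only."""
    return f"{question}\nAnswer with a single word."

def cot_prompt(question):
    """Chain-of-Thought: ask the model to reason before answering."""
    return (
        f"{question}\n"
        "Think step by step about the positions and motion in the scene, "
        "then give your final answer on the last line."
    )

q = "Is the camera rotating left, rotating right, or static?"
print(direct_prompt(q))
print(cot_prompt(q))
```

The paper's finding is that the second style, which usually helps on math and logic tasks, can actively hurt here: each intermediate "reasoning" step gives the model another chance to hallucinate motion that is not in the video.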

4. Bigger Isn't Always Better

Usually, in AI, if you make the model bigger (more "brain power"), it gets smarter.

  • The Reality: In this spatial test, making the models bigger didn't help much. Even the giant models (with billions of parameters) failed at understanding 3D space.
  • The "Special Training" Myth: Some models were specifically trained on 3D data to be "experts" in space. Surprisingly, these "experts" didn't do much better than the regular models. It's like taking a driving school course but still not knowing how to parallel park.

The Big Takeaway

This paper tells us that while AI is amazing at describing what it sees (like a poet), it is still terrible at understanding the physical world (like a pilot or a surgeon).

  • Current State: AI can tell you "There is a dog."
  • Missing Skill: AI struggles to know "The dog is running towards me, and if I turn left, the dog will be on my right."

The authors conclude that until we fix this, we can't fully trust these robots for things like self-driving cars or robotic arms that need to navigate the real world without crashing. The robots are smart, but they are currently "spatially blind."
