LRR-Bench: Left, Right or Rotate? Vision-Language models Still Struggle With Spatial Understanding Tasks

This paper introduces LRR-Bench, a synthetic benchmark for evaluating Vision-Language Models on absolute and 3D spatial understanding tasks, revealing that current models significantly lag behind human performance and often fail completely on complex spatial reasoning challenges.

Fei Kong, Jinhao Duan, Kaidi Xu, Zhenhua Guo, Xiaofeng Zhu, Xiaoshuang Shi

Published 2026-02-24

Imagine you have a very smart robot friend who can read books, write poems, and describe pictures in beautiful detail. You might think, "If this robot can describe a sunset so well, it must know exactly where everything is in that picture!"

This paper, titled LRR-Bench, is like a reality check for those robots. The authors built a special test to see if these "Vision-Language Models" (VLMs) can actually understand space—like knowing what's on the left, what's on the right, or if something is moving toward you.

Here is the breakdown of their findings using some everyday analogies:

1. The Test: A "Spatial Gym" for Robots

The researchers created a gym with different types of exercises (tasks) to test the robots' spatial muscles. They split the tests into two main categories:

  • The "Static" Room (Absolute Position): Imagine looking at a photo of a living room. The robot is asked, "Is the cat on the left or the right?" or "Is the lamp in the top corner?"
    • The Twist: They made it harder by putting three different photos into one big puzzle and asking the robot to track where objects are in each tiny piece.
  • The "Moving" Room (3D & Motion): Imagine watching a video clip. The robot is asked, "Is the camera spinning?" or "Is that sheep walking forward, or is it just standing still while the camera moves?"
    • The Twist: They used a video game (Minecraft) to create these moving scenes because it's cheap and perfect for controlling exactly how things move.

Why use a video game? It's like a "clean room" for experiments. In the real world, photos can be messy. In the game, the researchers know exactly where every object is, so they can be 100% sure if the robot is right or wrong. This also means the robots couldn't have "cheated" by memorizing the answers from their training data.
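This "clean room" idea can be made concrete: because the scene is generated programmatically, the correct answer falls out of the object's known coordinates, with no human labeling. Here is a minimal sketch of that pattern (the function names and details are illustrative, not taken from the paper):

```python
import random

def make_position_question(image_width=640, rng=None):
    """Place an object at a known x-coordinate and derive the
    ground-truth answer from that coordinate -- no ambiguity."""
    rng = rng or random.Random(0)
    x = rng.randint(0, image_width - 1)
    answer = "left" if x < image_width / 2 else "right"
    question = "Is the object on the left or the right side of the image?"
    return {"object_x": x, "question": question, "answer": answer}

def score(model_answer, item):
    """Exact-match scoring is trivial because the ground truth is synthetic."""
    return model_answer.strip().lower() == item["answer"]

item = make_position_question()
print(item["question"], "->", item["answer"])
```

Because the answer is computed rather than annotated, the benchmark can scale to thousands of items with zero labeling noise.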

2. The Results: The "Smart" Robot is Lost in Space

The results were shocking. While humans scored near-perfectly (like getting an A+ on every test), the most advanced AI models struggled badly.

  • The "Simple" Wins: The robots were okay at the easiest tasks, like saying "The ball is in the center." It's like a toddler pointing at a cookie.
  • The "Complex" Failures: As soon as the test involved movement or rotation, the robots got confused.
    • Analogy: Imagine you are in a car turning left. You know the scenery is moving right. But if you asked the robot, "Is the car moving left or is the world moving right?" the robot often got it wrong. It couldn't tell the difference between the camera moving and the object moving.
    • The "Zero" Score: On some of the hardest 3D tasks, the best robots scored almost zero. They were essentially guessing, like a student who didn't study and is just picking "C" on a multiple-choice test.

3. The "Thinking" Trap

The researchers tried a common trick: they told the robots, "Don't just guess; think step-by-step before you answer" (this is called Chain-of-Thought).

  • The Surprise: For simple tasks, thinking helped a little. But for the hard spatial tasks, thinking made it worse!
  • Analogy: It's like asking a confused person to explain why they are confused. Instead of solving the problem, they start making up stories (hallucinations) to justify their wrong answer. The more the robot tried to "reason" about the movement, the more it invented fake physics.
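In practice, Chain-of-Thought is just a change to the prompt wording. Here is a minimal sketch of the two prompting styles being compared; the exact phrasing is illustrative, not the paper's actual prompts:

```python
def direct_prompt(question):
    """Baseline: ask for the answer only."""
    return f"{question}\nAnswer with a single word."

def cot_prompt(question):
    """Chain-of-Thought: ask the model to reason before answering."""
    return (
        f"{question}\n"
        "Think step by step about the positions and motion in the scene, "
        "then give your final answer on the last line."
    )

q = "Is the camera rotating left, rotating right, or static?"
print(direct_prompt(q))
print(cot_prompt(q))
```

The paper's finding is that the second style, which usually helps on math and logic tasks, can actively hurt here: each intermediate "reasoning" step gives the model another chance to hallucinate motion that is not in the video.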

4. Bigger Isn't Always Better

Usually, in AI, if you make the model bigger (more "brain power"), it gets smarter.

  • The Reality: In this spatial test, making the models bigger didn't help much. Even the giant models (with billions of parameters) failed at understanding 3D space.
  • The "Special Training" Myth: Some models were specifically trained on 3D data to be "experts" in space. Surprisingly, these "experts" didn't do much better than the regular models. It's like taking a driving school course but still not knowing how to parallel park.

The Big Takeaway

This paper tells us that while AI is amazing at describing what it sees (like a poet), it is still terrible at understanding the physical world (like a pilot or a surgeon).

  • Current State: AI can tell you "There is a dog."
  • Missing Skill: AI struggles to know "The dog is running towards me, and if I turn left, the dog will be on my right."

The authors conclude that until we fix this, we can't fully trust these robots for things like self-driving cars or robotic arms that need to navigate the real world without crashing. The robots are smart, but they are currently "spatially blind."
