OmniSpatial: Towards Comprehensive Spatial Reasoning Benchmark for Vision Language Models

This paper introduces OmniSpatial, a comprehensive benchmark grounded in cognitive psychology, with over 8.4K annotated samples across four major categories. It reveals significant limitations in current vision-language models' spatial reasoning and explores strategies such as PointGraph and SpatialCoT to address them.

Mengdi Jia, Zekun Qi, Shaochen Zhang, Wenyao Zhang, Xinqiang Yu, Jiawei He, He Wang, Li Yi

Published 2026-03-03

Imagine you are teaching a robot to navigate the world. You might think the hardest part is recognizing a chair or a cup. But in reality, the real challenge isn't just seeing the object; it's understanding how that object moves, how it fits in a box, how it looks from your neighbor's window, and what happens if you push it.

This paper, "OmniSpatial," is like a giant, rigorous "driver's license test" for an AI's eyes and brain. It argues that while current AI models are great at simple tasks (like saying "the cat is on the mat"), they are still terrible at complex spatial reasoning.

Here is a breakdown of the paper using simple analogies:

1. The Problem: The AI is a "Flat Earther"

Current AI models are like people who have only ever looked at a photograph of a room. They know where things are in that one picture. But if you ask them, "If I walk around the table, what will I see?" or "If I drop this glass, will it break on the floor or the table?", they often get it wrong.

The authors found that even the smartest AI models (like the latest versions of GPT or Gemini) are hitting a "ceiling." They are scoring 90%+ on old tests, but those tests were too easy—like asking a math genius to do simple addition. They need harder tests to see if the AI can actually think in 3D space.

2. The Solution: The "OmniSpatial" Gym

The team built a new, massive gym called OmniSpatial. Instead of just asking "What is this?", they ask questions that require the AI to do mental gymnastics. They organized these challenges into four main categories:

  • Dynamic Reasoning (The Action Movie):
    • The Analogy: Imagine watching a car chase. Can the AI predict where the car will be in 5 seconds? Can it tell if a pedestrian is about to step into the road?
    • The Test: The AI looks at a traffic scene and has to predict future movements or spot hidden dangers.
  • Complex Spatial Logic (The Origami Master):
    • The Analogy: Think of a puzzle where you have to unfold a piece of paper in your head to see the pattern, or figure out how to fit a weirdly shaped sofa through a narrow door.
    • The Test: The AI is shown a folded paper or a 3D shape and must figure out what it looks like when unfolded or rotated.
  • Spatial Interaction (The Puzzle Solver):
    • The Analogy: Imagine trying to park a car in a tight spot or figuring out which button to press on a remote control based on where your hand is.
    • The Test: The AI has to plan a path, avoid obstacles, or figure out how to manipulate objects in a cluttered room.
  • Perspective Taking (The Empathy Engine):
    • The Analogy: This is the hardest one. Imagine you are standing in a room, but the AI has to answer, "What does the person standing behind you see?" It requires the AI to leave its own "viewpoint" and mentally step into someone else's shoes (or camera lens).
    • The Test: The AI looks at a photo and has to describe the scene from a completely different angle that isn't shown in the picture.
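Under the hood, a benchmark like this boils down to multiple-choice questions scored for accuracy. The sketch below shows one plausible way to represent such samples and compute a score; the `Sample` schema and the field names are illustrative assumptions, not the paper's actual data format.

```python
# Minimal sketch of scoring a multiple-choice spatial benchmark.
# The schema below is a hypothetical stand-in for OmniSpatial's format.
from dataclasses import dataclass

@dataclass
class Sample:
    image_path: str    # path to the scene image
    question: str      # e.g. "What does the person behind you see?"
    choices: list[str] # the answer options shown to the model
    answer: int        # index of the correct choice
    category: str      # one of the four task categories

def accuracy(samples, predict):
    """predict(sample) -> index of the option the model chose."""
    correct = sum(1 for s in samples if predict(s) == s.answer)
    return correct / len(samples)

# Toy run with a trivial "always pick the first option" baseline.
toy = [
    Sample("scene1.jpg", "Which exit is left of the viewer?",
           ["door", "window"], 0, "perspective_taking"),
    Sample("scene2.jpg", "Will the ball roll off the table?",
           ["yes", "no"], 1, "dynamic_reasoning"),
]
print(accuracy(toy, lambda s: 0))  # 0.5
```

Reporting accuracy per `category`, rather than one overall number, is what lets the authors say *which* kind of spatial reasoning a model fails at.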

3. The Results: The AI is Still a Toddler

When they ran their "driver's license test" on the world's best AI models, the results were humbling.

  • Human Performance: Humans scored about 92%. We are naturally good at this because we live in a 3D world.
  • AI Performance: The best AI models only scored around 55-57%.
  • The Gap: The AI is still struggling with things humans do instinctively, like imagining a room from the back or predicting how a falling object will bounce. It's like a toddler who can name colors but can't yet tie their shoes.

4. The Fix: Giving the AI "Training Wheels"

The authors didn't just stop at pointing out the problem; they tried to fix it with two clever tricks:

  • PointGraph (The "Cheat Sheet"):
    • How it works: Instead of just showing the AI the picture, they give it a "map" of the objects. It's like giving a student a diagram with arrows pointing to "Chair," "Table," and "Door" before asking the question.
    • Result: This helped the AI understand the layout better, boosting its score slightly.
  • SpatialCoT (The "Mental Movie"):
    • How it works: Humans solve spatial problems by imagining the scene from different angles. The authors used a tool to generate new pictures of the scene from different angles (like turning a 3D model) and showed these to the AI.
    • Result: This was a huge help. By giving the AI a "movie" of the scene from different views, it could finally "see" the answer it was missing.
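The two tricks above can be sketched as prompt-construction steps. In this hedged sketch, `segment_objects` and `render_novel_view` are toy stubs standing in for a real segmentation model and a real novel-view synthesis model; none of these names come from the paper's released code.

```python
# Illustrative sketch of PointGraph ("the map") and SpatialCoT
# ("the mental movie"). Model calls are replaced by toy stubs.

def segment_objects(image):
    # Stub: a real system would run an object detector/segmenter here.
    return [{"label": "chair", "center": (120, 340)},
            {"label": "table", "center": (260, 300)}]

def render_novel_view(image, angle):
    # Stub: a real system would synthesize the scene from a new camera angle.
    return f"{image}@{angle}deg"

def build_point_graph(image):
    """PointGraph idea: an explicit object 'map' prepended to the question."""
    lines = [f"- {o['label']} at pixel {o['center']}"
             for o in segment_objects(image)]
    return "Scene objects:\n" + "\n".join(lines)

def spatial_cot_prompt(image, question):
    """SpatialCoT idea: the original view plus synthesized extra views,
    all passed to the VLM together with the PointGraph text."""
    views = [image] + [render_novel_view(image, a) for a in (-30, 30)]
    prompt = build_point_graph(image) + "\nQuestion: " + question
    return views, prompt

views, prompt = spatial_cot_prompt("scene.jpg", "What is left of the chair?")
print(len(views))              # 3 views: original plus two new angles
print(prompt.splitlines()[0])  # "Scene objects:"
```

The design point is that neither trick changes the model itself: both just enrich the input, one with structured text, the other with extra images.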

The Big Picture

This paper is a wake-up call. We are building AI that can write poetry and code, but it is still clumsy when it comes to the physical world. If we want robots to drive cars, fold laundry, or help the elderly, they need to pass the OmniSpatial test.

In short: We built a harder test, found that AI is still failing it, and showed that giving AI a "mental map" and "multiple camera angles" helps it get smarter. It's the first step toward building AI that doesn't just see the world, but truly understands it.