VB: A Visibility Benchmark for Visibility and Perspective Reasoning in Images

This paper introduces VB, a benchmark designed to evaluate whether vision-language models can determine what is actually visible in an image and abstain from answering when the evidence is insufficient. Using controlled minimal edits and specialized metrics, it finds that top-tier models like GPT-4o and Gemini 3.1 Pro significantly outperform open-source alternatives in confidence-aware accuracy and perspective reasoning.

Neil Tripathi

Published Tue, 10 Ma

Imagine you are playing a game of "What Can You Actually See?" with a robot friend.

In this game, you show the robot a single photo and ask a simple question like, "Is the red ball visible in this picture?" or "Can you read the license plate on that car?"

The robot has three choices:

  1. Yes, I see it clearly. (It's right there in the pixels.)
  2. No, I definitely don't see it. (It's hidden, blurry, or not in the frame.)
  3. I'm not sure. (The picture is too dark, or the object is too small to tell for sure.)

This paper introduces a new test called VB (Visibility Benchmark). It's like a "driver's license exam" for AI robots, but instead of driving a car, they have to prove they know the difference between what is actually visible and what they are just guessing about.

Here is the breakdown of the paper using some everyday analogies:

1. The Core Problem: The "Guessing" Trap

Many AI robots are like over-eager students who will answer any question, even if they don't know the answer. If you show them a photo of a dark room and ask, "Is there a cat in the corner?", a bad AI might guess "Yes" or "No" just to be helpful.

Why is this dangerous?
Imagine an AI helping a blind person navigate a street. If the AI guesses "The crosswalk is clear" when it's actually too dark to see, someone could get hurt. The paper argues that knowing when to say "I don't know" is just as important as knowing the answer.

2. The Test Design: The "Magic Mirror" Game

To test if the robots are smart or just lucky, the researchers created a special game using 100 families of photos.

Think of each "family" as a set of four cards:

  • Card A (The Base): A photo where the answer is clearly "No." (e.g., A sign is covered by a tree).
  • Card B (The Text Flip): The same photo, but the question changes to "Is the sign hidden?" (Now the answer is "Yes").
  • Card C (The Image Flip): A slightly changed photo where the tree is moved, revealing the sign. The question stays the same. (Now the answer is "Yes").
  • Card D (The Double Flip): The changed photo with the changed question. (Now the answer is "No" again).

The Goal: The robot must track that moving the tree (Image Flip) changes the answer, and that rewording the question (Text Flip) also changes it. If these tiny changes confuse the robot, it's not really "seeing" the world; it's just memorizing patterns.
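The four-card idea above can be sketched as a simple scoring check. This is a minimal illustration, not the paper's actual evaluation code: the function name, data layout, and "all four cards must be right" rule are assumptions for the sake of the example.

```python
# Illustrative sketch: a "family" is scored as consistent only if the
# model gets all four cards right. (Hypothetical scoring rule.)

def family_consistent(answers, gold):
    """True only if the model answers all four cards correctly.

    answers/gold: dicts mapping card id ("A".."D") to "yes"/"no"/"unsure".
    """
    return all(answers[c] == gold[c] for c in ("A", "B", "C", "D"))

# The sign-behind-a-tree family from the text:
gold = {"A": "no",   # base photo, "Is the sign visible?"
        "B": "yes",  # same photo, "Is the sign hidden?"
        "C": "yes",  # tree moved, "Is the sign visible?"
        "D": "no"}   # tree moved, "Is the sign hidden?"

# A pattern-matcher that ignores the image edit keeps its Card A/B
# answers on Cards C/D, and fails the family:
pattern_matcher = {"A": "no", "B": "yes", "C": "no", "D": "yes"}

print(family_consistent(gold, gold))             # True
print(family_consistent(pattern_matcher, gold))  # False
```

The point of family-level scoring is that a lucky guesser can get individual cards right, but all four answers only line up if the model actually tracks both the pixels and the wording.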

3. The "Second-Order" Puzzle: The Detective Work

Some questions in the test are trickier. They ask about what other people in the photo can see.

  • Example: "Does Bob know that Alice can't see the card?"

This is like a game of "Theory of Mind." The robot has to look at the photo, figure out where Bob is looking, where Alice is looking, and what is blocked from their view. It's not just about the robot's eyes; it's about simulating other people's eyes.
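One way to picture the simulation the robot has to do is a toy 2-D line-of-sight check, where "seeing" means an unblocked straight line to the target. The scene layout, coordinates, and helper names below are invented for illustration; real images are far messier than segments and walls.

```python
# Toy 2-D sketch of nested visibility reasoning (illustrative only).

def segments_intersect(p1, p2, p3, p4):
    """True if segment p1-p2 properly crosses segment p3-p4."""
    def cross(o, a, b):
        return (a[0]-o[0])*(b[1]-o[1]) - (a[1]-o[1])*(b[0]-o[0])
    d1 = cross(p3, p4, p1)
    d2 = cross(p3, p4, p2)
    d3 = cross(p1, p2, p3)
    d4 = cross(p1, p2, p4)
    return (d1 * d2 < 0) and (d3 * d4 < 0)

def can_see(viewer, target, walls):
    """Viewer sees target if no wall segment blocks the line between them."""
    return not any(segments_intersect(viewer, target, a, b) for a, b in walls)

# Scene: a wall sits between Alice and the card, but Bob has a clear
# view of both Alice and the card.
alice, bob, card = (0, 0), (4, 4), (8, 0)
walls = [((4, -1), (4, 1))]  # short vertical wall at x = 4

alice_sees_card = can_see(alice, card, walls)  # blocked by the wall
bob_sees_alice = can_see(bob, alice, walls)
bob_sees_card = can_see(bob, card, walls)
print(alice_sees_card, bob_sees_alice, bob_sees_card)  # False True True
```

Here the first-order question ("Can Alice see the card?") is one ray test, but the second-order question ("Does Bob know Alice can't see it?") requires Bob to see Alice, the card, and the wall between them, which is exactly the nested reasoning the benchmark probes.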

4. The Results: Who Passed the Exam?

The researchers tested 9 different AI models (some from big tech companies like Google and OpenAI, and some open-source ones).

  • The Top Students: GPT-4o and Gemini 3.1 Pro tied for first place. They were the best at saying "I see it," "I don't see it," or "I'm not sure" at the right times.
  • The "Honest" Student: One model (GPT-5) was very cautious. It said "I don't know" a lot. While this is safe, it missed out on points because it didn't answer enough questions.
  • The Open-Source Surprise: The best "free" model (Gemma 3 12B) did surprisingly well. It actually beat one of the older, paid models from a big company. This is like a home-cooked meal tasting better than a fast-food chain's old recipe.
  • The Weakness: Most robots were better at understanding changes in the words (Text Flip) than changes in the picture (Image Flip). It's like they are great at reading a menu but bad at noticing if the waiter actually brought the right dish.

5. Why This Matters

This paper isn't just about getting a high score. It's about safety.

  • Confidence Calibration: The paper checks if the robot's "confidence score" matches reality. If a robot says "I'm 99% sure" but is actually wrong, that's dangerous. The best robots (like Gemini 3.1 Pro) only say "I'm sure" when they are actually right.
  • The "Abstain" Option: The test rewards robots for saying "I can't tell" when the evidence is missing. In the real world, a robot that refuses to guess when it's dark is safer than one that guesses and crashes.

The Bottom Line

The VB Benchmark is a new way to teach AI to be humble. It forces them to admit when they can't see something, rather than making up facts. The results show that while the smartest AI models are getting very good at this, they still struggle with subtle visual changes and understanding what other people in a photo can see.

It's a step toward building AI that doesn't just "hallucinate" answers, but acts like a careful, observant human who knows the limits of their own vision.