VB: A Visibility Benchmark for Visibility and Perspective Reasoning in Images

This paper introduces VB, a benchmark designed to evaluate whether vision-language models can determine what is actually visible in an image and abstain from answering when the evidence is insufficient. Using controlled minimal edits and specialized metrics, it finds that top-tier models like GPT-4o and Gemini 3.1 Pro significantly outperform open-source alternatives in confidence-aware accuracy and perspective reasoning.

Neil Tripathi

Published Tue, 10 Ma

Imagine you are playing a game of "What Can You Actually See?" with a robot friend.

In this game, you show the robot a single photo and ask a simple question like, "Is the red ball visible in this picture?" or "Can you read the license plate on that car?"

The robot has three choices:

  1. Yes, I see it clearly. (It's right there in the pixels.)
  2. No, I definitely don't see it. (It's hidden, blurry, or not in the frame.)
  3. I'm not sure. (The picture is too dark, or the object is too small to tell for sure.)

This paper introduces a new test called VB (Visibility Benchmark). It's like a "driver's license exam" for AI robots, but instead of driving a car, they have to prove they know the difference between what is actually visible and what they are just guessing about.

Here is the breakdown of the paper using some everyday analogies:

1. The Core Problem: The "Guessing" Trap

Many AI robots are like over-eager students who will answer any question, even if they don't know the answer. If you show them a photo of a dark room and ask, "Is there a cat in the corner?", a bad AI might guess "Yes" or "No" just to be helpful.

Why is this dangerous?
Imagine an AI helping a blind person navigate a street. If the AI guesses "The crosswalk is clear" when it's actually too dark to see, someone could get hurt. The paper argues that knowing when to say "I don't know" is just as important as knowing the answer.

2. The Test Design: The "Magic Mirror" Game

To test if the robots are smart or just lucky, the researchers created a special game using 100 families of photos.

Think of each "family" as a set of four cards:

  • Card A (The Base): A photo where the answer is clearly "No." (e.g., A sign is covered by a tree).
  • Card B (The Text Flip): The same photo, but the question changes to "Is the sign hidden?" (Now the answer is "Yes").
  • Card C (The Image Flip): A slightly changed photo where the tree is moved, revealing the sign. The question stays the same. (Now the answer is "Yes").
  • Card D (The Double Flip): The changed photo with the changed question. (Now the answer is "No" again).

The Goal: The robot must track that moving the tree (Image Flip) changes the answer, and that rewording the question (Text Flip) also changes it. If these tiny changes confuse the robot, it's not really "seeing" the world; it's just memorizing patterns.
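The four-card idea above can be sketched as a simple scoring check. This is a minimal illustration, not the paper's actual evaluation code: the function name, data layout, and "all four cards must be right" rule are assumptions for the sake of the example.

```python
# Illustrative sketch: a "family" is scored as consistent only if the
# model gets all four cards right. (Hypothetical scoring rule.)

def family_consistent(answers, gold):
    """True only if the model answers all four cards correctly.

    answers/gold: dicts mapping card id ("A".."D") to "yes"/"no"/"unsure".
    """
    return all(answers[c] == gold[c] for c in ("A", "B", "C", "D"))

# The sign-behind-a-tree family from the text:
gold = {"A": "no",   # base photo, "Is the sign visible?"
        "B": "yes",  # same photo, "Is the sign hidden?"
        "C": "yes",  # tree moved, "Is the sign visible?"
        "D": "no"}   # tree moved, "Is the sign hidden?"

# A pattern-matcher that ignores the image edit keeps its Card A/B
# answers on Cards C/D, and fails the family:
pattern_matcher = {"A": "no", "B": "yes", "C": "no", "D": "yes"}

print(family_consistent(gold, gold))             # True
print(family_consistent(pattern_matcher, gold))  # False
```

The point of family-level scoring is that a lucky guesser can get individual cards right, but all four answers only line up if the model actually tracks both the pixels and the wording.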

3. The "Second-Order" Puzzle: The Detective Work

Some questions in the test are trickier. They ask about what other people in the photo can see.

  • Example: "Does Bob know that Alice can't see the card?"

This is like a game of "Theory of Mind." The robot has to look at the photo, figure out where Bob is looking, where Alice is looking, and what is blocked from their view. It's not just about the robot's eyes; it's about simulating other people's eyes.
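One way to picture the simulation the robot has to do is a toy 2-D line-of-sight check, where "seeing" means an unblocked straight line to the target. The scene layout, coordinates, and helper names below are invented for illustration; real images are far messier than segments and walls.

```python
# Toy 2-D sketch of nested visibility reasoning (illustrative only).

def segments_intersect(p1, p2, p3, p4):
    """True if segment p1-p2 properly crosses segment p3-p4."""
    def cross(o, a, b):
        return (a[0]-o[0])*(b[1]-o[1]) - (a[1]-o[1])*(b[0]-o[0])
    d1 = cross(p3, p4, p1)
    d2 = cross(p3, p4, p2)
    d3 = cross(p1, p2, p3)
    d4 = cross(p1, p2, p4)
    return (d1 * d2 < 0) and (d3 * d4 < 0)

def can_see(viewer, target, walls):
    """Viewer sees target if no wall segment blocks the line between them."""
    return not any(segments_intersect(viewer, target, a, b) for a, b in walls)

# Scene: a wall sits between Alice and the card, but Bob has a clear
# view of both Alice and the card.
alice, bob, card = (0, 0), (4, 4), (8, 0)
walls = [((4, -1), (4, 1))]  # short vertical wall at x = 4

alice_sees_card = can_see(alice, card, walls)  # blocked by the wall
bob_sees_alice = can_see(bob, alice, walls)
bob_sees_card = can_see(bob, card, walls)
print(alice_sees_card, bob_sees_alice, bob_sees_card)  # False True True
```

Here the first-order question ("Can Alice see the card?") is one ray test, but the second-order question ("Does Bob know Alice can't see it?") requires Bob to see Alice, the card, and the wall between them, which is exactly the nested reasoning the benchmark probes.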

4. The Results: Who Passed the Exam?

The researchers tested 9 different AI models (some from big tech companies like Google and OpenAI, and some open-source ones).

  • The Top Students: GPT-4o and Gemini 3.1 Pro tied for first place. They were the best at saying "I see it," "I don't see it," or "I'm not sure" at the right times.
  • The "Honest" Student: One model (GPT-5) was very cautious. It said "I don't know" a lot. While this is safe, it missed out on points because it didn't answer enough questions.
  • The Open-Source Surprise: The best "free" model (Gemma 3 12B) did surprisingly well. It actually beat one of the older, paid models from a big company. This is like a home-cooked meal tasting better than a fast-food chain's old recipe.
  • The Weakness: Most robots were better at understanding changes in the words (Text Flip) than changes in the picture (Image Flip). It's like they are great at reading a menu but bad at noticing if the waiter actually brought the right dish.

5. Why This Matters

This paper isn't just about getting a high score. It's about safety.

  • Confidence Calibration: The paper checks if the robot's "confidence score" matches reality. If a robot says "I'm 99% sure" but is actually wrong, that's dangerous. The best robots (like Gemini 3.1 Pro) only say "I'm sure" when they are actually right.
  • The "Abstain" Option: The test rewards robots for saying "I can't tell" when the evidence is missing. In the real world, a robot that refuses to guess when it's dark is safer than one that guesses and crashes.

The Bottom Line

The VB Benchmark is a new way to teach AI to be humble. It forces them to admit when they can't see something, rather than making up facts. The results show that while the smartest AI models are getting very good at this, they still struggle with subtle visual changes and understanding what other people in a photo can see.

It's a step toward building AI that doesn't just "hallucinate" answers, but acts like a careful, observant human who knows the limits of their own vision.