VOILA: Evaluation of MLLMs For Perceptual Understanding and Analogical Reasoning

This paper introduces VOILA, a dynamic benchmark that evaluates multimodal large language models' ability to perform abstract relational reasoning through visual analogies. It finds that current models struggle significantly with inter-image relationships compared to humans, even when multi-step prompting strategies improve their performance.

Nilay Yilmaz, Maitreya Patel, Yiran Lawrence Luo, Tejas Gokhale, Chitta Baral, Suren Jayasuriya, Yezhou Yang

Published 2026-02-26

Imagine you are teaching a robot how to think like a human. You show it a picture of a cat chasing a mouse, and then a picture of a dog chasing a ball. You ask the robot: "If I show you a bird chasing a worm, what happens next?"

A human child would instantly say, "The bird catches the worm!" because they understand the pattern: "Animal A hunts Animal B."

But what if the robot just sees the pictures and says, "I don't know, maybe the bird flies away"? That's the problem researchers are trying to solve with a new test called VOILA.

Here is a simple breakdown of the paper, using everyday analogies.

1. The Problem: Robots Can "See" But Can't "Connect the Dots"

Current AI models (like the ones powering chatbots and image generators) are amazing at describing what they see. If you show them a picture of a dog, they can tell you, "That's a brown dog running."

However, they struggle with analogical reasoning. This is the ability to look at two things, figure out the relationship between them, and apply that same relationship to a new situation.

  • The Analogy: Imagine you are playing a game of "Spot the Difference" with two pairs of photos. Pair A shows a complete change (e.g., a small cup becomes a big cup). Pair B is incomplete: it shows only a small ball. The robot needs to figure out that the ball should become a big ball.
  • The Reality: Most AI models get stuck. They can describe the small cup and the big cup, but they fail to realize the rule is "make it bigger" and apply it to the ball.

2. The Solution: Introducing "VOILA"

The researchers created a massive new test called VOILA (Visual Open-ended Analogical Intelligence and Learning Assessment). Think of VOILA as a giant, dynamic "puzzle factory" for AI.

Instead of giving the AI a multiple-choice quiz (where it just picks A, B, C, or D), VOILA asks the AI to create the answer.

  • The Setup: The AI sees three images:
    1. Image A (e.g., 2 bears driving a car).
    2. Image B (e.g., 2 bears reading a book).
    3. Image C (e.g., 4 rabbits playing soccer).
  • The Task: The AI must figure out the rule between A and B (the bears changed from driving to reading) and apply it to C. So, the AI must generate an image of 4 rabbits reading a book.
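
To make the setup concrete, here is a minimal sketch of how such an analogy query might be posed to a multimodal model through a standard chat API. The model name, image URLs, and prompt wording are assumptions for illustration, and the sketch asks the model to describe the missing image rather than generate it; it is not the paper's evaluation code.

```python
# A minimal sketch (not the paper's actual harness) of posing a VOILA-style
# analogy query to a multimodal chat model. The model name, image URLs, and
# prompt wording are illustrative assumptions; the model is asked to
# *describe* the missing image rather than generate it.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Hypothetical URLs standing in for images A, B, and C.
image_a = "https://example.com/two_bears_driving.png"
image_b = "https://example.com/two_bears_reading.png"
image_c = "https://example.com/four_rabbits_soccer.png"

prompt = (
    "Image A relates to image B by some rule. "
    "Apply the same rule to image C and describe the image that completes the analogy."
)

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url", "image_url": {"url": image_a}},
            {"type": "image_url", "image_url": {"url": image_b}},
            {"type": "image_url", "image_url": {"url": image_c}},
        ],
    }],
)
print(response.choices[0].message.content)  # ideally: "four rabbits reading a book"
```

The open-ended format is the key design choice here: the model has to produce the answer itself instead of picking it from a short list of options.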

3. The "Distraction" Trap

To make the test harder, the researchers added a "Distraction" mode (VOILA-WD).

  • The Metaphor: Imagine a teacher asks, "If 2 apples turn into 4 apples, what happens to 2 oranges?"
  • The Distraction: But on the table, there are also a random banana and a red hat. A smart human ignores the banana and the hat and focuses only on the apples and oranges.
  • The AI Struggle: The AI often gets confused by the extra "noise." It tries to include the banana or the hat in its answer because it can't separate what is important from what is just background clutter.

4. The Results: Humans Win, AI Struggles

The researchers tested the smartest AI models (like GPT-4o and LLaMA) against humans.

  • The Scoreboard:
    • Humans: Got about 70% of the answers right.
    • Best AI: Got about 13% to 29% right.
  • The Takeaway: The AI is like a student who has memorized the dictionary but doesn't understand the story. It can describe the pictures perfectly, but when asked to use logic to solve a puzzle, it gets lost.

5. How to Help the AI Think Better

The paper found a few tricks that help the AI perform better, though it still isn't human-level:

  • The "Step-by-Step" Ladder (Least-to-Most): Instead of asking the AI to "Solve the puzzle," the researchers broke it down:
    1. "Describe the first picture."
    2. "What changed between picture 1 and 2?"
    3. "Apply that change to picture 3."
    4. "Draw the result."
    • Result: This helped the AI a lot, like giving a student a hint sheet (a rough sketch of this step-by-step prompting appears after this list).
  • The "Collage" vs. "Separate" Problem: When the three images were squished together into one big collage, the AI got confused. When shown as three separate images in a row, it did much better. It's like trying to read a book where the pages are glued together vs. flipping through them one by one.

Summary

VOILA is a new, tough test that shows current AI is still a "parrot" rather than a "thinker." It can repeat what it sees, but it struggles to understand the hidden rules connecting different images. While giving the AI step-by-step instructions helps, there is still a huge gap between how humans solve these puzzles and how machines do.

The researchers hope that by using this test, we can build AI that doesn't just "see" the world, but actually understands the logic behind it.
