VOILA: Evaluation of MLLMs For Perceptual Understanding and Analogical Reasoning

This paper introduces VOILA, a dynamic benchmark that evaluates multimodal large language models' ability to perform abstract relational reasoning through visual analogies. It finds that current models struggle significantly with inter-image relationships compared to humans, even when multi-step prompting strategies improve their performance.

Nilay Yilmaz, Maitreya Patel, Yiran Lawrence Luo, Tejas Gokhale, Chitta Baral, Suren Jayasuriya, Yezhou Yang

Published 2026-02-26

Imagine you are teaching a robot how to think like a human. You show it a picture of a cat chasing a mouse, and then a picture of a dog chasing a ball. You ask the robot: "If I show you a bird chasing a worm, what happens next?"

A human child would instantly say, "The bird catches the worm!" because they understand the pattern: "Animal A hunts Animal B."

But what if the robot just sees the pictures and says, "I don't know, maybe the bird flies away"? That's the problem researchers are trying to solve with a new test called VOILA.

Here is a simple breakdown of the paper, using everyday analogies.

1. The Problem: Robots Can "See" But Can't "Connect the Dots"

Current AI models (like the ones powering chatbots and image generators) are amazing at describing what they see. If you show them a picture of a dog, they can tell you, "That's a brown dog running."

However, they struggle with analogical reasoning. This is the ability to look at two things, figure out the relationship between them, and apply that same relationship to a new situation.

  • The Analogy: Imagine you are playing a game of "Spot the Difference" with two pairs of photos. Pair A shows a complete change (e.g., a small cup becomes a big cup). Pair B is incomplete: it shows only a small ball. The robot needs to figure out that the ball should become a big ball.
  • The Reality: Most AI models get stuck. They can describe the small cup and the big cup, but they fail to realize the rule is "make it bigger" and apply it to the ball.

2. The Solution: Introducing "VOILA"

The researchers created a massive new test called VOILA (Visual Open-ended Analogical Intelligence and Learning Assessment). Think of VOILA as a giant, dynamic "puzzle factory" for AI.

Instead of giving the AI a multiple-choice quiz (where it just picks A, B, C, or D), VOILA asks the AI to create the answer.

  • The Setup: The AI sees three images:
    1. Image A (e.g., 2 bears driving a car).
    2. Image B (e.g., 2 bears reading a book).
    3. Image C (e.g., 4 rabbits playing soccer).
  • The Task: The AI must figure out the rule between A and B (the bears changed from driving to reading) and apply it to C. So, the AI must generate an image of 4 rabbits reading a book.
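
To make the setup concrete, here is a minimal sketch of how such an analogy query might be posed to a multimodal model through a standard chat API. The model name, image URLs, and prompt wording are assumptions for illustration, and the sketch asks the model to describe the missing image rather than generate it; it is not the paper's evaluation code.

```python
# A minimal sketch (not the paper's actual harness) of posing a VOILA-style
# analogy query to a multimodal chat model. The model name, image URLs, and
# prompt wording are illustrative assumptions; the model is asked to
# *describe* the missing image rather than generate it.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Hypothetical URLs standing in for images A, B, and C.
image_a = "https://example.com/two_bears_driving.png"
image_b = "https://example.com/two_bears_reading.png"
image_c = "https://example.com/four_rabbits_soccer.png"

prompt = (
    "Image A relates to image B by some rule. "
    "Apply the same rule to image C and describe the image that completes the analogy."
)

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url", "image_url": {"url": image_a}},
            {"type": "image_url", "image_url": {"url": image_b}},
            {"type": "image_url", "image_url": {"url": image_c}},
        ],
    }],
)
print(response.choices[0].message.content)  # ideally: "four rabbits reading a book"
```

The open-ended format is the key design choice here: the model has to produce the answer itself instead of picking it from a short list of options.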

3. The "Distraction" Trap

To make the test harder, the researchers added a "Distraction" mode (VOILA-WD).

  • The Metaphor: Imagine a teacher asks, "If 2 apples turn into 4 apples, what happens to 2 oranges?"
  • The Distraction: But on the table, there are also a random banana and a red hat. A smart human ignores the banana and the hat and focuses only on the apples and oranges.
  • The AI Struggle: The AI often gets confused by the extra "noise." It tries to include the banana or the hat in its answer because it can't separate what is important from what is just background clutter.

4. The Results: Humans Win, AI Struggles

The researchers tested the smartest AI models (like GPT-4o and LLaMA) against humans.

  • The Scoreboard:
    • Humans: Got about 70% of the answers right.
    • Best AI: Got about 13% to 29% right.
  • The Takeaway: The AI is like a student who has memorized the dictionary but doesn't understand the story. It can describe the pictures perfectly, but when asked to use logic to solve a puzzle, it gets lost.

5. How to Help the AI Think Better

The paper found a few tricks that help the AI perform better, though it still isn't human-level:

  • The "Step-by-Step" Ladder (Least-to-Most): Instead of asking the AI to "Solve the puzzle," the researchers broke it down:
    1. "Describe the first picture."
    2. "What changed between picture 1 and 2?"
    3. "Apply that change to picture 3."
    4. "Draw the result."
    • Result: This helped the AI a lot, like giving a student a hint sheet (a rough sketch of this step-by-step prompting appears after this list).
  • The "Collage" vs. "Separate" Problem: When the three images were squished together into one big collage, the AI got confused. When shown as three separate images in a row, it did much better. It's like trying to read a book where the pages are glued together vs. flipping through them one by one.

Summary

VOILA is a new, tough test that shows current AI is still a "parrot" rather than a "thinker." It can repeat what it sees, but it struggles to understand the hidden rules connecting different images. While giving the AI step-by-step instructions helps, there is still a huge gap between how humans solve these puzzles and how machines do.

The researchers hope that by using this test, we can build AI that doesn't just "see" the world, but actually understands the logic behind it.
