Multimodal Integration of Human-Like Attention in Visual Question Answering

The paper introduces MULAN, a novel Visual Question Answering model that integrates human-like attention from both image and text modalities into neural self-attention layers, achieving state-of-the-art performance on the VQAv2 dataset with significantly fewer trainable parameters than prior methods.

Ekta Sood, Fabian Kögel, Philipp Müller, Dominike Thomas, Mihai Bace, Andreas Bulling

Published 2026-03-04

Imagine you are trying to solve a mystery. You have a picture of a scene and a question about it, like "What is the child digging in?" To solve this, you need to look at the picture and read the question carefully, figuring out which parts of the image matter most for that specific question.

This is exactly what computers do in a field called Visual Question Answering (VQA). But often, computers get lazy. They might look at the whole picture at once or guess the answer based on common habits (like assuming a child is always digging in a sandbox) rather than actually looking at the specific details.

Here is a simple breakdown of the paper's solution, MULAN, using some everyday analogies:

1. The Problem: The "Lazy Detective"

Current AI models are like detectives who have a bad habit of "jumping to conclusions."

  • The Image: They might look at the whole photo but miss the tiny detail that answers the question.
  • The Text: They might read the question but skip over important words, focusing only on the first few words they see.
  • The Result: They get the answer wrong because they didn't pay attention to the right clues.

2. The Solution: Hiring a Human Guide

The researchers realized that humans are naturally good at knowing what to look at. When you see a picture of a child digging, your eyes naturally zoom in on the shovel and the dirt, ignoring the background trees.

The paper introduces MULAN (Multimodal Human-like Attention Network). Think of MULAN as a super-intelligent AI detective who has hired a human guide to help them focus.

  • The Human Guide (Text): There is a special "eye" that watches how humans read. It knows that in the question "What is the child digging in?", the words "digging" and "in" are the most important clues. MULAN uses this guide to make the AI pay extra attention to those specific words.
  • The Human Guide (Image): There is another "eye" that watches how humans look at photos. It knows to look at the child's hands and the object they are holding, rather than the sky or the grass. MULAN uses this guide to make the AI zoom in on the right spot in the picture.
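The two guides above can be thought of as probability distributions: one over the words of the question, one over the regions of the image, each derived from how long humans dwell on each item. Here is a minimal sketch of that idea; the fixation numbers and the `to_attention` helper are hypothetical illustrations, not data or code from the paper.

```python
import numpy as np

# Hypothetical fixation data (not from the paper's corpora): how long a
# human reader/viewer dwelt on each word and each detected image region.
words = ["What", "is", "the", "child", "digging", "in", "?"]
word_fixations_ms = np.array([80, 40, 30, 190, 310, 220, 20], dtype=float)

regions = ["sky", "trees", "child", "shovel", "ground"]
region_fixations_ms = np.array([50, 90, 400, 520, 300], dtype=float)

def to_attention(fixations):
    """Normalize raw dwell times into a distribution that sums to 1."""
    return fixations / fixations.sum()

text_guide = to_attention(word_fixations_ms)    # peaks at "digging"
image_guide = to_attention(region_fixations_ms)  # peaks at "shovel"
```

Each guide now says, in numbers, exactly what the analogy says in words: "digging" and the shovel matter most; "the" and the sky barely matter at all.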

3. How It Works: The "Traffic Cop"

In the old AI models, the computer tried to figure out what was important all by itself. It was like a chaotic traffic intersection where every car (piece of information) was trying to go everywhere at once.

MULAN installs a Traffic Cop (the human attention signal) at the intersection.

  • When the AI tries to process the question, the Traffic Cop waves the important words forward and slows down the unimportant ones.
  • When the AI looks at the image, the Traffic Cop points directly at the relevant object and tells the AI, "Look here! Ignore the rest!"

By combining the guide for the text and the guide for the image, MULAN forces the AI to look at the picture and the question together, exactly how a human would.
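The "traffic cop" metaphor can be made concrete with a small NumPy sketch. This is one plausible way to fold a human attention distribution into a self-attention layer, biasing the pre-softmax scores toward tokens humans fixate on; the function names, the `alpha` weight, and the log-bias formulation are assumptions for illustration, not the paper's actual integration scheme.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def guided_self_attention(Q, K, V, human_attn, alpha=1.0):
    """Single-head self-attention whose weights are nudged toward a
    human attention distribution over the tokens (the "traffic cop")."""
    d = Q.shape[-1]
    logits = Q @ K.T / np.sqrt(d)           # standard scaled dot-product scores
    # Wave important tokens forward: add the (log of the) human attention
    # so the bias lives on the same pre-softmax scale as the logits.
    logits = logits + alpha * np.log(human_attn + 1e-9)
    weights = softmax(logits, axis=-1)      # each row is a distribution
    return weights @ V, weights

rng = np.random.default_rng(0)
n, d = 6, 8                                 # 6 tokens, 8-dim embeddings
Q, K, V = (rng.normal(size=(n, d)) for _ in range(3))
# Toy human attention: tokens 3 and 4 (say, "digging" and "in") dominate.
human_attn = np.array([0.02, 0.05, 0.08, 0.45, 0.35, 0.05])
out, w = guided_self_attention(Q, K, V, human_attn)
```

With the bias in place, every query token ends up spending most of its attention budget on the tokens the human guide singled out, rather than spreading it evenly or following spurious correlations.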

4. The Results: Smarter and Faster

The researchers tested this new system on a very difficult test called VQAv2.

  • Better Scores: MULAN achieved state-of-the-art accuracy on VQAv2 (about 74%), outperforming previous models.
  • Lighter Weight: Usually, to get smarter, AI models need to be huge and heavy (like a giant truck). But because MULAN uses these human guides to do the heavy lifting, the model itself can be much smaller and lighter (about 80% fewer trainable parameters, the AI's "brain cells"). It's like upgrading a bicycle with a turbocharger instead of building a massive truck.

5. Why It Matters: Solving the "Long Question" Problem

One of the biggest tricks AI used to play was ignoring long questions. If you asked, "What is the color of the shirt the man in the red hat is wearing?", the AI would often just guess "red" because it saw a red hat, ignoring the rest of the sentence.

MULAN is much better at this. Because it has the human guide telling it to read the whole sentence and look at the whole picture, it can handle long, tricky questions much better. It stops "jumping to conclusions" and actually solves the puzzle.

In a Nutshell

MULAN is a new way to teach computers to see and read like humans. Instead of guessing, it uses "human eye-tracking" data as a cheat sheet to know exactly where to look and what words to focus on. The result is a smarter, faster, and more accurate AI that can answer complex questions about images without getting distracted.