MaS-VQA: A Mask-and-Select Framework for Knowledge-Based Visual Question Answering

Imagine you are a detective trying to solve a mystery, but you have two very different sources of information:

Your Eyes: You can see the crime scene (the image), but sometimes it's blurry, or there are too many distractions (like a messy room full of junk).
Your Library: You can look up facts in a giant encyclopedia (retrieved knowledge), but the librarian sometimes hands you the wrong books, or books with pages torn out and irrelevant stories mixed in.

The Problem:
Most current "AI detectives" (called KB-VQA models) try to solve the mystery by just dumping all the books they find and all the things they see into a giant pile. They try to read everything at once.

Result: They get overwhelmed. The noise drowns out the clues. They might guess the wrong answer because they got distracted by a red herring in the picture or a confusing sentence in the book.

The Solution: MaS-VQA (The Smart Detective)
The authors of this paper created a new framework called MaS-VQA. Think of it as a detective who doesn't just read everything but relies on a special filtering system and a brain trust.

Here is how it works, using simple analogies:

1. The "Mask-and-Select" Filter (Cleaning the Mess)

Before the detective tries to solve the case, they use a two-step cleaning process:

The "Mask" (Cleaning the Picture): Imagine looking at a photo of a forest. There are trees, birds, rocks, and a hidden path. The detective puts on special glasses (the Mask) that turn the boring rocks and random birds into gray static. They only let the path and the footprints shine through in color.
- In tech terms: This is the Attention Mask. It tells the AI, "Ignore the background noise in the image; only look at the parts that match the question."
The "Select" (Cleaning the Books): Now, imagine the librarian hands you a 500-page book about "Fruits." But your question is specifically about "Apples." The detective uses a highlighter (the Select mechanism) to circle only the three paragraphs about apples and crosses out the rest.
- In tech terms: This is Phrase Selection. It cuts out the irrelevant sentences from the retrieved text, keeping only the "high-signal" facts.

2. The "Brain Trust" (Connecting the Dots)

After cleaning up the picture and the text, the detective still might not have the full answer. Maybe the text says "Apples grow on trees," and the picture shows a tree, but it doesn't say who eats them.

This is where the Implicit Knowledge comes in.

Think of the AI's internal brain as a wise old librarian who has read every book in the world but doesn't have them on the desk right now.
The detective takes the cleaned picture and the highlighted text and asks the wise librarian: "Based on these specific clues, what else do you know that fits?"
Because the clues are so clean and focused, the librarian doesn't get confused. They can pull out the perfect piece of internal knowledge (like "Native Americans used these berries as food") to fill the gap.

3. The Final Verdict

Finally, the detective combines the Cleaned Visuals, the Cleaned Text, and the Librarian's Insight to give the final answer.

Why is this better?

Old Way: "Here is a messy room and a 500-page book. Guess the answer!" (Result: Confusion).
MaS-VQA Way: "Here is the one clue in the room and the one sentence in the book. Now, use your brain to connect them." (Result: Accuracy).

Real-World Example from the Paper

Imagine a question: "Who used the fruit of this plant as food?"

The Image: Shows a bush with red berries.
The Library: Hands you an article about the plant, but it also talks about its medicinal uses, its history in Europe, and how to prune it.
Without MaS-VQA: The AI gets confused by the pruning instructions and guesses "Gardeners."
With MaS-VQA:
1. Mask: It ignores the leaves in the background and focuses on the red berries.
2. Select: It highlights the sentence: "Native Americans ate the fruit fresh."
3. Brain Trust: It combines "Red berries" + "Native Americans ate them" to confidently answer: "Native Americans."

Summary

MaS-VQA is like a super-efficient assistant who knows how to ignore the noise, focus on the important clues, and then ask the right follow-up questions to get the perfect answer. It stops the AI from getting distracted and helps it reason much better, even when the information it finds is messy or incomplete.

1. Problem Statement

Knowledge-Based Visual Question Answering (KB-VQA) requires models to answer questions by integrating visual information from an image with external knowledge (e.g., encyclopedic facts, commonsense). While existing methods have made progress, they face a critical challenge: noise and misalignment.

Noisy Inputs: Retrieved external knowledge is often partially irrelevant, semantically duplicated, or misaligned with the specific visual content. Similarly, visual region detectors often produce overlapping or irrelevant candidates.
Ineffective Integration: Current approaches (Explicit, Implicit, or Hybrid) often treat visual and textual relevance independently or use coarse filtering. This leads to "noise accumulation," where irrelevant evidence distracts the model, hindering the effective coupling of explicit knowledge (retrieved facts) and implicit knowledge (parametric knowledge within Large Multimodal Models).
Goal: To develop a framework that tightly couples explicit knowledge filtering with implicit reasoning to produce robust, accurate answers under noisy retrieval conditions.

2. Methodology: MaS-VQA

The authors propose MaS-VQA, a selection-driven framework that operates in three main stages: Retrieval, Explicit Knowledge Processing (Mask-and-Select), and Implicit Knowledge Processing.

A. Task Formulation

Given an image $I$ and a question $Q$, the model predicts an answer $\hat{A}$ by leveraging an external knowledge base $K$. The framework constructs an Explicit Knowledge Package $E = \{T, k, M\}$ (the retrieved passages $T$, the selected phrases $k$, and the attention mask $M$) and an Implicit Knowledge Paragraph $U$.
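To make the notation concrete, the quantities above can be pictured as a small data structure. This is only a sketch: the class names, field names, the patch-level mask layout, and the file name `berries.jpg` are illustrative assumptions, not taken from the paper.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ExplicitPackage:
    """Explicit Knowledge Package E = {T, k, M}; field names are illustrative."""
    passages: List[str]    # T: retrieved candidate passages
    phrases: List[str]     # k: phrases kept by question-conditioned selection
    mask: List[int]        # M: binary mask over image patches (1 = keep)


@dataclass
class KBVQAInstance:
    """One KB-VQA instance plus the intermediate artifacts described above."""
    image_path: str              # I (hypothetical file path)
    question: str                # Q
    explicit: ExplicitPackage    # E
    implicit: str = ""           # U, filled in later by the MLLM
    answer: str = ""             # predicted answer, filled in last


instance = KBVQAInstance(
    image_path="berries.jpg",
    question="Who used the fruit of this plant as food?",
    explicit=ExplicitPackage(
        passages=["(retrieved passage text)"],
        phrases=["Native Americans ate the fruit fresh"],
        mask=[1, 0, 0, 1],
    ),
)
kept_patches = sum(instance.explicit.mask)  # number of unmasked patches
```

Keeping $E$ and $U$ as separate fields mirrors the two-stage design: the explicit package is built first, and the implicit paragraph is generated from it afterwards.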

B. Explicit Knowledge Processing (The "Mask-and-Select" Mechanism)

This is the core innovation, designed to prune noise from both visual and textual modalities simultaneously using a unified mechanism based on a pre-trained Image-Text Matching (ITM) encoder.

Multimodal Retrieval: The system retrieves top-$k$ candidate passages ($T$) from a knowledge base using a multimodal retriever.
Visual Side: Knowledge-Guided Attention Mask ($M$)
- The model computes cross-attention weights between the retrieved text/question and image patches.
- It uses token-wise thresholding and adaptive token reweighting to identify which text tokens are most relevant to specific image regions.
- A binary mask is generated to suppress irrelevant image regions, forcing the model to focus only on visual evidence supported by the retrieved text.
Text Side: Question-Conditioned Phrase Selection ($k$)
- Using self-attention sensitivity signals, the model identifies which parts of the retrieved text ($T$) are most critical for answering the question.
- It selects top-$m$ knowledge tokens and merges them into readable, high-salience phrases, effectively pruning noisy or weakly relevant text fragments.
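The two filtering steps can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not the paper's implementation: the attention matrix, the token weights, the per-token sensitivity scores, and the thresholds `tau` and `keep` are all hypothetical inputs that a real system would obtain from the ITM encoder.

```python
from typing import List


def attention_mask(attn: List[List[float]], token_weights: List[float],
                   tau: float = 0.5, keep: float = 0.5) -> List[int]:
    """Binary patch mask from text-to-patch attention (illustrative rule).

    attn[t][p] is the attention of text token t on image patch p.
    """
    scores = [0.0] * len(attn[0])
    total = sum(token_weights)
    for row, weight in zip(attn, token_weights):
        cutoff = tau * max(row)              # token-wise threshold
        for p, value in enumerate(row):
            if value >= cutoff:
                scores[p] += weight / total  # reweighted vote for patch p
    return [1 if s >= keep else 0 for s in scores]


def select_phrases(tokens: List[str], scores: List[float], m: int) -> List[str]:
    """Keep the top-m tokens by score and merge adjacent ones into phrases."""
    top = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)[:m]
    phrases, current, prev = [], [], None
    for i in sorted(top):
        if prev is not None and i != prev + 1:  # gap: close the current phrase
            phrases.append(" ".join(current))
            current = []
        current.append(tokens[i])
        prev = i
    if current:
        phrases.append(" ".join(current))
    return phrases


# Toy inputs: 3 text tokens attending over 4 image patches.
attn = [[0.7, 0.2, 0.1, 0.0],
        [0.6, 0.25, 0.1, 0.05],
        [0.1, 0.1, 0.4, 0.4]]
token_weights = [0.5, 0.4, 0.1]
mask = attention_mask(attn, token_weights)

tokens = ["The", "shrub", "is", "pruned", "yearly",
          "Native", "Americans", "ate", "the", "fruit", "fresh"]
sensitivity = [0.01, 0.2, 0.01, 0.1, 0.05, 0.9, 0.95, 0.8, 0.3, 0.85, 0.4]
phrases = select_phrases(tokens, sensitivity, m=5)

print(mask)     # [1, 0, 0, 0]
print(phrases)  # ['Native Americans ate', 'fruit fresh']
```

In this toy run only the first patch survives the mask, because the two heavily weighted tokens both attend to it, and the pruning sentence is dropped from the text because none of its tokens reach the top five.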

C. Implicit Knowledge Processing

Once the explicit evidence is filtered and compacted:

A frozen Multimodal Large Language Model (MLLM) is prompted with the image, question, and the refined explicit package ($E$).
The MLLM generates a concise Implicit Knowledge Paragraph ($U$) (2–5 sentences). This paragraph acts as an intermediate representation that:
- Compresses long retrieved passages.
- Integrates grounded visual observations with textual knowledge.
- Activates the model's internal parametric knowledge (commonsense, reasoning priors) within a constrained semantic space.
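As a rough illustration of this prompting step, the function below assembles a text prompt that asks for the implicit paragraph. The wording of the template and the function name `build_implicit_prompt` are assumptions made for this sketch; the summary does not give the paper's actual prompt, and the masked image would be passed to the MLLM separately.

```python
from typing import List


def build_implicit_prompt(question: str, phrases: List[str],
                          min_sentences: int = 2, max_sentences: int = 5) -> str:
    """Assemble a hypothetical prompt requesting the implicit paragraph U."""
    evidence = "\n".join(f"- {phrase}" for phrase in phrases)
    return (
        f"Question: {question}\n"
        f"Selected evidence:\n{evidence}\n"
        "Using the visible image regions and the evidence above, write "
        f"{min_sentences}-{max_sentences} sentences of background knowledge "
        "that help answer the question. Do not answer the question yet."
    )


prompt = build_implicit_prompt(
    "Who used the fruit of this plant as food?",
    ["Native Americans ate the fruit fresh"],
)
print(prompt)
```

Restricting the prompt to the selected phrases, rather than the full retrieved passages, is what keeps the model's recall within the "constrained semantic space" described above.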

D. Final Answer Prediction

The final answer is predicted by querying the frozen MLLM with the full context: Image ($I$), Question ($Q$), Explicit Package ($E$), and Implicit Paragraph ($U$).
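The two passes through the frozen model can be sketched as follows. Everything here is a stand-in: `fake_mllm` is a canned stub rather than a real model, the prompt strings are invented for illustration, and the image is represented by a placeholder tag.

```python
from typing import Callable, List


def answer_with_two_passes(mllm: Callable[[str], str], question: str,
                           phrases: List[str],
                           image_tag: str = "<masked image>") -> str:
    """Query the same frozen model twice: first for U, then for the answer."""
    context = (f"Image: {image_tag}\n"
               f"Question: {question}\n"
               f"Evidence: {'; '.join(phrases)}\n")
    # Pass 1: generate the implicit knowledge paragraph U from I, Q, and E.
    implicit = mllm(context + "Write 2-5 sentences of relevant background knowledge.")
    # Pass 2: predict the answer from I, Q, E, and U together.
    return mllm(context + f"Background: {implicit}\nAnswer concisely.")


calls: List[str] = []


def fake_mllm(prompt: str) -> str:
    """Canned stub standing in for a frozen MLLM; records every prompt."""
    calls.append(prompt)
    if "Background:" in prompt:
        return "Native Americans"
    return "The plant bears edible red berries that were eaten fresh."


answer = answer_with_two_passes(
    fake_mllm,
    "Who used the fruit of this plant as food?",
    ["Native Americans ate the fruit fresh"],
)
print(answer)  # Native Americans
```

Because the model is only called, never updated, the same function works with any backbone that can be wrapped as a prompt-to-text callable.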

3. Key Contributions

MaS-VQA Framework: A novel architecture that tightly couples explicit knowledge filtering with implicit parametric reasoning, moving beyond simple concatenation of retrieved text.
Unified Mask-and-Select Mechanism: A method that performs fine-grained selection on both visual regions (via attention masking) and retrieved text (via phrase selection) simultaneously. This produces compact, high-signal multimodal representations that mitigate noise accumulation.
Complementary Co-Modeling: The approach effectively uses filtered explicit evidence to guide the activation of implicit knowledge, ensuring the model's internal reasoning is grounded in relevant external facts.
Comprehensive Validation: Extensive experiments and ablation studies demonstrating the efficacy of each component.

4. Experimental Results

The method was evaluated on two challenging benchmarks: Encyclopedic-VQA (E-VQA) and InfoSeek.

Performance Gains: MaS-VQA achieved state-of-the-art results across multiple MLLM backbones (InternVL3-8B, Qwen3-VL-8B).
- On Encyclopedic-VQA with Qwen3-VL-8B, accuracy rose from 19.5% in the zero-shot setting to 42.2% on the Single-Hop split and 41.3% on the All split.
- On InfoSeek, it achieved the best results on unseen questions and entities (e.g., 43.9% on Unseen-E), demonstrating strong generalization.
Ablation Studies:
- Removing the Attention Mask or Phrase Selection individually caused performance drops, confirming that both visual and textual filtering are necessary.
- Removing Implicit Knowledge (relying only on explicit grounding) resulted in lower accuracy, indicating that the model's internal knowledge helps fill gaps in the retrieved evidence.
- Retrieval Breadth: Performance peaked at $k=5$ retrieved passages; increasing to $k=7$ introduced noise that slightly degraded performance, validating the need for the selection mechanism.
Qualitative Analysis: Case studies showed that MaS-VQA successfully corrected errors made by zero-shot models and standard retrieval-augmented methods by focusing on the correct visual regions and filtering out distracting text.

5. Significance and Impact

Robustness to Noise: The framework addresses a fundamental limitation in KB-VQA: the difficulty current models have with noisy, heterogeneous inputs. By explicitly filtering both modalities, it creates a "clean" reasoning context.
Interpretability: The use of attention masks and selected phrases provides a transparent view of why the model made a decision (which image regions and which text snippets were used).
Efficiency: Unlike methods requiring extensive retraining, MaS-VQA operates primarily through inference-time selection and prompting, making it adaptable to various frozen MLLM backbones without additional training costs.
Applications: The approach is highly relevant for educational assistants, accessibility tools, and information-seeking systems where factual accuracy and the ability to filter irrelevant information are critical.

In conclusion, MaS-VQA demonstrates that selection is as important as retrieval. By intelligently pruning noise from both visual and textual inputs before reasoning, the model can more effectively leverage the combined power of external facts and internal knowledge.