CC-VQA: Conflict- and Correlation-Aware Method for Mitigating Knowledge Conflict in Knowledge-Based Visual Question Answering

The paper proposes CC-VQA, a training-free method that mitigates knowledge conflicts in Knowledge-Based Visual Question Answering by integrating vision-centric conflict reasoning with correlation-guided encoding and decoding to achieve state-of-the-art performance on multiple benchmarks.

Yuyang Hong, Jiaqi Gu, Yujin Lou, Lubin Fan, Qi Yang, Ying Wang, Kun Ding, Yue Wu, Shiming Xiang, Jieping Ye

Published 2026-03-02

The Big Problem: The "Know-It-All" Robot vs. The "Newspaper"

Imagine you have a brilliant robot assistant (a Vision Language Model) that has read almost every book ever written. It knows a lot about the world, but its knowledge is static, like a library book printed in 2020. It doesn't know what happened yesterday.

Now, imagine you show this robot a picture of a rare bird and ask, "What is this?" The robot guesses based on its old books. But you also hand it a fresh newspaper clipping (retrieved external knowledge) that says, "Actually, this is a new species discovered last week!"

Here is the conflict:

  1. The robot's internal memory says: "That's a Sparrow."
  2. The newspaper says: "That's a rare Blue Jay."

If the robot ignores the newspaper, it's wrong. If it blindly trusts the newspaper without checking the picture, it might be tricked by a fake article. This is the Knowledge Conflict problem. Current AI systems often get confused, ignore the new info, or get misled by bad info.

The Solution: CC-VQA (The Smart Detective)

The authors propose a new method called CC-VQA. Think of CC-VQA not as a robot that just reads, but as a Smart Detective who uses two special tools to solve the mystery.

Tool 1: The "Visual Reality Check" (Vision-Centric Contextual Conflict Reasoning)

Most AI methods try to solve this conflict just by reading the text. They argue: "The book says X, the newspaper says Y."

CC-VQA says: "Wait, let's look at the actual photo!"

  • The Analogy: Imagine a detective trying to solve a crime. Two witnesses give conflicting stories. Instead of just arguing about who is lying, the detective looks at the crime scene photo.
  • How it works:
    1. The AI asks itself: "What does my internal memory say this object looks like?"
    2. It asks: "What does the new article say it looks like?"
    3. Crucially, it looks at the actual image you provided.
    4. If the image shows a bird with a red beak, but the newspaper claims it's a blue bird with a yellow beak, the AI realizes the newspaper is likely wrong (or at least, doesn't match the visual evidence).
    5. It uses the visual features (colors, shapes, patterns) to decide which story makes sense.
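
The steps above can be sketched as a toy consistency check: score each candidate answer's attribute claims against what the image actually shows, and trust whichever source matches better. This is an illustration of the idea only, not the paper's implementation; the function names and the dict-based attribute representation are made up for this example (the real method reasons over visual features inside the VLM).

```python
# Toy "visual reality check": compare attribute claims from each knowledge
# source against attributes observed in the image. All names and the
# dict-based representation are illustrative assumptions, not the paper's API.

def visual_consistency(claimed: dict, observed: dict) -> float:
    """Fraction of claimed attributes that match the visual evidence."""
    if not claimed:
        return 0.0
    matches = sum(1 for k, v in claimed.items() if observed.get(k) == v)
    return matches / len(claimed)

def resolve_conflict(internal_claim: dict, retrieved_claim: dict, observed: dict) -> str:
    """Prefer whichever knowledge source better matches what the image shows."""
    s_int = visual_consistency(internal_claim["attributes"], observed)
    s_ret = visual_consistency(retrieved_claim["attributes"], observed)
    return internal_claim["answer"] if s_int >= s_ret else retrieved_claim["answer"]

# The image shows a bird with a red beak and brown feathers.
observed = {"beak_color": "red", "feathers": "brown"}
internal = {"answer": "Sparrow",
            "attributes": {"beak_color": "red", "feathers": "brown"}}
retrieved = {"answer": "Blue Jay",
             "attributes": {"beak_color": "yellow", "feathers": "blue"}}

print(resolve_conflict(internal, retrieved, observed))  # prints "Sparrow"
```

Here the retrieved text's claims contradict the picture, so the internal answer wins; if the newspaper's description had matched the image instead, the check would flip the other way.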

Tool 2: The "Noise Filter" (Correlation-Guided Encoding and Decoding)

Sometimes, the "newspaper" (retrieved context) is huge. It has 100 sentences, but only one sentence actually answers your question. The other 99 are just fluff, history, or irrelevant details. This "noise" confuses the AI.

  • The Analogy: Imagine you are trying to hear a friend's voice in a crowded, noisy room.
    • Old AI: Tries to listen to everyone in the room equally. It gets overwhelmed by the chatter.
    • CC-VQA: Puts on noise-canceling headphones that amplify only the friend's voice and mute everyone else.
  • How it works:
    1. Compression (The Mute Button): The AI calculates how "related" every single sentence in the article is to your specific question. If a sentence is only 10% related, it gets "squished" (compressed). It's still there, but the AI pays very little attention to it.
    2. Amplification (The Volume Knob): Sentences that are highly relevant get a "volume boost."
    3. Smart Guessing: When the AI generates the final answer, it uses a special scoring system. If a sentence is highly relevant but contradicts the image, the AI gets a "warning signal" and thinks twice before trusting it.
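
The compress/amplify idea can be illustrated with a minimal sketch: score each retrieved sentence against the question, then boost the relevant ones and squash the rest. This toy version uses a bag-of-words cosine similarity as the relevance score; the actual method works on the model's internal representations during encoding and decoding, so the scoring function, threshold, and scale factors here are all illustrative assumptions.

```python
# Hedged sketch of correlation-guided weighting: rate each sentence's
# relevance to the question, amplify relevant sentences, compress the rest.
import math
from collections import Counter

def cosine(a: str, b: str) -> float:
    """Bag-of-words cosine similarity between two strings (toy relevance score)."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    norm = (math.sqrt(sum(c * c for c in va.values()))
            * math.sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0

def weight_sentences(question, sentences, threshold=0.2, boost=2.0, squash=0.1):
    """Return (sentence, weight) pairs: the "volume knob" and "mute button"."""
    weighted = []
    for s in sentences:
        r = cosine(question, s)
        w = boost * r if r >= threshold else squash * r  # amplify vs. compress
        weighted.append((s, w))
    return weighted

question = "What color is the bird's beak?"
context = [
    "The bird's beak is bright red.",            # relevant -> amplified
    "The species was first described in 1847.",  # fluff -> compressed
]
for sent, w in weight_sentences(question, context):
    print(f"{w:.2f}  {sent}")
```

With this weighting in place, the final "smart guessing" step would be a decoding-time adjustment: tokens supported by a highly weighted sentence that also matches the image get a score boost, while highly weighted sentences that contradict the image trigger the "warning signal" and are down-weighted.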

Why is this a Big Deal?

  1. It's Training-Free: You don't need to spend months teaching the robot new tricks. It's like giving a smart person a better set of glasses and a better notebook, rather than re-educating them.
  2. It Wins: When tested on tough quizzes about animals, landmarks, and plants, CC-VQA got significantly more questions right than other methods. It reduced the number of times the AI got "tricked" by bad information.
  3. It's Visual First: It realizes that in a visual question, the picture is the ultimate truth-teller. If the text and the picture disagree, the picture usually wins.

Summary in One Sentence

CC-VQA is a smart system that solves AI confusion by using the actual image to fact-check conflicting text, while simultaneously filtering out irrelevant noise to focus only on the most important clues.
