CMRAG: Co-modality-based visual document retrieval and question answering

The paper proposes CMRAG, a framework that improves visual document retrieval and question answering by unifying text and image modalities in a shared embedding space and fusing their retrieval scores after normalization, outperforming existing single-modality approaches.

Wang Chen, Wenhan Yu, Guanqiang Qi, Weikang Li, Yang Li, Lei Sha, Deguo Xia, Jizhou Huang

Published 2026-03-09

Imagine you are trying to find a specific fact inside a massive, messy library. This library doesn't just have books with words; it has books filled with charts, photos, handwritten notes, and complex diagrams.

This is the problem the paper CMRAG tries to solve.

The Problem: The "One-Eyed" Librarians

Currently, there are two main types of "librarians" (AI systems) trying to help you find information in these documents:

  1. The Text-Only Librarian: This librarian is incredibly fast at reading words. If you ask, "What is the revenue in 2023?", they scan the text and find the answer instantly. But, if the answer is inside a complex chart or a photo of a graph, they are blind. They can't see the picture, so they miss the answer.
  2. The Image-Only Librarian: This librarian is great at looking at pictures. They can "see" the chart and the graph. But, they are terrible at reading the tiny, dense text inside the document. If the answer is hidden in a paragraph of text, they might miss it because they are too focused on the visual layout.

Both librarians are trying to solve the puzzle with only one eye open, which leads to mistakes.

The Solution: The "Bilingual" Librarian (CMRAG)

The authors of this paper built a new kind of librarian called CMRAG (Co-Modality RAG). Think of this librarian as a bilingual expert who can read text and see images perfectly at the same time.

Here is how they built this super-librarian, using a simple analogy:

1. The Universal Translator (Unified Encoding Model)

Imagine you have a text document and a photo of a document. In the old days, the computer treated these as two completely different languages (like English and French) that didn't mix well.

CMRAG introduces a Universal Translator. It takes the text, the image, and your question, and translates them all into a single, shared "secret language" (a mathematical space).

  • The Analogy: Imagine you have a map where "The Red House" (a text description) and "A picture of a red house" are marked at the exact same spot. Now, the computer can instantly see that they are the same thing, even though one is words and the other is a picture.
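The "same spot on the map" idea can be sketched in a few lines. This is a toy illustration, not the paper's actual encoder: the vectors below are made-up stand-ins for what a real co-modality embedding model would produce for the query, the page text, and the page image.

```python
import numpy as np

def l2_normalize(v: np.ndarray) -> np.ndarray:
    """Project a vector onto the unit sphere so a dot product equals cosine similarity."""
    return v / np.linalg.norm(v)

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two normalized vectors."""
    return float(np.dot(a, b))

# Toy embeddings standing in for a unified encoder's output.
# In a CMRAG-style system, one model maps the query, the text chunk,
# and the page image into the SAME space; these numbers are illustrative.
query      = l2_normalize(np.array([0.90, 0.10, 0.20]))  # "the red house"
text_chunk = l2_normalize(np.array([0.88, 0.12, 0.18]))  # text description of the house
page_image = l2_normalize(np.array([0.85, 0.15, 0.25]))  # photo of the house
unrelated  = l2_normalize(np.array([0.10, 0.90, 0.40]))  # off-topic page

# Both modalities of the same concept land near the query,
# so either representation can retrieve the right page.
print(cosine(query, text_chunk))  # high
print(cosine(query, page_image))  # high
print(cosine(query, unrelated))   # low
```

Because text and image end up in one space, a single nearest-neighbor search covers both modalities at once.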

2. The Fair Scorekeeper (Unified Retrieval)

When the librarian searches for an answer, they get two scores:

  • Score A: How well the text matches your question.
  • Score B: How well the image matches your question.

The problem? Text scores and image scores are measured on different scales (like comparing "miles" to "kilometers"). If you just add them up, one might accidentally dominate the other.

CMRAG uses a Fair Scorekeeper. Before adding the scores, they normalize them (like converting both miles and kilometers to meters). This ensures that the text and the image get a fair vote. The final decision is a balanced mix of what the text says and what the picture shows.
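The "miles to meters" conversion can be shown with min-max normalization, one common way to put two score scales on equal footing (the paper's exact normalization may differ; the scores below are made-up illustrative numbers, not results from the paper).

```python
import numpy as np

def min_max_normalize(scores: np.ndarray) -> np.ndarray:
    """Rescale scores to [0, 1] so different modalities become comparable."""
    lo, hi = scores.min(), scores.max()
    return (scores - lo) / (hi - lo) if hi > lo else np.zeros_like(scores)

# Raw retrieval scores for four candidate pages. The text retriever and
# the image retriever report on very different scales.
text_scores  = np.array([12.0, 30.0, 18.0, 25.0])
image_scores = np.array([0.62, 0.40, 0.71, 0.55])

# Naive sum: the text scale dwarfs the image scale, so the image's vote is lost.
naive = text_scores + image_scores

# Normalized fusion: each modality contributes on the same [0, 1] scale.
fused = min_max_normalize(text_scores) + min_max_normalize(image_scores)

print(int(np.argmax(naive)))  # → 1 (text alone decides the winner)
print(int(np.argmax(fused)))  # → 2 (a different page wins once the image gets a fair vote)
```

Notice that the winning page changes: page 2 scores modestly on text but best on the image, and only the normalized sum lets that visual evidence count.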

Why Does This Matter?

The paper tested this new librarian on difficult tasks, like reading financial reports, scientific papers, and slide decks.

  • The Result: CMRAG consistently beat the old "one-eyed" librarians.
  • The "Aha!" Moment: In one test case, the old image-only librarian looked at a chart and guessed the wrong number because it couldn't read the tiny labels. The text-only librarian missed the chart entirely. CMRAG, however, read the text and saw the chart, combining them to get the perfect answer.

The Big Takeaway

Beyond the framework itself, the authors also released a giant dataset of "Question + Text + Image" triplets to help other researchers build and benchmark better systems.

In short: If you want an AI to understand complex documents (like a doctor's report, a legal contract, or a scientific paper), you can't just ask it to "read" or just ask it to "look." You need it to do both simultaneously, treating text and images as partners rather than competitors. CMRAG is the framework that makes that partnership possible.