CMRAG: Co-modality-based visual document retrieval and question answering

The paper proposes CMRAG, a framework that improves visual document retrieval and question answering by unifying text and image modalities in a shared embedding space and fusing their retrieval scores after normalization, outperforming existing single-modality approaches.

Wang Chen, Wenhan Yu, Guanqiang Qi, Weikang Li, Yang Li, Lei Sha, Deguo Xia, Jizhou Huang

Published 2026-03-09

Imagine you are trying to find a specific fact inside a massive, messy library. This library doesn't just have books with words; it has books filled with charts, photos, handwritten notes, and complex diagrams.

This is the problem the paper CMRAG tries to solve.

The Problem: The "One-Eyed" Librarians

Currently, there are two main types of "librarians" (AI systems) trying to help you find information in these documents:

  1. The Text-Only Librarian: This librarian is incredibly fast at reading words. If you ask, "What is the revenue in 2023?", they scan the text and find the answer instantly. But, if the answer is inside a complex chart or a photo of a graph, they are blind. They can't see the picture, so they miss the answer.
  2. The Image-Only Librarian: This librarian is great at looking at pictures. They can "see" the chart and the graph. But, they are terrible at reading the tiny, dense text inside the document. If the answer is hidden in a paragraph of text, they might miss it because they are too focused on the visual layout.

Both librarians are trying to solve the puzzle with only one eye open, which leads to mistakes.

The Solution: The "Bilingual" Librarian (CMRAG)

The authors of this paper built a new kind of librarian called CMRAG (Co-Modality RAG). Think of this librarian as a bilingual expert who can read text and see images perfectly at the same time.

Here is how they built this super-librarian, using a simple analogy:

1. The Universal Translator (Unified Encoding Model)

Imagine you have a text document and a photo of a document. In the old days, the computer treated these as two completely different languages (like English and French) that didn't mix well.

CMRAG introduces a Universal Translator. It takes the text, the image, and your question, and translates them all into a single, shared "secret language" (a mathematical space).

  • The Analogy: Imagine you have a map where "The Red House" (a text description) and "A picture of a red house" are marked at the exact same spot. Now, the computer can instantly see that they are the same thing, even though one is words and the other is a picture.
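The "same spot on the map" idea can be sketched in a few lines. This is a toy illustration, not the paper's actual encoder: the vectors below are made-up stand-ins for what a real co-modality embedding model would produce for the query, the page text, and the page image.

```python
import numpy as np

def l2_normalize(v: np.ndarray) -> np.ndarray:
    """Project a vector onto the unit sphere so a dot product equals cosine similarity."""
    return v / np.linalg.norm(v)

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two normalized vectors."""
    return float(np.dot(a, b))

# Toy embeddings standing in for a unified encoder's output.
# In a CMRAG-style system, one model maps the query, the text chunk,
# and the page image into the SAME space; these numbers are illustrative.
query      = l2_normalize(np.array([0.90, 0.10, 0.20]))  # "the red house"
text_chunk = l2_normalize(np.array([0.88, 0.12, 0.18]))  # text description of the house
page_image = l2_normalize(np.array([0.85, 0.15, 0.25]))  # photo of the house
unrelated  = l2_normalize(np.array([0.10, 0.90, 0.40]))  # off-topic page

# Both modalities of the same concept land near the query,
# so either representation can retrieve the right page.
print(cosine(query, text_chunk))  # high
print(cosine(query, page_image))  # high
print(cosine(query, unrelated))   # low
```

Because text and image end up in one space, a single nearest-neighbor search covers both modalities at once.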

2. The Fair Scorekeeper (Unified Retrieval)

When the librarian searches for an answer, they get two scores:

  • Score A: How well the text matches your question.
  • Score B: How well the image matches your question.

The problem? Text scores and image scores are measured on different scales (like comparing "miles" to "kilometers"). If you just add them up, one might accidentally dominate the other.

CMRAG uses a Fair Scorekeeper. Before adding the scores, they normalize them (like converting both miles and kilometers to meters). This ensures that the text and the image get a fair vote. The final decision is a balanced mix of what the text says and what the picture shows.
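The "miles to meters" conversion can be shown with min-max normalization, one common way to put two score scales on equal footing (the paper's exact normalization may differ; the scores below are made-up illustrative numbers, not results from the paper).

```python
import numpy as np

def min_max_normalize(scores: np.ndarray) -> np.ndarray:
    """Rescale scores to [0, 1] so different modalities become comparable."""
    lo, hi = scores.min(), scores.max()
    return (scores - lo) / (hi - lo) if hi > lo else np.zeros_like(scores)

# Raw retrieval scores for four candidate pages. The text retriever and
# the image retriever report on very different scales.
text_scores  = np.array([12.0, 30.0, 18.0, 25.0])
image_scores = np.array([0.62, 0.40, 0.71, 0.55])

# Naive sum: the text scale dwarfs the image scale, so the image's vote is lost.
naive = text_scores + image_scores

# Normalized fusion: each modality contributes on the same [0, 1] scale.
fused = min_max_normalize(text_scores) + min_max_normalize(image_scores)

print(int(np.argmax(naive)))  # → 1 (text alone decides the winner)
print(int(np.argmax(fused)))  # → 2 (a different page wins once the image gets a fair vote)
```

Notice that the winning page changes: page 2 scores modestly on text but best on the image, and only the normalized sum lets that visual evidence count.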

Why Does This Matter?

The paper tested this new librarian on difficult tasks, like reading financial reports, scientific papers, and slide decks.

  • The Result: CMRAG consistently beat the old "one-eyed" librarians.
  • The "Aha!" Moment: In one test case, the old image-only librarian looked at a chart and guessed the wrong number because it couldn't read the tiny labels. The text-only librarian missed the chart entirely. CMRAG, however, read the text and saw the chart, combining them to get the perfect answer.

The Big Takeaway

Beyond the framework itself, the authors also released a giant dataset of "Question + Text + Image" triplets to help other researchers build and benchmark better systems.

In short: If you want an AI to understand complex documents (like a doctor's report, a legal contract, or a scientific paper), you can't just ask it to "read" or just ask it to "look." You need it to do both simultaneously, treating text and images as partners rather than competitors. CMRAG is the framework that makes that partnership possible.