Retrieval or Representation? Reassessing Benchmark Gaps in Multilingual and Visually Rich RAG

This paper demonstrates that improvements in multilingual and visually rich retrieval-augmented generation benchmarks are primarily driven by better document representation and preprocessing rather than advanced retrieval mechanisms, urging the field to adopt decomposed evaluation metrics to accurately attribute progress.

Martin Asenov, Kenza Benkirane, Dan Goldwater, Aneiss Ghodsi

Published 2026-03-05

Here is an explanation of the paper "Retrieval or Representation? Reassessing Benchmark Gaps in Multilingual and Visually Rich RAG," translated into simple, everyday language with some creative analogies.

The Big Idea: It's Not the Detective, It's the Evidence

Imagine you are a detective (the AI Retrieval System) trying to solve a mystery. You have a massive library of case files (the Documents), and you need to find the specific page that holds the answer to a question.

For a long time, the tech world believed that "old-school" detectives (using simple keyword matching like BM25) were terrible at solving complex cases involving foreign languages or documents full of charts and graphs. They thought these old detectives needed to be replaced by super-intelligent, high-tech detectives (using Multimodal AI) that could "see" the pictures and understand the context.

This paper says: "Wait a minute. We might be blaming the wrong thing."

The authors argue that the old detectives weren't actually bad at finding clues; they were just being handed bad evidence. The problem wasn't the detective's skill; it was that the evidence they were given was messy, incomplete, or written in a language the detective couldn't read.


The Two Main Problems

The paper looks at two specific types of "messy evidence":

1. The "Foreign Language" Problem (Multilingual)

Imagine you have a file written in Japanese or Arabic.

  • The Old Way: The AI tried to read the file, but its "scanner" (OCR) was bad at recognizing the characters. It turned a clear sentence into gibberish like #@$%&. Then, the old detective tried to find keywords in that gibberish and failed.
  • The New Finding: When the authors gave the old detective a better scanner that could perfectly read the foreign characters, the old detective suddenly became amazing. It didn't need a super-intelligent brain; it just needed to be able to read the text clearly.
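The OCR effect above can be sketched with a toy example. The scorer here is a bare-bones term-overlap count, a crude stand-in for BM25's term matching (not the paper's actual pipeline), and all the strings are invented for illustration:

```python
import re

def tokenize(text):
    # Lowercase and split on word characters, dropping punctuation.
    return re.findall(r"\w+", text.lower())

def keyword_hits(query, doc):
    # Crude term-overlap score: how many query words appear in the doc.
    # A stand-in for BM25's term matching, not a real BM25 implementation.
    return len(set(tokenize(query)) & set(tokenize(doc)))

query = "annual budget summary"

bad_scan  = "annu@l budg#t summ@ry for 2024"   # garbled OCR output
good_scan = "annual budget summary for 2024"   # clean extraction

print(keyword_hits(query, bad_scan))   # 0: nothing matches the gibberish
print(keyword_hits(query, good_scan))  # 3: every query term is found
```

Same "detective," same document, same question; only the quality of the extracted text changes, and the match count goes from zero to perfect.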

2. The "Chart and Graph" Problem (Visually Rich)

Imagine a document with a pie chart showing "75% of sales come from Product A."

  • The Old Way: The scanner looked at the chart and just saw a circle. It didn't know what the numbers meant. It told the detective, "There is a circle here." The detective looked for the word "Product A" in the text, couldn't find it (because it was inside the circle), and gave up.
  • The New Finding: When the authors added a step where a smart assistant described the chart in plain English (e.g., "A pie chart showing 75% is Product A"), the old detective could suddenly find the answer. The detective didn't need to learn how to "see" the chart; it just needed someone to tell it what the chart said.
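The captioning fix works the same way: appending a text description of the chart puts the missing keywords where a text-only matcher can see them. This is a minimal sketch with an invented page and caption, again using raw term overlap as a stand-in for BM25:

```python
import re

def keyword_hits(query, doc):
    # Crude term-overlap score (BM25 stand-in, for illustration only).
    q = set(re.findall(r"\w+", query.lower()))
    d = set(re.findall(r"\w+", doc.lower()))
    return len(q & d)

page_text = "Annual report. See the figure for the full revenue breakdown."
# Hypothetical output of an image-captioning step:
caption = "Pie chart: 75 percent of sales come from Product A."

query = "share of sales from Product A"

print(keyword_hits(query, page_text))                  # 0: the answer is locked inside the image
print(keyword_hits(query, page_text + " " + caption))  # 5: the caption supplies the keywords
```

The retriever never learned to "see"; the caption simply made the chart's content visible as text.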

The Experiment: The "Fixed Detective" Test

To prove their point, the researchers ran a controlled experiment. They kept the "detective" (the retrieval algorithm) exactly the same and only changed the "evidence preparation" (the scanning and describing).

The results were striking:

  • By simply using better scanners and adding descriptions for charts, the old, simple method (BM25) jumped from being a terrible detective to being nearly as good as the fancy, expensive AI models.
  • In some cases, the simple method closed most of the gap to the "super-detectives" just by cleaning up the data.
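The shape of this controlled experiment can be sketched in a few lines: hold the retriever fixed and swap only the representation. The ranker below is a crude term-overlap stand-in for BM25, and both corpora are invented; only the representation of document 0 differs between them:

```python
import re

def tokenize(text):
    return re.findall(r"\w+", text.lower())

def retrieve(query, corpus):
    # The "detective" is held fixed: a crude term-overlap ranker
    # (a stand-in for BM25). Returns the index of the best document.
    scores = [len(set(tokenize(query)) & set(tokenize(doc))) for doc in corpus]
    return max(range(len(corpus)), key=scores.__getitem__)

query = "quarterly revenue growth"

# Same underlying documents, two representation pipelines:
raw_ocr   = ["qu@rt#rly rev#nue gr&w",             # garbled scan of the right doc
             "memo about growth of the parking lot"]
clean_ocr = ["quarterly revenue grew ten percent",  # clean extraction of the same doc
             "memo about growth of the parking lot"]

print(retrieve(query, raw_ocr))    # 1: the distractor wins over the gibberish
print(retrieve(query, clean_ocr))  # 0: with clean text, the right doc wins
```

The retrieval algorithm never changed; fixing the "evidence preparation" alone flips the outcome.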

The Analogy: The Librarian vs. The Translator

Think of the Retrieval System as a Librarian who knows exactly where every book is on the shelf.

  • The Problem: The books are written in a language the Librarian doesn't speak, and some pages have pictures instead of words.
  • The Misconception: People thought the Librarian was stupid and needed to be replaced by a Polyglot Artist (the Multimodal AI) who could speak every language and interpret art.
  • The Reality: The Librarian was actually very smart! They just needed a Translator (better OCR) to read the foreign text and a Describer (image captioning) to explain the pictures. Once the Translator and Describer did their job, the original Librarian could find the right book instantly.

Why Does This Matter?

  1. Stop Wasting Money: We don't necessarily need to build massive, expensive AI models for every task. Sometimes, we just need to fix the "plumbing" (the text extraction and cleaning).
  2. Better Benchmarks: The paper argues that when we test AI systems, we shouldn't just say "Model A is better than Model B." We need to ask: "Did Model A just get better text to read, or is it actually smarter at finding things?"
  3. Focus on the Basics: Before we build the next generation of "thinking" AI, we should make sure the basic text extraction is perfect. A brilliant brain is useless if it's fed garbage data.

The Bottom Line

The paper's question is right there in its title: is the benchmark gap caused by Retrieval, or by Representation?

The authors are saying: "You think you need a fancy new Multimodal AI to handle complex documents? No! If you just fix the text extraction and add descriptions for images, your old, simple system will perform just as well."

The gap wasn't in the Retrieval (finding the needle); the gap was in the Representation (making sure the needle was visible in the haystack).