Imagine you are hiring a new employee to work as a detective. Your goal is to see if they can solve crimes by combining clues from two sources: a crime scene photo (the image) and a witness statement (the text).
For years, researchers have been creating "tests" (benchmarks) to see how good these AI detectives are. They keep making harder and harder tests, hoping to find the perfect candidate. But this paper argues that the tests are broken, and the "detectives" are actually cheating.
Here is the breakdown of the paper's findings using simple analogies:
1. The "Cheat Sheet" Problem (Intra-modality Dependencies)
The authors discovered that most AI models don't actually need to look at both the photo and the text to get the right answer. They are like students who memorize the answer key instead of studying the lesson.
- The Text Cheat: Sometimes, the AI ignores the picture entirely. If the question asks, "What color is the sky?" the AI confidently answers "Blue" from the words alone, even if the picture shows a red sunset. It's like a student guessing "Blue" on a multiple-choice test because they know it's the most common answer, without ever looking at the diagram.
- The Image Cheat: Other times, the AI ignores the question. If the picture shows a giraffe, and the options are "A) Giraffe, B) Car, C) Tree," the AI picks "Giraffe" just because it sees the animal, even if the question was "What is the giraffe eating?" (and the answer is "Leaves").
The paper calls this Intra-modality dependency: the model relies on just one source of information (either the text OR the image) rather than combining them. The sketch below shows one simple way such a shortcut can be detected.
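To make that concrete, here is a minimal Python sketch, assuming a hypothetical model.answer(image, question, options) interface and a dataset of multiple-choice examples; none of these names come from the paper. The idea is simply to re-score the model after blanking out one clue at a time:

```python
# Hypothetical sketch, not the paper's actual code: detect intra-modality
# shortcuts by ablating one input at a time and re-scoring the model.

def shortcut_report(model, dataset, blank_image, blank_question=""):
    """Compare accuracy with full inputs vs. one modality removed."""
    full = text_only = image_only = 0
    for ex in dataset:  # each ex has .image, .question, .options, .answer
        # Normal setting: both clues available.
        if model.answer(ex.image, ex.question, ex.options) == ex.answer:
            full += 1
        # "Text cheat" check: replace the photo with a blank image.
        if model.answer(blank_image, ex.question, ex.options) == ex.answer:
            text_only += 1
        # "Image cheat" check: drop the question entirely.
        if model.answer(ex.image, blank_question, ex.options) == ex.answer:
            image_only += 1
    n = len(dataset)
    return {"full": full / n, "text_only": text_only / n, "image_only": image_only / n}

# If text_only (or image_only) accuracy is close to full accuracy, the model
# never needed the other clue: the test is solvable with one modality.
```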
2. The "Cat and Mouse" Game
The history of these AI tests is a game of "Cat and Mouse."
- The Mouse (The AI): Figures out a shortcut. "Oh, I can just guess based on the question words!"
- The Cat (The Researchers): "Aha! We caught you cheating!" They create a new test designed to stop that specific shortcut.
- The Mouse (The AI): "Okay, I'll try a different shortcut. Now I'll just guess based on the picture!"
The paper argues that researchers have been so focused on stopping the "text cheating" that they accidentally created tests where the AI just "image cheats" instead. They traded one bad habit for another, never actually testing whether the AI can think by combining both.
3. The "Swiss Army Knife" vs. The "Specialized Tool"
The researchers evaluated 23 different benchmarks using various AI models. They found that these tests are not all measuring the same thing.
- Some tests are like Swiss Army Knives: They require you to use both the blade and the screwdriver (Image + Text) to solve the problem. These are rare.
- Most tests are like Specialized Tools: They only require a hammer (Image) or only a screwdriver (Text).
The paper created a "Spectrum" (a map) to show where each test falls. They found that many tests intended to be "hard" and "multi-modal" are actually just easy "single-modal" tests in disguise.
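As a rough illustration of what such a spectrum could look like in code (the formula, names, and numbers here are invented for this explainer, not taken from the paper), each benchmark gets two coordinates: how much of the model's above-chance accuracy survives without the image, and how much survives without the question:

```python
# Hypothetical sketch: turn the ablation scores into coordinates on a
# modality-dependency spectrum. The formula is illustrative only.

def spectrum_position(scores, random_baseline=0.25):
    """How much of the model's above-chance skill survives each ablation?"""
    headroom = max(scores["full"] - random_baseline, 1e-9)
    text_dep = max(scores["text_only"] - random_baseline, 0.0) / headroom
    image_dep = max(scores["image_only"] - random_baseline, 0.0) / headroom
    return {"text_dependency": text_dep, "image_dependency": image_dep}

# Toy numbers, invented for illustration (0.25 = chance on a 4-way choice).
benchmark_scores = {
    "benchmark_A": {"full": 0.90, "text_only": 0.85, "image_only": 0.30},
    "benchmark_B": {"full": 0.80, "text_only": 0.30, "image_only": 0.35},
}

for name, scores in benchmark_scores.items():
    print(name, spectrum_position(scores))
# benchmark_A: text_dependency ~0.92 -> a "screwdriver" test in disguise.
# benchmark_B: both dependencies low -> closer to a true "Swiss Army Knife".
```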
4. Bigger Isn't Better
You might think, "If we make the AI smarter (bigger models), it will stop cheating and learn to combine clues."
The paper says: Nope.
Making the AI bigger (from 8 billion to 34 billion parameters) didn't fix the cheating. In fact, the bigger models got better at cheating: they became even more effective at ignoring the picture or the question and still landing on the right answer through "gut feeling" (which is really just memorized patterns).
5. The "Distraction" Failure
The paper shows examples where the AI fails spectacularly because it's too focused on one thing.
- Example: A picture shows a mint plant. The question asks, "What is the temperature of the air?" The AI sees the mint, thinks "mint is cool," and answers "Cold." It fell for the trick: it followed a word association instead of noticing that the image says nothing about temperature.
- Example: The question asks about a specific country on a map, but the AI just picks the option that looks like a country name because it's good at reading words, not geography.
The Big Takeaway
The paper concludes that we are stuck in a loop. We keep building new tests, but we aren't measuring what we think we are measuring.
- The Problem: We are giving AI a multiple-choice test where the answer is often hidden in just one of the clues.
- The Solution: We need to stop looking only at the final score (e.g., "90% accuracy"). Instead, we need to look at how the model got that score. Did it use the picture? Did it use the text? Or did it just guess? (The sketch below shows what that kind of diagnosis could look like.)
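A toy sketch of that kind of diagnosis, reusing the three accuracies from the earlier shortcut report; the thresholds and wording are arbitrary choices made for illustration, not anything prescribed by the paper:

```python
# Hypothetical sketch: report a diagnosis, not just a leaderboard number.
# Thresholds are arbitrary illustrative choices.

def diagnose(scores, random_baseline=0.25, slack=0.05):
    """Explain how the accuracy was earned, using the ablation scores."""
    if scores["full"] <= random_baseline + slack:
        return "guessing: barely above chance"
    if scores["text_only"] >= scores["full"] - slack:
        return "text shortcut: the picture was never needed"
    if scores["image_only"] >= scores["full"] - slack:
        return "image shortcut: the question was never needed"
    return "multi-modal: both clues contributed"

print(diagnose({"full": 0.90, "text_only": 0.88, "image_only": 0.30}))
# -> "text shortcut: the picture was never needed"
```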
In short: We need to stop testing whether the AI can "guess the answer key" and start testing whether it can actually "read the room" by combining what it sees with what it reads. Until we do that, we aren't really measuring "intelligence"; we're just measuring "pattern matching."