PRISMM-Bench: A Benchmark of Peer-Review Grounded Multimodal Inconsistencies

PRISMM-Bench is the first benchmark grounded in real, peer-review-flagged inconsistencies across text, figures, tables, and equations in scientific papers. It is designed to rigorously evaluate, and expose the limitations of, current Large Multimodal Models in detecting, correcting, and reasoning about multimodal scientific errors.

Lukas Selch, Yufang Hou, M. Jehanzeb Mirza, Sivan Doveh, James Glass, Rogerio Feris, Wei Lin

Published 2026-02-17

Imagine you are a senior editor at a prestigious science magazine. Your job is to check the final drafts of research papers before they go to print. You notice something strange: sometimes the text says one thing, but the pictures or charts show something completely different. Maybe the text says a machine learning model is 99% accurate, but the graph shows it failing half the time. Or the text describes a bridge, but the diagram shows a boat.

These are multimodal inconsistencies. They are subtle, confusing, and dangerous because they undermine trust in science.

This paper introduces a new tool called PRISMM-Bench (like a prism that breaks light into colors, it breaks complex papers down into their inconsistencies). Here is the story of what the authors did, explained simply:

1. The Problem: The "Trust but Verify" Gap

Scientists are starting to use Large Multimodal Models (LMMs)—super-smart AI assistants that can read text and look at pictures—to help write and check papers. The hope is that these AIs can catch errors humans miss.

But here's the catch: can these AIs actually spot the subtle mismatches between words and images?

Current tests for these AIs are like giving them a math quiz with obvious errors (e.g., "2 + 2 = 5"). Real scientific papers are harder: the errors are like a typo in a contract or a map that doesn't match the terrain. Existing tests didn't capture this real-world messiness.

2. The Solution: Mining the "Trash" for Gold

The researchers needed a way to test these AIs on real mistakes. So, they went digging through the OpenReview database, which is where scientists submit papers and get peer reviews.

  • The Analogy: Imagine a massive library where every book has a "review card" attached. Most cards say, "Great book!" But some cards say, "Hey, the author says the hero is tall in Chapter 1, but the picture in Chapter 5 shows a dwarf."
  • The Process: The team used a computer to scan thousands of these review cards to find the "tall vs. dwarf" complaints. They then had human experts verify them.
  • The Result: They built a dataset of 384 real inconsistencies from 353 scientific papers. This is their "PRISMM-Bench." It's not made up; it's the real stuff reviewers actually flagged.
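The mining step above can be sketched in a few lines. This is a simplified illustration, not the paper's actual pipeline (which also involved model-assisted filtering and expert verification); the cue phrases and review format are assumptions:

```python
import re

# Hypothetical cue phrases that often signal a text-vs-figure mismatch
# in a peer review. Illustrative only -- not the benchmark's real filter.
INCONSISTENCY_CUES = [
    r"does not match (the )?(figure|table|equation)",
    r"contradicts (the )?(figure|table|text)",
    r"(figure|table) \d+ (shows|reports) .* but the text",
    r"inconsisten(t|cy)",
]

def flag_review(comment: str) -> bool:
    """Return True if a review comment looks like it reports a mismatch."""
    lowered = comment.lower()
    return any(re.search(pattern, lowered) for pattern in INCONSISTENCY_CUES)

reviews = [
    "Great paper, well written and easy to follow.",
    "Table 2 reports 99% accuracy, but the text claims 85% -- inconsistent.",
]
flagged = [r for r in reviews if flag_review(r)]  # keeps only the complaint
```

In the real pipeline, anything flagged this way still goes to human experts, since keyword matching alone would produce many false positives.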

3. The Test: Three Levels of Difficulty

They didn't just ask the AIs, "Is there a mistake?" They made them play three different games:

  1. The Detective (Identification): "Here is a text and a picture. What is the lie?"
  2. The Doctor (Remedy): "Okay, you found the lie. How do we fix it? Do we change the text or redraw the picture?"
  3. The Matchmaker (Pairing): "Here is a text description. Which of these four pictures contradicts it?"
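One way to see how a single flagged inconsistency can power all three games is to sketch it as a data structure. The field names here are assumptions for illustration, not the benchmark's actual schema:

```python
from dataclasses import dataclass

@dataclass
class Inconsistency:
    """One reviewer-flagged mismatch between a paper's text and visuals."""
    paper_id: str
    claim: str     # what the text says
    evidence: str  # what the figure/table actually shows

def identification_question(item: Inconsistency) -> str:
    # The Detective: spot the lie.
    return f"What mismatch exists between the text and visuals of {item.paper_id}?"

def remedy_question(item: Inconsistency) -> str:
    # The Doctor: decide which side to fix.
    return f"Should we edit the text ('{item.claim}') or the visual ('{item.evidence}')?"

def pairing_question(item: Inconsistency) -> str:
    # The Matchmaker: match the claim to the contradicting figure.
    return f"Which of these figures contradicts the claim: '{item.claim}'?"

item = Inconsistency("paper_042", "accuracy is 99%", "plot peaks near 55%")
```

The same underlying mismatch is reused three ways, which keeps the three tasks directly comparable.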

4. The Sneaky Trick: The "Multiple Choice" Trap

The researchers noticed a funny problem. When they gave the AIs multiple-choice questions (A, B, C, D), the AIs got really good scores. But when they took away the pictures and just showed the text, the AIs still got high scores!

  • The Metaphor: It's like a student taking a test who doesn't read the question. They just look at the answers and say, "Option C is always the longest, so I'll pick C." Or, "Option A sounds fancy, so it must be right." The AI was cheating by guessing patterns in the language, not actually looking at the pictures.
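A standard way to expose this kind of cheating is a "blind" baseline: answer every question without looking at the paper at all, using only a surface heuristic. This toy sketch (my construction, not the paper's exact probe) picks the longest option; if such a baseline scores well above chance, the answer options are leaking information:

```python
def longest_option_baseline(questions):
    """Answer blind: always guess the longest option. Returns accuracy."""
    correct = 0
    for options, answer_idx in questions:
        guess = max(range(len(options)), key=lambda i: len(options[i]))
        correct += (guess == answer_idx)
    return correct / len(questions)

# Toy set deliberately biased so the correct option is the wordiest one.
qs = [
    (["Red.",
      "The figure clearly depicts a red car, contradicting the text.",
      "Blue.", "None."], 1),
    (["Yes.", "No.",
      "The table reports 99% while the plot peaks near 50%.",
      "Maybe."], 2),
]
rate = longest_option_baseline(qs)  # 1.0 on this biased toy set
```

On a well-constructed benchmark, this baseline should score near chance (25% for four options); scoring far above that is the red flag the authors found.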

The Fix: To stop this cheating, they changed the format. Instead of long, fancy sentences for the answers, they forced the AI to fill out a JSON form (a structured data box).

  • Old Way: "The figure shows a red car, but the text says blue."
  • New Way: {"Attribute": "Color", "Claim": "Blue", "Evidence": "Red"}

This stripped away the "fancy words" and forced the AI to actually look at the data to fill in the boxes. It was like taking away the student's cheat sheet.
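The structured-answer idea can be sketched as parsing and field-by-field comparison. The keys follow the example above, but the exact schema and scoring are assumptions for illustration:

```python
import json

def parse_answer(raw: str) -> dict:
    """Parse a model's JSON answer and require the structured fields."""
    answer = json.loads(raw)
    required = {"Attribute", "Claim", "Evidence"}
    missing = required - answer.keys()
    if missing:
        raise ValueError(f"answer missing fields: {missing}")
    return answer

def matches(pred: dict, gold: dict) -> bool:
    """Compare field by field (case-insensitive), not by sentence style."""
    return all(pred[k].lower() == gold[k].lower() for k in gold)

gold = {"Attribute": "Color", "Claim": "Blue", "Evidence": "Red"}
pred = parse_answer('{"Attribute": "Color", "Claim": "blue", "Evidence": "red"}')
ok = matches(pred, gold)
```

Because every answer is reduced to the same bare fields, there are no "fancy words" or tell-tale sentence lengths left to pattern-match on; the model has to get the content right.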

5. The Results: The AI is Still a Rookie

They tested 21 of the smartest AI models in the world (including big names like GPT-5 and Gemini).

  • The Score: Even the best models only got about 54% correct.
  • The Reality Check: A human expert would get much higher scores. The AI models are still struggling to connect the dots between a sentence and a chart. They often miss the subtle clues or get distracted by the sheer volume of text.
  • The "Reasoning" Boost: They found that models that were told to "think step-by-step" (like a human solving a puzzle out loud) did much better than those that just guessed.

The Big Takeaway

This paper is a wake-up call. We want AI to be our scientific assistants, helping us write and check papers. But right now, if you ask an AI to check a paper for contradictions between text and images, it's not reliable yet. It's like hiring a proofreader who sometimes misses typos because they are too busy guessing the answer based on the font size.

PRISMM-Bench is the new "driving test" for these AI assistants. It proves that while they are getting smarter, they still have a long way to go before they can be trusted to keep science honest.
