MJ1: Multimodal Judgment via Grounded Verification

The paper introduces MJ1, a 3B-parameter multimodal judge trained with reinforcement learning. Using a structured grounded-verification chain and counterfactual consistency rewards, it grounds its decisions in visual evidence, achieving state-of-the-art accuracy on MMRB2 and outperforming significantly larger models.

Bhavesh Kumar, Dylan Feng, Leonard Tang

Published 2026-03-10

Imagine you are a judge in a talent show. Two contestants, Alice and Bob, have just performed a magic trick based on a specific request from the audience. Your job is to decide who did it better.

In the world of Artificial Intelligence, this is called Multimodal Judgment. The "contestants" are AI-generated images or text, and the "judge" is another AI model trying to figure out which one is better.

The Problem: The "Distracted Judge"

The paper explains that current AI judges are terrible at this job, not because they aren't smart, but because they are distracted.

Think of an AI judge like a student taking a very long test.

  1. The Beginning: At the start of the test, the student looks at the pictures (the visual evidence) very carefully.
  2. The Middle: As they write their long essay explaining their thoughts, their eyes start to wander. They stop looking at the pictures.
  3. The End: By the time they write their final score, they have completely forgotten what the pictures looked like. Instead, they just guess based on how the text sounds or which answer appeared first.

This is called "Attention Decay." The AI stops "seeing" the images and starts "hallucinating" or guessing based on text patterns.

The Solution: MJ1 (The "Grounded" Judge)

The authors created a new judge called MJ1. Instead of letting the AI write a long essay and then guess a score, they forced it to follow a strict, step-by-step recipe.

Here is how MJ1 works, using a Detective Analogy:

1. The "Crime Scene" Photo (Visual Observation)

Before the detective (the AI) even looks at the suspects' stories, they must first write down exactly what they see in the crime scene photos.

  • Old Way: The detective reads the suspects' stories and then tries to remember the photos.
  • MJ1 Way: The detective must describe the photos first, while their memory is fresh. "I see a red car, a broken window, and a blue hat."

2. The "Alibi" Check (Claim Extraction & Verification)

Next, the detective reads the suspects' stories (the AI responses).

  • Suspect A says: "I was wearing a blue hat."
  • Suspect B says: "I was wearing a green hat."
  • The Verification Step: The detective goes back to their "Crime Scene Photo" notes. "Wait, the photo shows a blue hat. Suspect A is telling the truth. Suspect B is lying."

This forces the AI to constantly check its reasoning against the actual image, preventing it from just making things up.
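The three steps above can be sketched in code. This is a toy illustration of the control flow, not MJ1's actual implementation: the real judge is a trained 3B multimodal model, whereas here each step is a plain function over sets of string "facts" so the observe-first, verify-later ordering is visible.

```python
def observe(image_facts):
    """Step 1: write down what the image shows, BEFORE reading any response."""
    return set(image_facts)

def extract_claims(response):
    """Step 2: pull the checkable claims out of a candidate response."""
    return set(response)

def verify(claims, observation):
    """Step 3: count how many claims match the recorded visual evidence."""
    return sum(1 for claim in claims if claim in observation)

def grounded_judge(image_facts, response_a, response_b):
    notes = observe(image_facts)  # look at the "crime scene" first
    score_a = verify(extract_claims(response_a), notes)
    score_b = verify(extract_claims(response_b), notes)
    return "A" if score_a >= score_b else "B"

# The detective example: the photo shows a blue hat, so Suspect A wins.
photo = ["red car", "broken window", "blue hat"]
print(grounded_judge(photo, ["blue hat"], ["green hat"]))  # prints A
```

The key design point is that `observe` runs before either response is read, so the judge's notes about the image cannot be contaminated by the candidates' claims.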

3. The "Swap Test" (Counterfactual Consistency)

This is the cleverest part. To make sure the judge isn't biased (e.g., always picking the first person they see), the AI is trained with a special trick:

  • Imagine the judge picks Alice as the winner.
  • The trainer then swaps the two contestants' positions. Now Bob is in the first spot and Alice is in the second.
  • If the judge is fair, they should still pick Alice: her performance has not changed, even though she is now in the second spot.
  • If the judge switches to Bob just because Bob now occupies the first spot, they fail the test.
  • This teaches the AI to care about what the images show, not where they are sitting.
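The swap test can also be sketched in a few lines. Again, this is an illustrative stand-in rather than MJ1's training code: the "judges" below are simple Python callables, and the test rewards a judge only if the same underlying candidate wins both before and after the positions are swapped.

```python
def swap_test(judge, cand_a, cand_b):
    """Return True (reward) if the judge's verdict follows the content,
    not the position, when the two candidates trade places."""
    first = judge(cand_a, cand_b)   # original order; 1 means "first slot wins"
    second = judge(cand_b, cand_a)  # same content, positions swapped
    winner_original = cand_a if first == 1 else cand_b
    winner_swapped = cand_b if second == 1 else cand_a
    return winner_original == winner_swapped

# A position-biased judge always picks whoever sits in the first slot.
biased = lambda x, y: 1
# A content-based toy judge picks by a property of the answer itself
# (here, simply its length).
grounded = lambda x, y: 1 if len(x) >= len(y) else 2

print(swap_test(biased, "alice", "bob"))    # False: fails the swap test
print(swap_test(grounded, "alice", "bob"))  # True: verdict tracks content
```

In training, passing the swap test would translate into a consistency reward, so the model is pushed toward verdicts that depend on what the images show rather than on slot order.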

The Results: Small Brain, Big Smarts

Usually, to get better at a hard job, you need a bigger brain (more computer power). But MJ1 is a "small brain" (only 3 billion active parameters) that beat "giants" like Google's Gemini-3-Pro and GPT-5.

Why?
Because MJ1 doesn't try to be a genius; it tries to be organized. By forcing itself to look at the pictures first, check its facts during the reasoning, and ignore who is sitting in the first chair, it became a much better judge than models that are 100 times larger but just "guess" based on text.

Summary

  • The Problem: AI judges forget the images by the time they give a score.
  • The Fix: Force the AI to describe the images before it starts arguing.
  • The Secret Sauce: A "Swap Test" to ensure the AI isn't just picking the first answer it sees.
  • The Outcome: A small, efficient AI that is smarter and fairer than massive, expensive models.

It's the difference between a student who memorizes the answer key (big models) and a student who actually reads the textbook and checks their work (MJ1).