MJ1: Multimodal Judgment via Grounded Verification

The paper introduces MJ1, a 3B-parameter multimodal judge trained with reinforcement learning. Using a structured grounded-verification chain and counterfactual consistency rewards, it grounds its decisions in visual evidence, achieving state-of-the-art accuracy on MMRB2 and outperforming significantly larger models.

Bhavesh Kumar, Dylan Feng, Leonard Tang

Published 2026-03-10

Imagine you are a judge in a talent show. Two contestants, Alice and Bob, have just performed a magic trick based on a specific request from the audience. Your job is to decide who did it better.

In the world of Artificial Intelligence, this is called Multimodal Judgment. The "contestants" are AI-generated images or text, and the "judge" is another AI model trying to figure out which one is better.

The Problem: The "Distracted Judge"

The paper explains that current AI judges are terrible at this job, not because they aren't smart, but because they are distracted.

Think of an AI judge like a student taking a very long test.

  1. The Beginning: At the start of the test, the student looks at the pictures (the visual evidence) very carefully.
  2. The Middle: As they write their long essay explaining their thoughts, their eyes start to wander. They stop looking at the pictures.
  3. The End: By the time they write their final score, they have completely forgotten what the pictures looked like. Instead, they just guess based on how the text sounds or which answer appeared first.

This is called "Attention Decay." The AI stops "seeing" the images and starts "hallucinating" or guessing based on text patterns.

The Solution: MJ1 (The "Grounded" Judge)

The authors created a new judge called MJ1. Instead of letting the AI write a long essay and then guess a score, they forced it to follow a strict, step-by-step recipe.

Here is how MJ1 works, using a Detective Analogy:

1. The "Crime Scene" Photo (Visual Observation)

Before the detective (the AI) even looks at the suspects' stories, they must first write down exactly what they see in the crime scene photos.

  • Old Way: The detective reads the suspects' stories and then tries to remember the photos.
  • MJ1 Way: The detective must describe the photos first, while their memory is fresh. "I see a red car, a broken window, and a blue hat."

2. The "Alibi" Check (Claim Extraction & Verification)

Next, the detective reads the suspects' stories (the AI responses).

  • Suspect A says: "I was wearing a blue hat."
  • Suspect B says: "I was wearing a green hat."
  • The Verification Step: The detective goes back to their "Crime Scene Photo" notes. "Wait, the photo shows a blue hat. Suspect A is telling the truth. Suspect B is lying."

This forces the AI to constantly check its reasoning against the actual image, preventing it from just making things up.
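The three steps above can be sketched in code. This is a toy illustration of the control flow, not MJ1's actual implementation: the real judge is a trained 3B multimodal model, whereas here each step is a plain function over sets of string "facts" so the observe-first, verify-later ordering is visible.

```python
def observe(image_facts):
    """Step 1: write down what the image shows, BEFORE reading any response."""
    return set(image_facts)

def extract_claims(response):
    """Step 2: pull the checkable claims out of a candidate response."""
    return set(response)

def verify(claims, observation):
    """Step 3: count how many claims match the recorded visual evidence."""
    return sum(1 for claim in claims if claim in observation)

def grounded_judge(image_facts, response_a, response_b):
    notes = observe(image_facts)  # look at the "crime scene" first
    score_a = verify(extract_claims(response_a), notes)
    score_b = verify(extract_claims(response_b), notes)
    return "A" if score_a >= score_b else "B"

# The detective example: the photo shows a blue hat, so Suspect A wins.
photo = ["red car", "broken window", "blue hat"]
print(grounded_judge(photo, ["blue hat"], ["green hat"]))  # prints A
```

The key design point is that `observe` runs before either response is read, so the judge's notes about the image cannot be contaminated by the candidates' claims.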

3. The "Swap Test" (Counterfactual Consistency)

This is the cleverest part. To make sure the judge isn't biased (e.g., always picking the first person they see), the AI is trained with a special trick:

  • Imagine the judge picks Alice as the winner.
  • The trainer then swaps the two contestants' positions. Now Bob is in the first spot and Alice is in the second.
  • If the judge is fair, they should still pick Alice: her performance has not changed, even though she is now in the second spot.
  • If the judge switches to Bob just because Bob now occupies the first spot, they fail the test.
  • This teaches the AI to care about what the images show, not where they are sitting.
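The swap test can also be sketched in a few lines. Again, this is an illustrative stand-in rather than MJ1's training code: the "judges" below are simple Python callables, and the test rewards a judge only if the same underlying candidate wins both before and after the positions are swapped.

```python
def swap_test(judge, cand_a, cand_b):
    """Return True (reward) if the judge's verdict follows the content,
    not the position, when the two candidates trade places."""
    first = judge(cand_a, cand_b)   # original order; 1 means "first slot wins"
    second = judge(cand_b, cand_a)  # same content, positions swapped
    winner_original = cand_a if first == 1 else cand_b
    winner_swapped = cand_b if second == 1 else cand_a
    return winner_original == winner_swapped

# A position-biased judge always picks whoever sits in the first slot.
biased = lambda x, y: 1
# A content-based toy judge picks by a property of the answer itself
# (here, simply its length).
grounded = lambda x, y: 1 if len(x) >= len(y) else 2

print(swap_test(biased, "alice", "bob"))    # False: fails the swap test
print(swap_test(grounded, "alice", "bob"))  # True: verdict tracks content
```

In training, passing the swap test would translate into a consistency reward, so the model is pushed toward verdicts that depend on what the images show rather than on slot order.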

The Results: Small Brain, Big Smarts

Usually, to get better at a hard job, you need a bigger brain (more computer power). But MJ1 is a "small brain" (only 3 billion active parameters) that beat "giants" like Google's Gemini-3-Pro and GPT-5.

Why?
Because MJ1 doesn't try to be a genius; it tries to be organized. By forcing itself to look at the pictures first, check its facts during the reasoning, and ignore who is sitting in the first chair, it became a much better judge than models that are 100 times larger but just "guess" based on text.

Summary

  • The Problem: AI judges forget the images by the time they give a score.
  • The Fix: Force the AI to describe the images before it starts arguing.
  • The Secret Sauce: A "Swap Test" to ensure the AI isn't just picking the first answer it sees.
  • The Outcome: A small, efficient AI that is smarter and fairer than massive, expensive models.

It's the difference between a student who memorizes the answer key (big models) and a student who actually reads the textbook and checks their work (MJ1).