Imagine you are trying to teach a group of very smart, well-read robots (called Multimodal Large Language Models, or MLLMs) how to recognize objects in photos. You show them a picture of a dog and ask, "What is this?"
For a long time, researchers thought these robots were terrible at this specific task compared to older, specialized "vision-only" robots. But this new paper argues that the robots weren't actually failing; the test itself was broken.
Here is the story of how the authors fixed the test and what they discovered, explained with some everyday analogies.
1. The Broken Ruler: Why the Tests Were Skewing the Results
Imagine you are taking a math test, but the answer key is full of typos.
- The "Ground Truth" Problem: The standard dataset used for these tests (ImageNet) is like a massive library of photos, but many of the labels are wrong. Some photos have two dogs and one cat, but the label only says "dog." Some photos are blurry or ambiguous.
- The Result: When the smart robots tried to answer, they were often right, but the test marked them wrong because the "correct" answer in the book was actually a mistake.
- The Fix: The authors went through 625 categories of images and re-labeled them carefully (creating ReGT). It's like hiring a team of expert editors to fix all the typos in the answer key.
- The Surprise: Once they fixed the answer key, the robots' scores jumped up dramatically (by up to 10%). The gap between the "smart robots" and the "specialized vision robots" almost disappeared. It turns out the robots weren't dumb; they were just being graded on a broken test.
2. The Three Ways to Ask the Question
The paper also looked at how we ask the robots to classify images. They tested three different "game modes":
Mode A: The Open-World (The Free-Form Essay)
- The Setup: You show a picture and say, "Tell me what you see." The robot writes a sentence like, "I see a golden retriever playing in the park."
- The Problem: How do you grade an essay? You have to match "golden retriever" to the list of 1,000 allowed answers.
- The Discovery: The authors found that if you use a smart "translator" (embedding space) to match the robot's sentence to the closest allowed answer, the robots actually do better here than in other modes. Previous studies failed because they used a clumsy "search and replace" method that missed the nuance.
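To make the "translator" idea concrete, here is a minimal sketch of matching a free-form answer to the closest allowed label. A real pipeline would use a proper sentence-embedding model; the character-trigram vectors and the `match_to_label` helper below are stand-ins invented for illustration, not the paper's actual method.

```python
from collections import Counter
import math

def embed(text):
    # Stand-in "embedding": character-trigram counts. A real system would
    # use a learned sentence-embedding model here; this toy version only
    # illustrates the matching step.
    t = f"  {text.lower()}  "
    return Counter(t[i:i + 3] for i in range(len(t) - 2))

def cosine(a, b):
    # Cosine similarity between two sparse count vectors.
    dot = sum(a[k] * b[k] for k in a if k in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def match_to_label(free_form_answer, allowed_labels):
    # Map the model's free-form sentence to the closest allowed label.
    answer_vec = embed(free_form_answer)
    return max(allowed_labels, key=lambda lbl: cosine(answer_vec, embed(lbl)))

labels = ["golden retriever", "labrador retriever", "tabby cat", "toaster"]
print(match_to_label("I see a golden retriever playing in the park", labels))
# → "golden retriever"
```

The key design point is that the whole sentence is compared against every allowed answer in a similarity space, rather than searching for an exact string, which is why it tolerates extra words like "playing in the park."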
Mode B: Multiple Choice (The Quiz Show)
- The Setup: You show a picture and ask, "Is it a cat, a dog, a car, or a toaster?"
- The Problem: In many past tests, the wrong answers (distractors) were implausible. A question like "Is this a cat or a toaster?" is trivially easy for a smart robot.
- The Discovery: When the authors made the wrong answers harder (e.g., "Is this a Golden Retriever or a Labrador?"), the robots' scores dropped significantly. This proves that previous studies were inflating the robots' abilities by giving them easy quizzes.
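The idea of "making the wrong answers harder" can be sketched as picking the candidate labels most similar to the true one. The `hard_distractors` helper below is illustrative only; it uses string similarity as a crude stand-in for the semantic or visual similarity a real benchmark would use.

```python
import difflib

def hard_distractors(true_label, all_labels, k=3):
    # Pick the k class names most similar to the true label, so the
    # multiple-choice options are confusable rather than trivial.
    # String similarity is a stand-in for real semantic similarity.
    others = [lbl for lbl in all_labels if lbl != true_label]
    return sorted(
        others,
        key=lambda lbl: difflib.SequenceMatcher(None, true_label, lbl).ratio(),
        reverse=True,
    )[:k]

classes = ["golden retriever", "labrador retriever", "flat-coated retriever",
           "tabby cat", "toaster", "sports car"]
print(hard_distractors("golden retriever", classes))
```

With distractors chosen this way, the quiz forces a fine-grained decision ("which retriever?") instead of an obvious one ("animal or appliance?").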
Mode C: Closed-World (The Strict List)
- The Setup: You give the robot a list of all 1,000 possible answers and say, "Pick exactly one from this list."
- The Problem: Sometimes the robot gets confused and says something not on the list (like "a puppy" when the list only has "dog"). In the past, this was counted as a failure.
- The Fix: The authors introduced CW+. If the robot says "puppy," the system automatically maps it to the closest valid answer on the list ("dog") instead of just marking it wrong. This fixed a major source of "false failures."
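A minimal sketch of the CW+ idea: accept exact matches, and map any off-list answer to the closest valid label instead of scoring it as a failure. The `cw_plus` name and the use of `difflib` string matching are my own illustrative choices; string similarity handles surface variants (extra words, typos), while mapping true synonyms like "puppy" to "dog" would need the embedding-based matching described earlier.

```python
import difflib

def cw_plus(prediction, label_list):
    # CW+-style scoring sketch: exact matches pass through; off-list
    # answers are mapped to the closest valid label rather than being
    # marked wrong outright.
    pred = prediction.strip().lower()
    if pred in label_list:
        return pred
    # cutoff=0.0 so we always return the single best candidate.
    close = difflib.get_close_matches(pred, label_list, n=1, cutoff=0.0)
    return close[0] if close else None

labels = ["golden retriever", "labrador retriever", "tabby cat"]
print(cw_plus("a golden retriever dog", labels))  # → "golden retriever"
```

The point is that the scoring layer absorbs harmless phrasing differences, so only genuine misclassifications count against the model.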
3. The "Batch" Effect: Why Order Matters
Imagine you are a teacher grading a stack of 10 exams.
- If the first exam you grade shows a cat, and you are tired or distracted, you might subconsciously assume the next 9 exams show cats too, even if they don't.
- The paper found that when robots process images in batches (groups), they sometimes get "stuck" on the first image's label and apply it to the rest of the group.
- The Lesson: To get a fair score, you must shuffle the images randomly so the robot doesn't get "lazy" and guess the same answer for everything in the batch.
4. The Robots as Teaching Assistants
Finally, the authors asked: Can these robots help humans?
- They took the images where the robots disagreed with the human experts.
- They showed these tricky images to a second team of human annotators, along with the robot's guess.
- The Result: In about 50% of the difficult cases, the humans agreed with the robot and changed their own answer.
- The Metaphor: Think of the robot not as the final judge, but as a super-attentive intern. It spots mistakes the human supervisors missed. If you use the robot to flag potential errors, you can curate much better datasets.
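The "attentive intern" workflow boils down to a simple filter: collect the items where the model disagrees with the stored label and send only those for human review. This is a sketch of that workflow with hypothetical data structures, not the paper's actual tooling.

```python
def review_queue(dataset_labels, model_labels):
    # Flag every item where the model's prediction disagrees with the
    # dataset's current label, so human annotators re-check only the
    # disputed cases rather than the whole dataset.
    flagged = []
    for item, human in dataset_labels.items():
        pred = model_labels.get(item)
        if pred is not None and pred != human:
            flagged.append((item, human, pred))
    return flagged

dataset = {"img1": "dog", "img2": "cat", "img3": "toaster"}
model = {"img1": "dog", "img2": "dog", "img3": "sports car"}
print(review_queue(dataset, model))
```

If, as the paper reports, humans side with the model on roughly half of these flagged items, each pass through the queue measurably cleans the answer key.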
The Big Takeaway
This paper is a wake-up call for the AI community.
- Don't trust the old scores: Many MLLMs were unfairly rated as "bad at classification" because the test data was noisy and the evaluation methods were flawed.
- Fix the data first: Before blaming the model, check if your "answer key" is correct.
- Be careful with the test format: How you ask the question (Open vs. Closed vs. Multiple Choice) changes the score more than the model's actual intelligence.
In short: The robots are smarter than we thought, but we were asking them the wrong questions and grading them with a broken ruler. Once we fixed the ruler, they turned out to be far better students than their report cards suggested.