The Big Question: Who is the Better Classifier?
Imagine you have two types of experts trying to identify objects in a photo:
- The "Specialist" (CLIP/VLM): Think of this as a librarian who has memorized a specific list of book titles. If you show them a picture of a cat, they instantly check their mental list: "Is it a dog? No. Is it a cat? Yes!" They are incredibly fast and accurate if the answer is on their list. However, if you ask them about something obscure or not on their list, they get stuck.
- The "Generalist" (LMM): Think of this as a creative storyteller. They can describe a picture in rich detail, tell a story about it, and answer complex questions. But when it comes to simple classification (just naming the object), they often ramble, guess too broadly, or get confused.
The Old Belief: Researchers used to think the Librarian (Specialist) was always better at naming things, and the Storyteller (Generalist) was too messy for simple tasks.
The New Discovery: This paper argues that the Storyteller is actually a hidden genius, but only if you give them the right context.
Part 1: The Power of "Context" (The Study Buddy)
The paper introduces a concept called In-Context Learning (ICL).
- The Analogy: Imagine you are taking a test.
- Zero-Shot (No Context): You walk into the exam room alone. You have to guess the answer based on your training.
- In-Context (With Examples): You are allowed to sit next to three other students who have already solved similar problems. You can look at their work to understand how to solve the current problem.
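The exam analogy maps directly onto how a few-shot prompt is assembled. Below is a minimal sketch, assuming a purely text-based prompt with `<image_N>` placeholder tokens; real LMM APIs interleave actual image data with the text, so this only shows the structure:

```python
# Sketch: assembling a few-shot in-context prompt for an LMM classifier.
# "<image_N>" tokens are placeholders; real APIs pass image tensors/bytes.

def build_icl_prompt(examples, query_token):
    """Turn (image_token, label) demonstration pairs into a prompt string."""
    lines = ["Classify each image with a single category name."]
    for img_token, label in examples:
        lines.append(f"{img_token} -> {label}")
    lines.append(f"{query_token} -> ")  # the model completes this last line
    return "\n".join(lines)

demos = [("<image_1>", "cat"), ("<image_2>", "dog"), ("<image_3>", "horse")]
prompt = build_icl_prompt(demos, query_token="<image_4>")
print(prompt)
```

The demonstrations are the "students whose work you can look at": they show the model the expected answer format (one category name) before it sees the query image.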
The Finding:
When the Storyteller (LMM) is given a few examples of "This is a cat, this is a dog" right before the test, they suddenly become just as good as, or even better than, the Librarian (CLIP). The examples act as a "cheat sheet" that helps the Storyteller focus and stop rambling.
Part 2: The Open-World Problem (The "What is this?" Mystery)
The Librarian (CLIP) has a major flaw: they can only pick from a pre-written list. Show them a picture of a Golden Retriever and, if the list only says "Dog," the best they can give is the coarse answer "Dog." And if "Golden Retriever" isn't on the list at all, they're forced to pick whatever wrong label comes closest.
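This constraint can be shown concretely. The sketch below mimics CLIP-style zero-shot classification, with random unit vectors standing in for what CLIP's real image and text encoders would produce; the point is that an argmax over a fixed list can never return an unlisted label:

```python
# Sketch of CLIP-style zero-shot classification: the prediction is forced
# to be the nearest label from a fixed, pre-written list. Embeddings here
# are random stand-ins for CLIP's actual image/text encoder outputs.
import numpy as np

rng = np.random.default_rng(0)

labels = ["cat", "dog", "car"]  # the Librarian's fixed list
text_embs = rng.normal(size=(len(labels), 512))
text_embs /= np.linalg.norm(text_embs, axis=1, keepdims=True)

# An image whose embedding sits close to the "dog" text embedding.
image_emb = text_embs[1] + 0.1 * rng.normal(size=512)
image_emb /= np.linalg.norm(image_emb)

scores = text_embs @ image_emb               # cosine similarities
prediction = labels[int(np.argmax(scores))]
print(prediction)  # "golden retriever" can never appear: it isn't listed
```

However fine-grained the image actually is, the output vocabulary is capped at whatever the list contains.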
The Storyteller (LMM) is great at the "Open World" because they can just say, "That looks like a Golden Retriever!" without needing a pre-defined list.
The Problem:
When the Storyteller tries to do this alone, they often hallucinate. They might say, "It's a dog, a puppy, a pet, a furry friend, and maybe a golden retriever?" It's too vague.
The Solution: CIRCLE
The authors created a method called CIRCLE, which iteratively refines the in-context learning examples the model builds for itself.
- The Analogy: Imagine the Storyteller is trying to solve a mystery, but they don't have a reference book.
- Step 1: They look at a pile of mystery photos and make a guess at what each one is (Pseudo-labeling).
- Step 2 (The Magic): They take those guesses and say, "Okay, if this photo is a 'boat,' then that photo must be a 'ferry,' not just a 'vehicle'." They use the group of photos to correct each other.
- Step 3: They repeat this process, refining their guesses until the whole group agrees on a consistent, precise description.
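The three steps above can be sketched as a loop. Everything here is a toy stand-in: `lmm_classify` and the `REFINEMENTS` table are hypothetical (the real method queries an actual LMM to re-label each image given the others' current labels); only the iterate-until-consistent structure is the point:

```python
# Heavily simplified sketch of CIRCLE's loop: pseudo-label a pool of
# unlabeled images, then re-classify each one using the group's current
# labels as context, repeating until the labels stop changing.

# Toy refinement table: what a "second look" upgrades a coarse label to.
REFINEMENTS = {"vehicle": "boat", "boat": "ferry"}

def lmm_classify(image, context):
    """Toy stand-in for an LMM call: refine a label one step when context
    (the other images' current labels) is available."""
    label = context.get(image, "vehicle")  # coarse guess without context
    return REFINEMENTS.get(label, label) if context else label

def circle(images, rounds=3):
    # Step 1: initial pseudo-labels, no context.
    labels = {img: lmm_classify(img, {}) for img in images}
    for _ in range(rounds):
        # Step 2: re-label each image using the group's current labels.
        new = {img: lmm_classify(img, labels) for img in images}
        if new == labels:  # Step 3: stop once the group agrees.
            break
        labels = new
    return labels

print(circle(["photo_a", "photo_b"]))  # coarse "vehicle" refines to "ferry"
```

Each pass sharpens the guesses ("vehicle" becomes "boat" becomes "ferry"), and the loop terminates once a round produces no changes, i.e. once the descriptions are mutually consistent.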
The Result:
By letting the Storyteller "teach itself" using the context of the other images, CIRCLE turns the messy Storyteller into a precise detective. In the "Open World" (where there is no fixed list of answers), this method consistently beats the Librarian.
Summary of the "Aha!" Moments
- Don't judge a book by its cover (or a model by its zero-shot score): Large Multimodal Models (LMMs) aren't bad at classification; they just need a little help (examples) to get started.
- Context is King: Giving an LMM a few examples (In-Context Learning) makes it perform miracles, often surpassing the specialized models designed just for classification.
- Self-Correction is Key: In the messy, open world where there are no answer keys, the best way to get the right answer is to let the model look at the whole group of images and refine its own guesses until they make sense together. This is what CIRCLE does.
The Takeaway
The paper suggests that in the future, we might not need two different types of AI (one for talking, one for classifying). We might just need one Super-Generalist (the LMM) that, when given a few examples and a chance to "think" about the context, can do it all: naming objects, describing scenes, and solving complex visual puzzles.