Imagine you have a giant photo album with millions of pictures, and you want to teach a computer to recognize what's in them. To do this, you need to write down a list of tags for every single photo (like "dog," "beach," "sunset").
Traditionally, you'd have to hire thousands of people to look at every photo and write these tags by hand. This is expensive, slow, and exhausting.
Enter Multimodal Large Language Models (MLLMs). Think of these as super-smart AI robots that can "see" pictures and "read" text. The big question this paper asks is: "Can we just let the AI do the tagging instead of humans?"
The authors, led by Ming-Kun Xie, say: "Yes, but not just by asking the AI a simple question." If you ask an AI, "What's in this picture?", it might get lazy, miss things, or invent things that aren't there.
So, they built a new system called TagLLM. Here is how it works, explained with some fun analogies:
1. The Problem: The "Lazy Librarian" vs. The "Strict Librarian"
The researchers tested two ways to ask the AI to tag photos:
- The "Open-Ended" Ask: "Tell me everything you see."
- Result: The AI acts like a chatty, slightly confused librarian. It might say, "I see a dog, a ball, and... maybe a spaceship?" It invents things (hallucinations) or uses the wrong words.
- The "Yes/No" Ask: "Is there a dog? Is there a ball? Is there a spaceship?"
- Result: The AI acts like a strict, boring librarian. It's very accurate when it says "Yes," but it's so cautious it misses a lot of things. If it's not 100% sure, it says "No," even if the dog is there.
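The two questioning styles can be sketched as prompt templates. This is a minimal illustration with made-up wording; `open_ended_prompt` and `closed_set_prompts` are hypothetical helpers, not the paper's actual prompts.

```python
# Two ways to ask an MLLM for tags (illustrative wording, not the paper's
# exact prompts). The trade-off: one broad question vs. many narrow ones.

def open_ended_prompt() -> str:
    # A single free-form question: high coverage, but the model may
    # hallucinate tags or use vocabulary you didn't expect.
    return "List every object, scene, and activity visible in this image."

def closed_set_prompts(vocabulary: list[str]) -> list[str]:
    # One yes/no question per label: precise wording, but it takes
    # len(vocabulary) queries and tends toward cautious "no" answers.
    return [f"Is there a '{label}' in this image? Answer yes or no."
            for label in vocabulary]

prompts = closed_set_prompts(["dog", "ball", "spaceship"])
```

The open-ended style trades precision for recall; the closed-set style does the opposite, which is exactly the tension TagLLM is built to resolve.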
The Finding: On its own, the AI's tagging is decent but imperfect, roughly 50% to 80% as accurate as human annotation. But here's the twist: if you train a recognition model on the AI's tags, that model reaches about 90% of the performance of one trained on human labels. Plus, the AI costs almost nothing compared to paying humans.
2. The Solution: The "TagLLM" Two-Step Dance
To get the best of both worlds, the authors created a two-step process called TagLLM. Think of it like a Sieve and a Polisher.
Step 1: The "Wide Net" (Candidate Generation)
Instead of asking the AI to check every single possible word in the dictionary (which would take forever), they use a strategy called "Divide and Conquer."
- The Analogy: Imagine you are looking for lost keys in a messy room. Instead of checking every single drawer one by one, you group the drawers: "Kitchen stuff," "Bedroom stuff," "Office stuff." You ask the AI, "Are the keys in the Kitchen group?"
- What happens: The AI quickly scans groups of related items (like "kitchen" items) and picks out a short list of possible tags. It casts a wide net to make sure it doesn't miss anything important. This creates a "shortlist" of candidates.
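The "wide net" step can be sketched as a two-level query loop. This is a sketch under assumptions: the label grouping, prompt wording, and the `toy_ask` stub (standing in for a real MLLM call) are all illustrative, not the paper's exact setup.

```python
# Divide-and-conquer candidate generation: ask about whole groups first,
# and drill into a group's labels only if the group itself gets a "yes".

def generate_candidates(groups: dict, ask) -> list:
    candidates = []
    for group_name, labels in groups.items():
        if ask(f"Does this image contain any {group_name}?"):
            # Relevant group: now check each member label individually.
            candidates += [lbl for lbl in labels
                           if ask(f"Is there a '{lbl}' in this image?")]
    return candidates

groups = {
    "animals": ["dog", "cat"],
    "toys": ["ball", "kite"],
    "vehicles": ["car", "bicycle"],
}

# Toy stand-in for an MLLM: pretends the image shows a dog with a ball,
# answering "yes" to true labels and to the groups that contain them.
present = {"dog", "ball"}
group_of = {lbl: g for g, lbls in groups.items() for lbl in lbls}

def toy_ask(question: str) -> bool:
    hits = present | {group_of[p] for p in present}
    return any(word in question for word in hits)

candidates = generate_candidates(groups, toy_ask)  # ["dog", "ball"]
```

With only six labels the grouping saves nothing, but with a vocabulary of thousands, every group that answers "no" (like "vehicles" above) is skipped in a single query instead of one query per label.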
Step 2: The "Truth Detective" (Label Disambiguation)
Now, the AI has a shortlist, but it might still be confused. Maybe the shortlist says "Apple," but the picture is actually a "Red Ball." The AI might be mixing up similar-looking things.
- The Analogy: This is like a detective interrogating a suspect. The AI asks itself: "Wait, is this really an Apple? Or is it a Red Ball? Let me look closer."
- The Trick: The system uses a second, even smarter AI (like a senior editor) to clarify the definitions. It tells the first AI, "When we say 'Apple,' we mean the fruit, not the phone brand. Make sure you aren't confusing it with a 'Tomato'."
- Result: The AI double-checks its shortlist, removes the fake items, and confirms the real ones.
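The disambiguation step amounts to re-asking each shortlisted label with a clarified definition attached. The sketch below is hypothetical: the definitions dictionary and the `toy_ask` stub are illustrative, whereas in the paper the clarifications come from a second LLM.

```python
# Label disambiguation: re-verify each candidate, but attach an explicit
# definition so the model can't confuse look-alike or same-name labels.

def disambiguate(candidates: list, definitions: dict, ask) -> list:
    confirmed = []
    for label in candidates:
        note = definitions.get(label, "")
        if ask(f"{note} With that in mind, is '{label}' really present?"):
            confirmed.append(label)
    return confirmed

definitions = {
    "apple": "'Apple' means the fruit, not the phone brand; "
             "do not confuse it with a red ball or a tomato.",
    "ball": "'Ball' means a round toy or sports ball.",
}

# Toy stand-in for an MLLM: the image actually shows a red ball, so only
# questions about the ball come back "yes".
def toy_ask(question: str) -> bool:
    return "really present" in question and "'ball'" in question

final = disambiguate(["apple", "ball"], definitions, toy_ask)  # ["ball"]
```

The shortlist shrinks to only the labels that survive a second, better-informed look, which is where the hallucinated "apple" gets dropped.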
3. The Results: Why This Matters
- Cost: This AI method runs at roughly 1/1000th the price of hiring human annotators. It's like buying a library card instead of hiring a librarian for every book.
- Quality: The final tags are so good that if you use them to train a new AI, that new AI performs almost exactly as well as if it had been trained by humans. In fact, for some tricky categories, the AI actually did a better job than tired human workers who might have made mistakes.
- Speed: The AI can churn through millions of images far faster than any human workforce could.
The Bottom Line
The paper proves that we don't need to choose between "Cheap but bad" and "Expensive but good." By using a smart, two-step process (First, cast a wide net; Second, double-check the details), we can get human-level quality tags at a fraction of the cost and time.
It's like upgrading from a manual assembly line to a smart factory: the robots do the heavy lifting, but they have a quality control manager checking their work to ensure everything is perfect.