Specificity-aware reinforcement learning for fine-grained open-world classification

Imagine you are at a bustling art gallery, and you're asked to describe a painting to a friend over the phone.

The Problem: The "Vague Artist"
Most modern AI image classifiers are like a very polite but slightly lazy artist. If you show them a picture of a Golden Retriever, they might say, "It's a dog." If you show them a Red Delicious Apple, they might say, "It's a fruit."

Technically, they are correct. A Golden Retriever is a dog. But they are too generic. They lack the "specificity" to tell you it's a Golden Retriever or a Red Delicious.

In the real world, this matters. If a doctor needs to know if a skin spot is a "mole" or a specific type of "melanoma," saying "it's a skin spot" isn't helpful. If a botanist needs to identify a rare flower, saying "it's a flower" is useless.

The challenge is: How do we get the AI to be more specific without making it guess and get things wrong?

If you just tell the AI, "Be specific!", it might panic and say, "It's a Golden Retriever named Sparky," when it's actually just a Labrador. It becomes specific but wrong.
If you let it be safe, it stays correct but vague.

The Solution: SpeciaRL (The "Smart Coach")
The authors of this paper created a new training method called SpeciaRL. Think of it as a smart coach training an athlete who already knows the sport but is afraid to take risks.

Here is how it works, using a simple analogy:

1. The "Group Try" (Rollouts)

Instead of asking the AI to guess the answer once, the coach asks it to generate 10 different guesses for the same picture.

Guess 1: "It's a bird." (Too vague)
Guess 2: "It's a sparrow." (Maybe right, maybe wrong)
Guess 3: "It's a White-throated Sparrow." (Very specific, but is it right?)
Guess 4: "It's a bird." (Safe)

2. The "Expert Judge" (The Verifier)

The AI doesn't know which guess is best. So, a super-smart "Judge" (another AI) looks at all 10 guesses and the actual correct answer (which the coach knows).
The Judge sorts the guesses into buckets:

Wrong: "It's a cat." (Discard!)
Generic: "It's a bird." (Okay, but boring.)
Specific: "It's a White-throated Sparrow." (Great!)

3. The "Dynamic Reward" (The Secret Sauce)

This is the most clever part. In the past, AI training was like a strict teacher who only gave a gold star if you got the exact right answer. If you were close but not perfect, you got nothing. This made the AI afraid to try hard.

SpeciaRL changes the rules:
The coach looks at the best guess the AI made in that group of 10.

Scenario A: The AI's best guess was just "Bird."
- The Reward: The coach says, "Good job! You were as specific as you could be. You get a gold star for answering 'Bird'." (The AI learns: Don't force it if I don't know.)
Scenario B: The AI's best guess was "White-throated Sparrow."
- The Reward: The coach says, "Great! You can be specific. Next time, don't settle for just 'Bird.' Aim for the sparrow!" (The AI learns: Push for the specific answer.)

Why This is a Big Deal

Most other methods try to force the AI to be specific by punishing it for being vague. This often makes the AI hallucinate (make things up) just to get the reward.

SpeciaRL is different because it respects the AI's limits.

It asks: "What is the maximum level of detail this AI can actually handle for this specific picture?"
It rewards the AI for reaching that limit, but never for guessing wildly beyond it.

The Result

The paper tested this on thousands of images (birds, cars, flowers, food).

Old AI: "It's a car." (Correct, but boring).
Forced AI: "It's a 1998 Ferrari F355 Challenge." (Specific, but often wrong).
SpeciaRL AI: "It's a 1998 Ferrari F355 Challenge." (Specific and correct).

In a Nutshell

SpeciaRL is a training technique that teaches AI to be confidently specific. It doesn't force the AI to guess; instead, it encourages the AI to dig deep and find the most detailed answer it knows is true, while gently stopping it from making things up when it's unsure. It's the difference between a student who memorizes a textbook and one who truly understands the material and can explain the fine details.

1. Problem Definition

The paper addresses the challenge of fine-grained open-world image classification. Unlike traditional closed-world classification where the label set is fixed, open-world classification requires models to identify objects from an unconstrained, potentially infinite semantic space (e.g., "What is the object in the image?").

The Core Conflict: Recent Large Multimodal Models (LMMs), particularly those with reasoning capabilities (e.g., Qwen2.5VL), exhibit strong visual understanding but tend to produce overly generic predictions (e.g., predicting "flower" instead of "daisy") in open-world settings.
The Trade-off: Naively encouraging models to be more specific (e.g., via prompting "be specific" or standard fine-tuning) often leads to a degradation in correctness, causing the model to hallucinate incorrect specific labels.
The Goal: Develop a method that steers reasoning LMMs toward predictions that are both correct and maximally specific without compromising accuracy.

2. Methodology: SpeciaRL

The authors propose SpeciaRL, a novel Specificity-aware Reinforcement Learning framework. The method leverages the observation that LMMs possess the intrinsic knowledge to be specific but fail to consistently sample the correct reasoning path.

A. Prediction Categorization & Evaluation

To quantify performance, the authors define a hierarchy of six mutually exclusive categories for the relationship between a prediction ($p$) and the ground truth ($y$):

Wrong (W): Incorrect concept.
Abstain (A): Refusal to answer.
Generic (G): Correct but significantly broader (e.g., "dog" vs. "Samoyed").
Less Specific (S-): Correct but a close parent category (e.g., "warbler" vs. "Golden-winged warbler").
Specific (S): Exact match or synonym.
More Specific (S+): A subtype of the ground truth.

Metrics:

Correctness: Percentage of non-Wrong predictions.
Specificity: Average normalized score based on the hierarchy depth of non-Wrong predictions.
Harmonic Mean (HM): The primary metric balancing correctness and specificity.
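
As a minimal sketch of how these metrics fit together, the Python below computes correctness from a list of category labels and combines two scores with a harmonic mean. The `Category` enum and the function names are illustrative assumptions, not taken from the paper's code. The specificity score is treated as a given input, since its exact normalization is not spelled out here.

```python
from enum import IntEnum
from typing import Sequence


class Category(IntEnum):
    """Prediction categories, ordered as in the list above (assumed ordering)."""
    WRONG = 0
    ABSTAIN = 1
    GENERIC = 2
    LESS_SPECIFIC = 3
    SPECIFIC = 4
    MORE_SPECIFIC = 5


def correctness(categories: Sequence[Category]) -> float:
    """Fraction of predictions that are not Wrong."""
    if not categories:
        return 0.0
    return sum(c != Category.WRONG for c in categories) / len(categories)


def harmonic_mean(correct: float, specific: float) -> float:
    """Harmonic mean of a correctness score and a specificity score."""
    if correct + specific == 0:
        return 0.0
    return 2 * correct * specific / (correct + specific)


# Hypothetical batch of four judged predictions.
batch = [Category.SPECIFIC, Category.GENERIC, Category.WRONG, Category.SPECIFIC]
print(correctness(batch))          # 3 of 4 predictions are not Wrong
print(harmonic_mean(0.848, 0.920)) # aggregate scores quoted later in this summary
```

The harmonic mean is low whenever either score is low, so a model cannot score well by being vague but safe, or specific but often wrong. Plugging in the aggregate correctness and specificity quoted in the results section gives roughly 0.88 under this assumption; the paper's reported HM may be aggregated differently (for example, per dataset).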

B. The Reward Mechanism: Dynamic & Sample-Aware

Standard Reinforcement Learning with Verifiable Rewards (RLVR) typically assigns a reward of 1 only if the prediction exactly matches the ground truth. This fails in open-world settings where a "Less Specific" answer might be the best the model can do for a difficult sample.

SpeciaRL's Innovation:
Instead of a static reward, SpeciaRL uses a dynamic, sample-wise reward based on the model's online potential:

Best-of-N (BoN) Estimation: During training, for each input image, the policy model generates $N$ rollouts (e.g., $N=10$).
Adaptive Reference ($c^*$): The system identifies the most informative correct category ($c_{best}$) achieved among the $N$ rollouts for that specific sample.
- If the best rollout is "Specific", the target reward threshold is "Specific".
- If the best rollout is only "Generic", the target threshold is lowered to "Generic".
Reward Assignment: A prediction receives a positive reward (1) if its category is at least as informative as the adaptive reference $c^*$.
- Logic: This prevents penalizing the model for being generic if the sample is inherently difficult for the model to resolve specifically, while still pushing it to be specific when it is capable of doing so.
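
The reward rule described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' implementation: the `Category` ordering follows the category list in this summary, the verifier's output is assumed to be available as one category per rollout, Wrong rollouts are assumed never to earn a reward, and a group with no correct rollout is assumed to receive all zeros.

```python
from enum import IntEnum
from typing import List, Sequence


class Category(IntEnum):
    """Verifier categories, least to most informative (assumed ordering)."""
    WRONG = 0
    ABSTAIN = 1
    GENERIC = 2
    LESS_SPECIFIC = 3
    SPECIFIC = 4
    MORE_SPECIFIC = 5


def dynamic_rewards(rollout_categories: Sequence[Category]) -> List[int]:
    """Binary rewards for one image's rollouts, relative to the best rollout."""
    correct = [c for c in rollout_categories if c != Category.WRONG]
    if not correct:
        # Assumption: no correct rollout means nothing to reinforce.
        return [0] * len(rollout_categories)
    reference = max(correct)  # adaptive reference: best correct category in the group
    return [
        int(c != Category.WRONG and c >= reference)
        for c in rollout_categories
    ]


# The model can reach "Specific" on this image, so generic answers earn nothing.
easy = [Category.GENERIC, Category.SPECIFIC, Category.WRONG, Category.GENERIC]
print(dynamic_rewards(easy))  # [0, 1, 0, 0]

# The model only reaches "Generic" here, so generic answers are rewarded.
hard = [Category.GENERIC, Category.WRONG, Category.GENERIC]
print(dynamic_rewards(hard))  # [1, 0, 1]
```

Because the reference in this sketch is the group maximum, "at least as informative" reduces to matching the best category found in the group. The same generic answer is rewarded on a hard image and unrewarded on an easy one, which is the sample-wise behavior the text describes.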

C. Optimization

The framework utilizes Group Relative Policy Optimization (GRPO), a popular RL algorithm for LLMs. The dynamic reward is integrated into the GRPO update loop, encouraging the policy to maximize the probability of generating predictions that meet the sample-specific specificity threshold while maintaining correctness.
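
To make the GRPO step concrete, here is a minimal sketch of the group-relative advantage that GRPO computes from the rewards of one image's rollouts: each reward is normalized by the group's mean and standard deviation. The function name, the use of the population standard deviation, and the `eps` value are assumptions for illustration. Real implementations differ in these details and add the clipped policy-gradient objective on top.

```python
import statistics
from typing import List, Sequence


def group_relative_advantages(
    rewards: Sequence[float], eps: float = 1e-6
) -> List[float]:
    """Normalize each rollout's reward against its own group's statistics."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumed choice
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical binary rewards for four rollouts of one image.
advantages = group_relative_advantages([0, 1, 0, 0])
print(advantages)  # the rewarded rollout gets a positive advantage, the rest negative

# If every rollout earns the same reward, all advantages are zero
# and the group contributes no learning signal.
print(group_relative_advantages([1, 1, 1]))
```

With a binary reward, only groups that contain both rewarded and unrewarded rollouts produce a gradient, so the choice of reward threshold per sample directly controls which rollouts are pushed up or down.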

3. Key Contributions

Problem Formulation: Identified and formalized the "specificity vs. correctness" trade-off in open-world fine-grained classification, demonstrating that existing methods (prompting, SFT, standard RFT) fail to balance this effectively.
Empirical Insight: Showed via "Best-of-N" analysis that reasoning LMMs possess the latent knowledge for fine-grained classification but suffer from sampling inefficiency and a bias toward generic outputs.
SpeciaRL Framework: Introduced a novel RL framework with a verifier-based, dynamic reward signal that adapts to the difficulty of individual samples, preventing correctness degradation.
State-of-the-Art Performance: Demonstrated that SpeciaRL achieves the best trade-off between specificity and correctness across diverse benchmarks, outperforming zero-shot reasoning LMMs and fine-tuned baselines.

4. Experimental Results

Datasets: Evaluated on Fine-grained (Flowers102, Food101, OxfordPets) and Very Fine-grained (StanfordCars, FGVCAircraft) datasets.
Training Setup: Models were trained on a subset of CUB-200-2011 (Birds) and tested on out-of-domain datasets to ensure generalization rather than memorization.
Performance:
- SpeciaRL achieved the highest Harmonic Mean (HM) across all benchmarks.
- On the Fine-grained set, SpeciaRL improved both specificity (0.920 vs. 0.742) and correctness (0.848 vs. 0.846) compared to the base Qwen2.5VL-7B.
- It significantly outperformed "Be specific" prompting (which increased specificity but hurt correctness) and standard Reinforcement Fine-Tuning (RFT).
Qualitative Analysis: Visualizations showed that SpeciaRL not only changed the final label but also improved the reasoning traces, forcing the model to utilize fine-grained visual evidence (e.g., specific wing patterns, car model details) to justify specific predictions.

5. Significance and Impact

Bridging the Gap: SpeciaRL addresses a critical limitation in deploying LMMs for real-world applications where generic answers are insufficient (e.g., medical diagnosis, wildlife monitoring, industrial inspection).
Efficient Knowledge Elicitation: It demonstrates that models do not need new knowledge injection (via massive SFT) to be specific; rather, they need optimization strategies to better access their existing latent knowledge.
Robustness: The method is robust across different RL algorithms (GRPO, Dr.GRPO, DAPO) and shows strong cross-domain generalization, suggesting it is a viable strategy for general open-world recognition tasks.
Open Source: The code and models are publicly available, facilitating further research in open-world classification and RL for vision-language models.

In conclusion, SpeciaRL represents a significant step forward in making Large Multimodal Models more reliable and precise for fine-grained tasks, offering a principled way to navigate the tension between being specific and being right.