Conformal Prediction for Long-Tailed Classification

Imagine you are an amateur botanist trying to identify a strange plant you found in your backyard. You snap a photo and upload it to an AI app.

The Problem:
The AI doesn't just guess one plant; it gives you a list of possibilities (a "prediction set"). This is helpful because the AI might be unsure.

If the list has 1,000 plants, it's useless. You'd spend all day checking them.
If the list has 1 plant, it's risky. If the AI is wrong, you miss the answer entirely.

Now, imagine the world of plants is long-tailed. This means there are a few super-common plants (like Dandelions) that the AI sees millions of times, and thousands of rare, endangered plants that the AI has only seen a handful of times.

The Current Dilemma:
Existing AI tools force you to choose between two bad options:

The "Safe but Useless" List: The AI guarantees it won't miss the rare plants, but to do so, it dumps every single plant in the world into your list. You can't use it.
The "Small but Dangerous" List: The AI gives you a short list of 2 or 3 plants. It's easy to check, but it almost always misses the rare, endangered species because it has seen so few examples of them.

The Solution: A New Way to Balance the Scale
The authors of this paper propose two clever tricks to get the best of both worlds: a short, manageable list that still catches the rare plants.

Trick #1: The "Popularity Discount" (Prevalence-Adjusted Softmax)

Imagine the AI is a judge at a talent show.

Old Way: The judge gives a high score to a famous pop star (common plant) and a low score to an unknown indie band (rare plant). The judge's list is dominated by the pop stars.
New Way (PAS): The judge realizes, "Wait, I've seen the pop star a million times, so I'm not impressed. But I've barely seen this indie band, so if they show up, they must be special!"
The Metaphor: The AI applies a "Popularity Discount." It lowers the score of common plants (because they are easy to guess) and boosts the score of rare plants (because they are hard to guess).
The Result: When the AI makes its list, it doesn't just pick the "most likely" plants; it picks the plants that are "surprisingly likely" given how rare they are. This keeps the list short but ensures the rare plants aren't ignored.

Trick #2: The "Dimmer Switch" (INTERP-Q)

Imagine you have two lights:

Light A (Standard): A narrow spotlight that only shines on the common plants.
Light B (Classwise): A floodlight that shines on everything, including the rare plants, but it's so bright it blinds you with too many options.

The New Method:
Instead of choosing one light, the authors built a dimmer switch.

You can slide the switch to mix the two lights.
Slide it a little toward the floodlight? You get a slightly longer list, but now the rare plants are visible.
Slide it back? The list gets shorter again.
The Benefit: You (the user) get to decide exactly how much "rare plant safety" you want versus how "short" you want your list to be. It's a smooth dial, not a binary on/off switch.

Why Does This Matter?

This isn't just about plants. It applies to:

Medicine: Finding a rare, aggressive cancer (the "rare plant") is more important than classifying a common cold (the "dandelion"). We don't want the AI to ignore the cancer just because it's rare.
AI Safety: If rare classes keep being left out of the lists that humans check and confirm, the AI gradually "forgets" them and gets worse over time (a phenomenon called "model collapse").

The Bottom Line

The paper teaches us how to build AI that doesn't just play it safe with the common stuff. By adjusting how the AI "sees" rarity, we can create prediction lists that are short enough to be useful but inclusive enough to catch the rare, important things we care about most. It's like having a flashlight that is bright enough to see the rare gems in the dark, without blinding you with the whole room.

1. Problem Statement

The paper addresses the challenge of applying Conformal Prediction (CP) to long-tailed classification problems, where class distributions are highly skewed (e.g., plant identification with thousands of rare species and few common ones).

The Goal: Construct prediction sets $C(X)$ that contain the true label $Y$ with high probability.
The Conflict: In long-tailed settings, there is a severe trade-off between set size and class-conditional coverage:
- Standard CP: Guarantees marginal coverage (coverage averaged over all examples, so common classes dominate) and produces small sets. However, it often has poor class-conditional coverage, systematically omitting rare classes.
- Classwise CP: Guarantees class-conditional coverage for every class. However, for rare classes with few calibration examples, the required thresholds become infinite or extremely large, resulting in prediction sets that are impractically large (e.g., containing hundreds of classes).
The Gap: Existing methods force a binary choice: small sets with poor rare-class coverage OR large sets with good coverage. The paper aims to find a "middle ground" that maintains reasonable set sizes while ensuring rare classes are not systematically omitted.
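
To make the contrast concrete, here is a minimal plain-Python sketch of how the two baselines compute their thresholds. It assumes the usual split-conformal quantile (the $\lceil (n+1)(1-\alpha) \rceil$-th smallest calibration score); the function names and the toy scores are illustrative assumptions, not taken from the paper.

```python
import math

def conformal_quantile(scores, alpha):
    """Return the ceil((n + 1) * (1 - alpha))-th smallest score.

    Returns +inf when that rank exceeds n, i.e. when there are too few
    calibration points to certify the requested coverage level.
    """
    n = len(scores)
    rank = math.ceil((n + 1) * (1 - alpha))
    if n == 0 or rank > n:
        return math.inf
    return sorted(scores)[rank - 1]

def standard_threshold(cal_scores, alpha):
    """Standard CP: one global threshold from all calibration scores."""
    return conformal_quantile(cal_scores, alpha)

def classwise_thresholds(cal_scores, cal_labels, num_classes, alpha):
    """Classwise CP: one threshold per class, from that class's scores only."""
    per_class = [[] for _ in range(num_classes)]
    for score, label in zip(cal_scores, cal_labels):
        per_class[label].append(score)
    return [conformal_quantile(s, alpha) for s in per_class]

# Toy calibration set (made up): class 0 is common, class 1 has 2 examples.
cal_scores = [0.1, 0.2, 0.15, 0.3, 0.25, 0.05, 0.12, 0.22, 0.18, 0.9, 0.8]
cal_labels = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1]

q_hat = standard_threshold(cal_scores, alpha=0.2)
q_cw = classwise_thresholds(cal_scores, cal_labels, num_classes=2, alpha=0.2)
print(q_hat, q_cw)
```

In this toy example the rare class has only two calibration points, so its classwise threshold is infinite and the class would be added to every prediction set. That is the mechanism behind the very large Classwise CP sets described above.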

2. Methodology

The authors propose two distinct approaches to navigate the size-coverage trade-off, both maintaining marginal coverage guarantees.

Approach I: Prevalence-Adjusted Softmax (PAS)

This approach targets Macro-Coverage (the unweighted average of class-conditional coverages) rather than marginal coverage.
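
Written out for $K$ classes (the symbol $K$ and the names $\mathrm{MacroCov}$ and $\mathrm{MarginalCov}$ are notation assumed here for illustration), the two targets differ only in how the class-conditional coverages are weighted:

$\mathrm{MacroCov}(C) = \frac{1}{K} \sum_{y=1}^{K} P(Y \in C(X) \mid Y = y), \qquad \mathrm{MarginalCov}(C) = \sum_{y=1}^{K} p(y) \, P(Y \in C(X) \mid Y = y)$

Marginal coverage weights each class by its prevalence $p(y)$, so a handful of common classes can dominate it; macro-coverage gives every class the same weight $1/K$.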

Theoretical Basis: The authors derive an "oracle" optimal prediction set that minimizes expected set size subject to a macro-coverage constraint. The optimal solution involves thresholding the ratio of the conditional probability to the marginal probability: $p(y|x) / p(y)$ .
The Score Function: They introduce a new conformal score function called Prevalence-Adjusted Softmax (PAS):
$s_{PAS}(x, y) = -\frac{\hat{p}(y|x)}{\hat{p}(y)}$
Where $\hat{p}(y|x)$ is the model's predicted probability and $\hat{p}(y)$ is the estimated prevalence of class $y$ in the training data.
Mechanism: By using PAS as the score function within a Standard CP framework (using a single global quantile threshold), the method effectively down-weights common classes and up-weights rare classes. This allows rare classes to be included in the prediction set without inflating the set size for common classes as drastically as Classwise CP does.
Extension (WPAS): A weighted variant (Weighted PAS) allows users to prioritize specific classes (e.g., endangered species) by assigning them higher weights $\omega(y)$ .
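
The following plain-Python sketch illustrates the PAS score and the resulting prediction set for a single input. The three-class probabilities, the prevalences, and the threshold value are made-up assumptions; in practice the threshold would be the conformal quantile of PAS scores on a calibration set. The weighted variant is not shown because its exact form is not given here.

```python
def pas_scores(probs, prevalence):
    """PAS score for every class: s(x, y) = -p_hat(y|x) / p_hat(y)."""
    return [-p / prev for p, prev in zip(probs, prevalence)]

def prediction_set(scores, q_hat):
    """Classes whose score is at or below the single global threshold."""
    return [y for y, s in enumerate(scores) if s <= q_hat]

# Hypothetical example: class 0 is common, classes 1 and 2 are rare.
prevalence = [0.90, 0.08, 0.02]  # estimated training-set class frequencies
probs = [0.93, 0.02, 0.05]       # softmax output for one test image

scores = pas_scores(probs, prevalence)
q_hat = -1.0  # illustrative threshold, not a calibrated value
pas_set = prediction_set(scores, q_hat)

# Compare class rankings (most plausible first) under the two scores.
softmax_ranking = sorted(range(len(probs)), key=lambda y: -probs[y])
pas_ranking = sorted(range(len(probs)), key=lambda y: scores[y])
print(pas_set, softmax_ranking, pas_ranking)
```

Here the rare class 2 receives only 5% softmax probability, but that is 2.5 times its prevalence, so PAS ranks it first and includes it in the set, while class 1 (predicted below its base rate) is left out.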

Approach II: Interpolated Quantile (INTERP-Q)

This approach targets class-conditional coverage but "softens" the strict requirements of Classwise CP.

Mechanism: It linearly interpolates between the quantile thresholds of Standard CP ( $\hat{q}$ ) and Classwise CP ( $\hat{q}^{CW}_y$ ).
$\hat{q}^{IQ}_y = \tau \cdot \hat{q}^{CW}_y + (1 - \tau) \cdot \hat{q}$
Where $\tau \in [0, 1]$ is a tunable parameter.
Behavior:
- $\tau = 0$ : Recovers Standard CP (small sets, poor rare-class coverage).
- $\tau = 1$ : Recovers Classwise CP (large sets, guaranteed rare-class coverage).
- $0 < \tau < 1$ : Provides a tunable compromise. The authors note that, because the score distribution for rare classes is skewed, setting $\tau$ just below 1 (e.g., 0.99) can significantly improve coverage over Standard CP with only a modest increase in set size.
Guarantee: Theoretically guarantees marginal coverage of at least $1 - 2\alpha$ (though empirically it performs closer to $1-\alpha$ ).
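
A minimal plain-Python sketch of the interpolation step is shown below. The function name and the example thresholds are illustrative assumptions, and the sketch assumes the classwise thresholds are finite (for instance, capped at the largest possible score), since an infinite classwise threshold stays infinite for any $\tau > 0$.

```python
def interp_q_thresholds(q_standard, q_classwise, tau):
    """Per-class thresholds: tau * q_cw[y] + (1 - tau) * q_standard."""
    if not 0.0 <= tau <= 1.0:
        raise ValueError("tau must lie in [0, 1]")
    if tau == 0.0:
        # Return the standard threshold directly; this also avoids
        # computing 0 * inf (which is nan) if a classwise value is infinite.
        return [q_standard for _ in q_classwise]
    return [tau * q_cw + (1.0 - tau) * q_standard for q_cw in q_classwise]

# Made-up thresholds: one global value and three per-class values.
q_standard = 0.30
q_classwise = [0.20, 0.50, 1.00]

for tau in (0.0, 0.5, 1.0):
    print(tau, interp_q_thresholds(q_standard, q_classwise, tau))
```

At $\tau = 0$ every class gets the Standard CP threshold, at $\tau = 1$ each class gets its own Classwise CP threshold, and intermediate values move each class's threshold linearly between the two.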

3. Key Contributions

Identification of the Trade-off: Formalized the inherent conflict between set size and class-conditional coverage in long-tailed settings, showing that existing methods fail to provide a practical middle ground.
New Score Function (PAS): Introduced Prevalence-Adjusted Softmax, which optimizes for macro-coverage by adjusting scores based on class prevalence, effectively approximating the oracle solution for balanced coverage.
New Procedure (INTERP-Q): Proposed a simple, parameterized interpolation between Standard and Classwise thresholds, allowing practitioners to explicitly tune the balance between set size and rare-class inclusion.
Weighted Extension: Extended the framework to Weighted PAS (WPAS), enabling the prioritization of specific subsets of classes (e.g., endangered species) without sacrificing the marginal coverage guarantee.

4. Experimental Results

The methods were evaluated on two large-scale, long-tailed image datasets: Pl@ntNet-300K (1,081 classes) and iNaturalist-2018 (8,142 classes).

Performance vs. Baselines:
- Standard CP: Small sets (avg size 1.57 for Pl@ntNet) but high failure rate for rare classes (42% of species had <50% coverage).
- Classwise CP: Perfect coverage for all classes but unusable set sizes (avg size ~780 for Pl@ntNet).
- Standard with PAS: Achieved a Pareto-optimal trade-off. For Pl@ntNet, it reduced the number of poorly covered species by more than half (from 421 to 180) while only increasing the average set size slightly (from 1.57 to 2.57).
- INTERP-Q: Demonstrated that a $\tau$ just below 1 (e.g., 0.99) yields large improvements in coverage with manageable set sizes (e.g., avg size 3.95 for Pl@ntNet), a far better trade-off than a linear blend of the two baselines would suggest.
Endangered Species Case Study: Using WPAS, the authors successfully increased the coverage of IUCN-listed threatened species significantly, with negligible impact on the coverage of common species or the overall set size.
Human Decision-Making: Simulations showed that Standard with PAS provides the most balanced performance across different human decision-maker types (from random guessers to expert verifiers), maximizing the probability of correct identification while keeping the search space small.
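
The quantities reported above (average set size, coverage, and the number of classes with under 50% coverage) can be computed from test-set prediction sets roughly as follows. This is a plain-Python sketch with made-up data; the function name, dictionary keys, and the choice to skip classes with no test examples are assumptions, not the paper's evaluation code.

```python
def evaluate_sets(pred_sets, labels, num_classes, low_cov=0.5):
    """Summarize set size and coverage for a list of prediction sets."""
    hits = [0] * num_classes
    counts = [0] * num_classes
    for pred_set, label in zip(pred_sets, labels):
        counts[label] += 1
        hits[label] += int(label in pred_set)
    # Class-conditional coverage, skipping classes absent from the test data.
    class_cov = [h / c for h, c in zip(hits, counts) if c > 0]
    return {
        "avg_set_size": sum(len(s) for s in pred_sets) / len(pred_sets),
        "marginal_coverage": sum(hits) / len(labels),
        "macro_coverage": sum(class_cov) / len(class_cov),
        "num_low_coverage_classes": sum(cov < low_cov for cov in class_cov),
    }

# Toy test set: class 0 is common; classes 1 and 2 appear once each.
pred_sets = [{0}, {0}, {0, 1}, {0}, {0}, {2}]
labels = [0, 0, 0, 0, 1, 2]

metrics = evaluate_sets(pred_sets, labels, num_classes=3)
print(metrics)
```

In the toy data, marginal coverage is 5/6 even though class 1 is never covered, while macro-coverage drops to 2/3 and one class is flagged as poorly covered. This is the kind of gap the paper's per-species counts are meant to expose.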

5. Significance

This work is significant for several reasons:

Practical Applicability: It solves a critical bottleneck in deploying AI for citizen science and biodiversity monitoring (e.g., Pl@ntNet), where missing rare or endangered species is a major failure mode.
Theoretical Insight: It moves beyond the binary choice of marginal vs. class-conditional coverage, introducing macro-coverage as a viable and optimizable target for long-tailed problems.
Efficiency: The proposed methods (especially PAS) are computationally efficient and easy to implement (requiring only a change in the score function), making them accessible for real-world deployment.
Mitigating Model Collapse: By ensuring rare classes are included in prediction sets, the methods help prevent "model collapse" in human-in-the-loop systems, where neglecting niche classes leads to a degradation of the model's effective label space over time.

In summary, the paper provides robust, theoretically grounded, and empirically validated tools to make conformal prediction sets useful for the long-tailed classification problems that dominate many real-world applications.
