The Big Picture: The "Know-It-All" AI Problem
Imagine you hire a very smart but overconfident librarian (your AI system) to sort books into specific shelves: "Cooking," "Sci-Fi," and "History."
In a perfect world, every book the librarian sees belongs to one of these three shelves. But in the real world, people bring in weird stuff: a recipe for a sandwich, a comic book, or a blank piece of paper.
The Problem:
Most AI systems are like that overconfident librarian. If you hand them a blank piece of paper, they won't say, "I don't know what this is." Instead, they will force it onto the "History" shelf because it's the closest match, even though it's wrong. They are confident, but they are wrong. This is dangerous if the AI is making decisions about bank accounts, medical advice, or legal documents.
The Goal of This Paper:
The authors want to teach the AI to say, "I'm not sure about this one," and stop before it makes a mistake. They want the AI to measure its own uncertainty.
The Two Types of "Confusion"
The paper argues that an AI gets confused for two very different reasons. To fix this, you need to understand both:
1. The "Blurry Photo" Problem (Embedding Uncertainty)
- The Analogy: Imagine you are trying to identify a friend in a crowd, but the scene is shrouded in fog, or the photo of them is very grainy. Even if you know exactly what your friend looks like, the input is bad.
- In Text: This happens when a user types a query with bad grammar, slang, or typos. The AI can't "see" the meaning clearly.
- The Solution: The AI needs to realize, "Hey, this sentence is messy. I can't trust my guess."
2. The "Twin Brothers" Problem (Gallery Uncertainty)
- The Analogy: Imagine you are trying to identify your friend, but standing right next to them is their identical twin brother. Even if the photo is crystal clear, it's impossible to tell who is who because they look exactly the same.
- In Text: This happens when two different categories are very similar. For example, a user asking "How do I check my bank balance?" is very similar to "How do I check my credit card limit?" The AI knows both answers, but the question is right on the line between the two.
- The Solution: The AI needs to realize, "This question is right on the border between two categories. I shouldn't guess."
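One common way to make the "twin brothers" intuition concrete is to measure the margin between the two best-matching categories: if the gap is tiny, the input sits on a border. The sketch below is purely illustrative; the vectors, category names, and the margin idea as shown are assumptions for explanation, not the paper's actual formulas.

```python
# Illustrative sketch of an ambiguity ("twin brothers") check.
# All vectors and names here are made up for the example.
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def ambiguity_margin(query_vec, category_vecs):
    """Gap between the two closest categories; near 0 means 'twin brothers'."""
    sims = sorted((cosine(query_vec, c) for c in category_vecs), reverse=True)
    return sims[0] - sims[1]

# Two nearly identical categories and one distinct one:
bank_balance = [1.0, 0.1, 0.0]
card_limit = [0.95, 0.15, 0.0]
weather = [0.0, 0.0, 1.0]

# A query that sits between the two banking intents:
query = [1.0, 0.12, 0.0]
margin = ambiguity_margin(query, [bank_balance, card_limit, weather])
print(round(margin, 3))  # a tiny margin: the AI should not guess
```

Note that the query is close to *both* banking categories, so a plain distance check would happily say "Known"; only the small margin reveals the ambiguity.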
The New Tool: "HolUE" (The Holistic Detective)
The authors created a new method called HolUE (Holistic Uncertainty Estimation). Think of it as a detective who doesn't just look at the suspect (the text), but also looks at the crime scene (the database of known answers).
Old Methods:
- Method A (The "Distance" Checker): Only looks at how far away the text is from the known answers. If it's far, it says "Unknown." If it's close, it says "Known." Flaw: It misses the "Twin Brother" problem, because an ambiguous input sits close to two shelves at once, so its distance looks perfectly fine.
- Method B (The "Quality" Checker): Only looks at how clear the text is. If the text is messy, it says "Unknown." Flaw: It misses the "Twin Brother" problem. A clear text can still be ambiguous.
The HolUE Method:
This detective combines both views:
- Is the input messy? (Blurry photo check).
- Is the input stuck between two similar categories? (Twin brother check).
If either is true, the AI raises a red flag: "High Uncertainty! Do not make a decision yet."
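The either-check-fires rule above can be sketched in a few lines. This is a hedged illustration of the idea, not the paper's method: the scoring functions and thresholds are assumptions invented for the example.

```python
# Illustrative sketch of "holistic" flagging: raise a red flag if EITHER
# check fires. Thresholds and score names are assumptions, not the paper's.

def is_messy(clarity_score, clarity_threshold=0.5):
    # "Blurry photo" check: low clarity means the input can't be trusted.
    return clarity_score < clarity_threshold

def is_ambiguous(margin, margin_threshold=0.1):
    # "Twin brothers" check: a tiny gap between the two best-matching
    # categories means the input sits on a border.
    return margin < margin_threshold

def raise_red_flag(clarity_score, margin):
    """High uncertainty if either the input is messy or it is ambiguous."""
    return is_messy(clarity_score) or is_ambiguous(margin)

print(raise_red_flag(clarity_score=0.9, margin=0.4))   # clear and distinct
print(raise_red_flag(clarity_score=0.2, margin=0.4))   # messy input
print(raise_red_flag(clarity_score=0.9, margin=0.02))  # twin brothers
```

The design point is the `or`: a single check can be fooled (a crystal-clear question can still land between two categories), so the system abstains whenever any one view reports trouble.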
How They Tested It
They tested this new detective on three different "jobs":
1. The Authorship Job: Trying to guess who wrote a book.
- Challenge: Distinguishing between a real author and a forger who writes exactly like them.
- Result: HolUE was much better at spotting the forgers without accidentally accusing the real author.
2. The Intent Job: Trying to guess what a user wants (e.g., "Call a taxi" vs. "Check the weather").
- Challenge: Users often ask weird questions that don't fit any category.
- Result: HolUE successfully rejected the weird questions instead of forcing them into the wrong category.
3. The Topic Job: Sorting news articles into topics like "Sports" or "Politics."
- Challenge: Some articles are about both, or about something totally new.
- Result: HolUE improved the accuracy by a huge margin (up to 365% better in some cases!) compared to older methods.
The Takeaway
The main message of this paper is simple: Being "accurate" isn't enough; you need to be "humble."
A truly smart AI system shouldn't just try to get the answer right every time. It should be smart enough to know when it doesn't know the answer. By teaching the AI to measure its own confusion (uncertainty), we can build systems that are safer, more trustworthy, and less likely to make embarrassing or dangerous mistakes when they encounter the unknown.
In short: The authors gave the AI a "lie detector" for its own confidence, allowing it to say, "I'm not sure, let's ask a human," instead of confidently guessing wrong.