Estimation of Confidence Bounds in Binary Classification using Wilson Score Kernel Density Estimation

This paper introduces Wilson Score Kernel Density Classification, a computationally efficient method for estimating reliable confidence bounds in binary classification. It performs comparably to Gaussian Process Classification while serving as a versatile classification head for a variety of feature extractors.

Thorbjørn Mosekjær Iversen, Zebin Duan, Frederik Hagelskjær

Published 2026-02-25

Imagine you are a robot arm trying to assemble a delicate watch. You have to push a tiny gear into a slot. If you push too hard, you break the gear. If you push too softly, it doesn't go in. You need to know: "How sure am I that this will work?"

In the world of Artificial Intelligence (AI), deep learning models are like super-smart robots that can look at a picture and say, "Yes, that's a cat!" or "No, that's a dog!" But here's the problem: AI is often too confident. It might say, "I'm 99% sure this is a cat," when it's actually a fox. In a factory or a hospital, that kind of over-confidence can be dangerous.

This paper introduces a new tool called Wilson Score Kernel Density Classification (WS-KDC). Think of it as a "Reality Check" for AI. It doesn't just tell the AI what to guess; it tells the AI how much it can trust that guess, with a mathematical safety net.

Here is the breakdown using simple analogies:

1. The Problem: The Over-Confident Student

Imagine a student taking a test. They answer every question and give a confidence score (e.g., "I'm 90% sure this answer is right").

  • The Issue: Sometimes, the student is wrong, but they still feel 90% sure.
  • The Consequence: If this student is driving a car or performing surgery, that misplaced confidence is a disaster.
  • The Goal: We need a system that says, "I am only 60% sure, so I will stop and ask a human for help," rather than blindly guessing. This is called Selective Classification.
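The abstain-or-act rule behind selective classification can be sketched in a few lines of Python. The function name and the 90% threshold are illustrative choices, not values from the paper:

```python
def selective_decision(confidence: float, threshold: float = 0.9) -> str:
    """Selective classification: act only when confidence clears the
    threshold; otherwise defer the decision to a human."""
    return "proceed" if confidence >= threshold else "ask_human"
```

With a 90% threshold, a confidence of 0.6 yields "ask_human", matching the student who stops and asks for help instead of guessing.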

2. The Old Way: The Gaussian Process (The Slow, Heavy Calculator)

Before this paper, the best way to get these "safety nets" was using a method called Gaussian Process Classification (GPC).

  • The Analogy: Imagine trying to predict the weather by asking a super-smart meteorologist who has to read every single historical weather report in the world before making a prediction.
  • Pros: Very accurate.
  • Cons: It is slow. Its cost grows very quickly with the amount of data (roughly with the cube of the number of training examples), so if you have a million photos to check, this method might take days to calculate the confidence levels. It's like trying to solve a Rubik's cube while juggling.

3. The New Way: Wilson Score Kernel Density (The Smart, Fast Estimator)

The authors propose a new method: WS-KDC.

  • The Analogy: Instead of reading every single history book, imagine you are standing in a crowd. You want to know if it's going to rain.
    • Step 1 (Kernel Smoothing): You look at the people right next to you. If 8 out of 10 people nearby are holding umbrellas, you assume it's likely raining. You don't care about people in a different city; you care about your immediate neighborhood.
    • Step 2 (Wilson Score): You don't just guess "80% chance." You use a special mathematical rule (the Wilson Score) that says, "Okay, based on this small group, I am statistically sure the real chance is between 65% and 90%."
  • The Magic: This method is incredibly fast. It doesn't need to crunch the whole database. It just looks at the "neighbors" of the current situation and gives you a range (a lower and upper bound) of confidence.
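The two steps above can be sketched as follows. This is a minimal illustration, assuming a Gaussian kernel and plugging kernel-weighted success/trial counts into the standard Wilson score interval; the function names and the exact weighting scheme are assumptions, not the paper's precise formulation:

```python
import numpy as np

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (95% by default)."""
    if n == 0:
        return 0.0, 1.0  # no evidence at all: maximally uncertain
    p = successes / n
    denom = 1.0 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    margin = (z / denom) * np.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return max(0.0, center - margin), min(1.0, center + margin)

def ws_kde_bounds(x_query, X_train, y_train, bandwidth=1.0, z=1.96):
    """Kernel-weighted counts plugged into the Wilson interval.
    Points close to x_query count almost fully; distant points barely
    count at all -- the "immediate neighborhood" idea from Step 1."""
    d2 = np.sum((X_train - x_query) ** 2, axis=1)
    w = np.exp(-d2 / (2 * bandwidth**2))  # Gaussian kernel weights
    n_eff = w.sum()                       # effective number of neighbors
    s_eff = (w * y_train).sum()           # effective number of successes
    return wilson_interval(s_eff, n_eff, z)
```

For the umbrella example, `wilson_interval(8, 10)` returns roughly (0.49, 0.94): with only ten neighbors, "8 out of 10" is compatible with a wide range of true probabilities, which is exactly the honesty the Wilson score buys.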

4. How It Works in Real Life (The Robot Assembly)

The paper tested this on a robot arm inserting parts.

  1. The Input: The robot takes a picture of the part being inserted.
  2. The Feature Extractor: A pre-trained AI (like a "Vision Foundation Model") looks at the picture and turns it into a list of numbers (a "feature vector"). Think of this as the robot describing the picture in a secret code.
  3. The WS-KDC Check: The new method looks at that code. It asks: "Have I seen similar codes before? If so, did they succeed or fail?"
  4. The Decision:
    • If the method says, "I am 95% sure this will succeed," the robot proceeds.
    • If the method says, "My confidence is only 40%," the robot stops and waits for a human.
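Putting the four steps together, a hypothetical end-to-end check might look like the sketch below. The feature extractor here is a toy stand-in (a real system would call a pretrained vision foundation model), and `lower_bound_fn` stands for any estimator returning a pessimistic success probability, such as the Wilson-score lower bound:

```python
import numpy as np

def extract_features(image):
    """Toy stand-in for a vision foundation model: maps an image to a
    feature vector (here, just per-channel means as the 'secret code')."""
    return image.mean(axis=(0, 1))

def should_proceed(image, X_seen, y_outcomes, lower_bound_fn, threshold=0.9):
    """Featurize the current situation, estimate a lower confidence bound
    from similar past attempts, and act only if it clears the threshold."""
    z = extract_features(image)
    lower = lower_bound_fn(z, X_seen, y_outcomes)
    return lower >= threshold
```

The key design point the paper's pipeline illustrates: the decision rests on the *lower* bound, so the robot only proceeds when even the pessimistic estimate says success is likely.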

5. Why Is This a Big Deal?

The authors compared their new "Fast Estimator" (WS-KDC) against the "Slow Calculator" (GPC).

  • Accuracy: They were almost equally good at knowing when to trust the robot and when to stop.
  • Speed: The new method was 100 times faster.
    • Analogy: If the old method took 10 minutes to decide if a robot should move, the new method took 0.1 seconds.
  • Simplicity: The new method only needs one "knob" to tune (the kernel bandwidth, i.e. how big the "neighborhood" is), whereas the old method needs many complex settings.

Summary

This paper gives us a fast, reliable, and easy-to-use safety guard for AI. It allows robots and medical AI to say, "I'm not sure," with a statistical guarantee behind it, without slowing down the whole system. It turns AI from a "guessing game" into a "trustworthy partner" that knows its own limits.
