Imagine you are a doctor trying to decide whether to prescribe a powerful, expensive, and potentially harmful medication to a patient. You have a new AI tool that looks at the patient's data and gives a "risk score" from 0% to 100%.
The problem is: Where do you draw the line?
If you set the line at 10%, you catch almost everyone who is sick, but you also give the dangerous drug to many healthy people (False Positives). If you set the line at 90%, you only treat the very sickest, but you might miss people who are actually sick (False Negatives).
This paper argues that the way we currently judge these AI tools is like judging a chef only by how well they chop onions, ignoring whether the final soup actually tastes good or if the ingredients were fresh.
Here is the breakdown of the paper's argument using simple analogies:
1. The Problem: We Are Measuring the Wrong Thing
Currently, most scientists and engineers evaluate AI models using metrics like Accuracy (how often is the model right?) or AUC-ROC (the area under a curve that measures how well the model ranks risky cases above safe ones).
- The Analogy: Imagine you are hiring a security guard.
- Accuracy asks: "Did the guard correctly identify 99% of the people walking by?"
- The Flaw: If 99% of people are innocent, a guard who just says "No one is a threat" to everyone is 99% accurate! But they are useless.
- The Real World: In medicine or law, a "False Positive" (accusing an innocent person) and a "False Negative" (missing a guilty person) have very different costs. One might cost a patient their health; the other might cost a person their freedom.
- The Paper's Point: Current metrics often treat these two errors as if they cost the same amount, or they ignore the specific "price tag" of the mistake entirely.
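The security-guard flaw above can be shown in a few lines. This is a sketch with made-up numbers: the population size, the 1% threat rate, and the two cost figures are all illustrative assumptions, not values from the paper.

```python
# A "useless guard" sketch: 1000 people, 10 real threats (1%).
# A classifier that always says "no threat" looks great on accuracy.

n_people, n_threats = 1000, 10

# The always-innocent guard: never flags anyone.
true_positives = 0
false_positives = 0
false_negatives = n_threats
true_negatives = n_people - n_threats

accuracy = (true_positives + true_negatives) / n_people
print(accuracy)  # 0.99 -- 99% accurate, catches no one

# Cost-sensitive view (costs are assumptions for illustration):
cost_fn = 100.0  # cost of missing a real threat
cost_fp = 1.0    # cost of a false alarm
total_cost = false_negatives * cost_fn + false_positives * cost_fp
print(total_cost)  # 1000.0 -- the "accurate" guard is very expensive
```

The moment you attach different price tags to the two error types, the 99%-accurate guard stops looking impressive.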
2. The Solution: The "Consequentialist" View
The authors suggest we should judge AI based on consequences. Instead of asking "Is the math right?", we should ask, "If we use this AI to make decisions, how much good or bad will it cause?"
- The Analogy: Think of the AI as a weather forecaster.
- If the forecaster says "50% chance of rain," do you bring an umbrella?
- If you are a farmer, a 50% chance might mean you skip watering your crops (risking drought if the rain never comes).
- If you are a picnic planner, a 50% chance might mean you cancel the event (risk of getting wet).
- The "right" answer depends entirely on your specific situation (your "threshold").
The paper argues we need to evaluate the AI across a range of possible situations, not just one fixed setting.
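The weather-forecaster point can be made concrete by scoring the same forecasts at several thresholds. Everything here is an illustrative sketch: the `decision_cost` helper, the forecasts, and the cost values are assumptions, not the paper's method.

```python
# Sketch: the same forecasts lead to different outcomes depending on
# the decision-maker's threshold, so evaluate across a range of them.

def decision_cost(probs, labels, threshold, cost_fp=1.0, cost_fn=1.0):
    """Average cost of acting whenever predicted risk >= threshold."""
    total = 0.0
    for p, y in zip(probs, labels):
        act = p >= threshold
        if act and y == 0:
            total += cost_fp   # acted, but the event never happened
        elif not act and y == 1:
            total += cost_fn   # failed to act when it mattered
    return total / len(probs)

probs  = [0.1, 0.4, 0.35, 0.8]   # model's rain forecasts
labels = [0, 0, 1, 1]            # whether it actually rained

# Picnic-planner costs: false alarms are cheap, misses are expensive.
for t in (0.2, 0.5, 0.7):
    print(t, decision_cost(probs, labels, t, cost_fp=1.0, cost_fn=5.0))
```

The same model earns a different cost at each threshold, which is exactly why a single fixed setting can be misleading.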
3. The New Tools: "Bounded" Scoring
The authors introduce a new way to measure these models called Bounded Threshold Scoring Rules.
- The Old Way (The "Full Ocean" approach): Traditional methods (like the Brier Score) average the model's performance over every possible decision threshold, from "act even at a 0% chance of rain" all the way to "act only at a 100% chance."
- Critique: This is like judging a weather forecaster on whether they were right about a tornado in a desert. It's mathematically sound, but practically useless because tornadoes don't happen in that desert.
- The New Way (The "Swimming Pool" approach): The authors propose we only judge the model on the plausible range of scenarios.
- If doctors know the treatment threshold lies somewhere between a 10% and a 30% risk, we should only test the AI's performance in that 10–30% zone.
- They call this "Clipping." It's like putting a fence around the swimming pool and only counting how well the swimmer does inside the fence, ignoring the ocean outside.
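A minimal sketch of the clipping idea: clamp each forecast into the plausible decision range before computing an ordinary Brier score, so that differences outside the "fence" stop affecting the score. This is one simple way to realize the concept; the paper's exact definition of bounded threshold scoring rules may differ in its details, and the numbers below are invented.

```python
def brier(probs, labels):
    """Ordinary Brier score: mean squared error of probabilities."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(probs)

def clipped_brier(probs, labels, lo, hi):
    """Brier score after clamping forecasts into [lo, hi].

    The 'swimming pool' idea: disagreements entirely outside the
    plausible decision range no longer move the score.
    """
    clamp = lambda p: min(max(p, lo), hi)
    return brier([clamp(p) for p in probs], labels)

# Two forecasters who disagree only far outside the 10-30% zone:
probs_a = [0.02, 0.20, 0.95]
probs_b = [0.08, 0.20, 0.70]
labels  = [0, 0, 1]

print(brier(probs_a, labels), brier(probs_b, labels))   # scores differ
print(clipped_brier(probs_a, labels, 0.10, 0.30),
      clipped_brier(probs_b, labels, 0.10, 0.30))       # scores agree
```

Once both sets of forecasts are clamped into the 10–30% fence, they become identical, so the bounded score correctly treats the two forecasters as equivalent for this decision.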
4. The "Briertools" Package
The authors didn't just write theory; they built a tool called briertools.
- The Analogy: Imagine a carpenter who has been using a hammer to drive screws for 50 years because it's all they have. The authors built a screwdriver specifically designed for the job.
- This tool makes it easy for doctors, lawyers, and data scientists to plug in their specific "costs" (e.g., "We can tolerate 10 false alarms, but we can't miss 1 real case") and instantly see which AI model is actually the best for their specific job.
5. The Case Study: Breast Cancer
To prove it works, they tested this on breast cancer risk prediction.
- The Situation: Doctors disagree on the exact risk percentage that should trigger a treatment. Some say 1.66%, others say 3%.
- The Result: When they used the old "Average" method, one model looked best. But when they used the new "Bounded" method (focusing only on the 1.66%–3% range), a different model was actually the winner.
- The Lesson: The "best" model depends entirely on where you draw the line. If you don't know exactly where the line is, you should test the model across the whole range where the line might be.
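The case-study lesson can be reproduced on toy numbers. Everything below is invented for illustration (these are not the paper's models or data, and clamping forecasts before scoring is just one sketch of the bounded idea): a model that wins on the full-range Brier score can lose once scoring is restricted to the debated 1.66%–3% band.

```python
# Toy illustration of the ranking flip (all numbers invented).

def brier(probs, labels):
    """Ordinary Brier score: mean squared error of probabilities."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(probs)

def clipped_brier(probs, labels, lo, hi):
    """Brier score after clamping forecasts into [lo, hi] -- a sketch
    of bounded scoring; the paper's exact definition may differ."""
    return brier([min(max(p, lo), hi) for p in probs], labels)

labels  = [0, 0, 0, 1]                 # cancer is rare
model_x = [0.001, 0.001, 0.029, 0.90]  # sharp overall, sloppy in-band
model_y = [0.020, 0.020, 0.005, 0.60]  # duller overall, better in-band

print(brier(model_x, labels), brier(model_y, labels))   # X wins overall
print(clipped_brier(model_x, labels, 0.0166, 0.03),
      clipped_brier(model_y, labels, 0.0166, 0.03))     # Y wins in-band
```

Model X dominates when errors far from the decision zone count, but inside the 1.66%–3% fence Model Y edges it out, which is the paper's point about choosing models for the thresholds you will actually use.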
Summary
This paper is a call to stop judging AI models by abstract math scores that don't match real life.
- Old Way: "Look how high this number is!" (ignoring context).
- New Way: "Let's simulate the real-world decisions, figure out what the costs are, and see which model causes the least harm in the specific situations we care about."
They provide the math, the theory, and the software to help us finally judge AI by how well it helps us make real decisions, rather than just how well it solves a math puzzle.