Do Metrics for Counterfactual Explanations Align with User Perception?

This paper presents an empirical study demonstrating that widely used algorithmic metrics for evaluating counterfactual explanations generally fail to align with human perceptions of explanation quality, highlighting the need for more human-centered evaluation approaches in explainable AI.

Felix Liedeker, Basil Ell, Philipp Cimiano, Christoph Düsing

Published 2026-03-17
📖 4 min read · ☕ Coffee break read

Imagine you are a chef who just invented a new recipe for a cake. You want to know if people will like it.

The Problem:
Instead of asking people to taste the cake, you decide to judge the recipe using a robot calculator. This robot measures things like:

  • How few ingredients did you have to change from the classic recipe? (Sparsity)
  • How close does the new cake stay to the classic one? (Proximity)
  • If you propose several alternative recipes, how different are they from each other? (Diversity)

The robot gives the recipe a score of "9/10" because the math looks perfect. But when you actually serve the cake to real people, they hate it. They say it's too dry, the flavor is weird, or it's just not satisfying.
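Behind the cake analogy, these "robot measurements" have concrete (if paper-dependent) definitions in the explainable-AI literature. They are computed on the "what if" explanations introduced below. Here is a minimal Python sketch of common variants; the exact distance functions differ between papers, and all values are made up:

```python
import numpy as np

def sparsity(x, x_cf):
    """Fraction of features left unchanged (fewer edits = sparser)."""
    x, x_cf = np.asarray(x, dtype=float), np.asarray(x_cf, dtype=float)
    return float(np.mean(x == x_cf))

def proximity(x, x_cf):
    """Negative L1 distance to the original case (higher = closer)."""
    x, x_cf = np.asarray(x, dtype=float), np.asarray(x_cf, dtype=float)
    return float(-np.sum(np.abs(x - x_cf)))

def diversity(cf_set):
    """Mean pairwise L1 distance among several alternative suggestions."""
    cf_set = np.asarray(cf_set, dtype=float)
    n = len(cf_set)
    if n < 2:
        return 0.0
    dists = [np.sum(np.abs(cf_set[i] - cf_set[j]))
             for i in range(n) for j in range(i + 1, n)]
    return float(np.mean(dists))

# An original case and one suggested "what if" alternative (made-up numbers)
x    = [42_000, 3, 1]
x_cf = [47_000, 3, 0]
print(sparsity(x, x_cf), proximity(x, x_cf))  # 0.33..., -5001.0
```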

The Big Question:
This paper asks: Do the robot's scores actually match what humans think?

The researchers from Bielefeld University wanted to find out whether the numeric metrics currently used to evaluate a popular type of AI explanation, called Counterfactual Explanations, actually tell us if a human will find the explanation helpful, trustworthy, or easy to understand.

What is a "Counterfactual Explanation"?

Think of it as a "What If?" story.

  • Scenario: A bank rejects your loan application.
  • The Explanation: The AI says, "If your income had been $5,000 higher, or if you had paid off your credit card, we would have approved you."
  • This is a Counterfactual Explanation. It tells you the minimal changes needed to get a different result (a toy example of how one can be generated is sketched below).
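To make this concrete, here is a toy sketch of generating such an explanation: a brute-force search for the cheapest change that flips a hypothetical loan model's decision. The model, the features, and the search strategy are all illustrative, not the paper's actual setup; real counterfactual generators use smarter optimization:

```python
import itertools

def loan_model(income, debt):
    """Hypothetical bank classifier: approve if income minus debt clears a bar."""
    return "approved" if income - debt >= 50_000 else "rejected"

def find_counterfactual(income, debt, step=1_000, max_steps=20):
    """Brute-force search for the cheapest income raise / debt cut that
    flips the decision from 'rejected' to 'approved'."""
    best = None
    for di, dd in itertools.product(range(max_steps + 1), repeat=2):
        total_change = (di + dd) * step
        new_debt = max(debt - dd * step, 0)  # debt cannot go negative
        if loan_model(income + di * step, new_debt) == "approved":
            if best is None or total_change < best[0]:
                best = (total_change, di * step, dd * step)
    return best

print(loan_model(48_000, 3_000))           # rejected
print(find_counterfactual(48_000, 3_000))  # (5000, 2000, 3000): +$2k income, -$3k debt
```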

The Experiment

The researchers set up a massive taste test:

  1. The Ingredients: They used three different real-world datasets (mushroom safety, obesity levels, and heart disease) to generate hundreds of these "What If?" explanations.
  2. The Robot: They calculated all the standard "math scores" for every single explanation (how close it was, how simple it was, etc.).
  3. The Humans: They recruited 167 participants to look at these explanations and rate them on questions like:
    • "Is this easy to understand?"
    • "Does this make sense?"
    • "Am I satisfied with this answer?"

The Shocking Results

The researchers expected the robot scores to be a good predictor of human happiness. They thought, "If the math says it's a 'perfect' explanation, humans should like it."

They were wrong.

Here is what they found, using simple analogies:

  • The "One Size Fits All" Myth: The robot scores that worked for the "Mushroom" dataset were completely useless for the "Heart Disease" dataset. It's like a thermometer that works perfectly in summer but breaks in winter. There was no universal rule.
  • The "More is Better" Trap: The researchers thought, "Maybe if we combine all the robot scores together, we'll get a perfect prediction." They tried mixing 1, 2, 3, up to 7 different math scores.
    • Result: Adding more scores didn't help. In fact, it made the prediction worse. It's like trying to guess the weather by looking at a thermometer, a barometer, a windsock, and a cloud chart all at once, but the more tools you add, the more confused you get.
  • The Weak Connection: The link between the robot scores and the human ratings was very weak. The robot was essentially guessing blindly.
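The "combine the scores" experiment can be pictured as fitting a predictive model on k metrics at a time and checking how well it generalizes. A hedged sketch with synthetic data, assuming a simple linear model and cross-validated R² (the paper's models and evaluation details may differ):

```python
import itertools
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
n, n_metrics = 200, 7
X = rng.normal(size=(n, n_metrics))  # 7 metric scores per explanation
y = rng.normal(size=n)               # human rating (synthetic and unrelated here,
                                     # to mimic a weak metric-rating link)

for k in range(1, n_metrics + 1):
    # Best cross-validated R^2 over every subset of k metrics
    best = max(
        cross_val_score(LinearRegression(), X[:, list(cols)], y, cv=5).mean()
        for cols in itertools.combinations(range(n_metrics), k)
    )
    print(f"{k} metric(s): best CV R^2 = {best:.3f}")
```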

The Metaphor: The GPS vs. The Driver

Imagine you are driving a car.

  • The AI Metrics are like the GPS saying, "You are 5 feet from your destination." It gives you a precise number.
  • The Human Perception is the driver saying, "I can't see the road, the sign is confusing, and I feel unsafe."

The paper shows that the GPS (the AI metrics) is terrible at predicting whether the driver (the human) feels safe or understands the route. You can have a mathematically perfect route that feels terrifying to the driver.

Why Does This Matter?

Right now, many AI researchers and companies use these "robot scores" to say, "Our AI is explainable and trustworthy!"

This paper says: Stop.
If you build an AI system that is "perfect" according to the math, but the humans using it are confused or distrustful, the system has failed.

The Takeaway

We cannot rely on cold, hard math to measure how humans feel about AI.

  • Current State: We are judging explanations with a ruler, but what matters to humans is how the explanation feels.
  • Future Need: We need to stop trying to force human feelings into math equations. Instead, we need to build AI evaluation systems that actually ask humans, "Do you understand this?" and take their answers seriously.

In short: Just because the math says an explanation is "good" doesn't mean a human will think it's good. We need to listen to people, not just the calculator.
