Do Metrics for Counterfactual Explanations Align with User Perception?

This paper presents an empirical study demonstrating that widely used algorithmic metrics for evaluating counterfactual explanations generally fail to align with human perceptions of explanation quality, highlighting the need for more human-centered evaluation approaches in explainable AI.

Felix Liedeker, Basil Ell, Philipp Cimiano, Christoph Düsing

Published 2026-03-17
📖 4 min read · ☕ Coffee break read

Imagine you are a chef who just invented a new recipe for a cake. You want to know if people will like it.

The Problem:
Instead of asking people to taste the cake, you decide to judge the recipe using a robot calculator. This robot measures things like:

  • How few ingredients did you have to change from the classic recipe? (Sparsity)
  • How close does the new cake stay to the classic one? (Proximity)
  • If you propose several alternative recipes, how different are they from each other? (Diversity)

The robot gives the recipe a score of "9/10" because the math looks perfect. But when you actually serve the cake to real people, they hate it. They say it's too dry, the flavor is weird, or it's just not satisfying.
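Behind the cake analogy, these "robot measurements" have concrete (if paper-dependent) definitions in the explainable-AI literature. They are computed on the "what if" explanations introduced below. Here is a minimal Python sketch of common variants; the exact distance functions differ between papers, and all values are made up:

```python
import numpy as np

def sparsity(x, x_cf):
    """Fraction of features left unchanged (fewer edits = sparser)."""
    x, x_cf = np.asarray(x, dtype=float), np.asarray(x_cf, dtype=float)
    return float(np.mean(x == x_cf))

def proximity(x, x_cf):
    """Negative L1 distance to the original case (higher = closer)."""
    x, x_cf = np.asarray(x, dtype=float), np.asarray(x_cf, dtype=float)
    return float(-np.sum(np.abs(x - x_cf)))

def diversity(cf_set):
    """Mean pairwise L1 distance among several alternative suggestions."""
    cf_set = np.asarray(cf_set, dtype=float)
    n = len(cf_set)
    if n < 2:
        return 0.0
    dists = [np.sum(np.abs(cf_set[i] - cf_set[j]))
             for i in range(n) for j in range(i + 1, n)]
    return float(np.mean(dists))

# An original case and one suggested "what if" alternative (made-up numbers)
x    = [42_000, 3, 1]
x_cf = [47_000, 3, 0]
print(sparsity(x, x_cf), proximity(x, x_cf))  # 0.33..., -5001.0
```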

The Big Question:
This paper asks: Do the robot's scores actually match what humans think?

The researchers from Bielefeld University wanted to find out whether the numeric metrics currently used to evaluate a popular type of AI explanation, called Counterfactual Explanations, actually tell us if a human will find the explanation helpful, trustworthy, or easy to understand.

What is a "Counterfactual Explanation"?

Think of it as a "What If?" story.

  • Scenario: A bank rejects your loan application.
  • The Explanation: The AI says, "If your income had been $5,000 higher, or if you had paid off your credit card, we would have approved you."
  • This is a Counterfactual Explanation. It tells you the minimal changes needed to get a different result (a toy example of how one can be generated is sketched below).
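To make this concrete, here is a toy sketch of generating such an explanation: a brute-force search for the cheapest change that flips a hypothetical loan model's decision. The model, the features, and the search strategy are all illustrative, not the paper's actual setup; real counterfactual generators use smarter optimization:

```python
import itertools

def loan_model(income, debt):
    """Hypothetical bank classifier: approve if income minus debt clears a bar."""
    return "approved" if income - debt >= 50_000 else "rejected"

def find_counterfactual(income, debt, step=1_000, max_steps=20):
    """Brute-force search for the cheapest income raise / debt cut that
    flips the decision from 'rejected' to 'approved'."""
    best = None
    for di, dd in itertools.product(range(max_steps + 1), repeat=2):
        total_change = (di + dd) * step
        new_debt = max(debt - dd * step, 0)  # debt cannot go negative
        if loan_model(income + di * step, new_debt) == "approved":
            if best is None or total_change < best[0]:
                best = (total_change, di * step, dd * step)
    return best

print(loan_model(48_000, 3_000))           # rejected
print(find_counterfactual(48_000, 3_000))  # (5000, 2000, 3000): +$2k income, -$3k debt
```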

The Experiment

The researchers set up a massive taste test:

  1. The Ingredients: They used three different real-world datasets (mushroom safety, obesity levels, and heart disease) to generate hundreds of these "What If?" explanations.
  2. The Robot: They calculated all the standard "math scores" for every single explanation (how close it was, how simple it was, etc.).
  3. The Humans: They recruited 167 participants to look at these explanations and rate them on questions like:
    • "Is this easy to understand?"
    • "Does this make sense?"
    • "Am I satisfied with this answer?"

The Shocking Results

The researchers expected the robot scores to be a good predictor of human happiness. They thought, "If the math says it's a 'perfect' explanation, humans should like it."

They were wrong.

Here is what they found, using simple analogies:

  • The "One Size Fits All" Myth: The robot scores that worked for the "Mushroom" dataset were completely useless for the "Heart Disease" dataset. It's like a thermometer that works perfectly in summer but breaks in winter. There was no universal rule.
  • The "More is Better" Trap: The researchers thought, "Maybe if we combine all the robot scores together, we'll get a perfect prediction." They tried mixing 1, 2, 3, up to 7 different math scores.
    • Result: Adding more scores didn't help. In fact, it made the prediction worse. It's like trying to guess the weather by looking at a thermometer, a barometer, a windsock, and a cloud chart all at once, but the more tools you add, the more confused you get.
  • The Weak Connection: The link between the robot scores and the human ratings was very weak. The robot was essentially guessing blindly.
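The "combine the scores" experiment can be pictured as fitting a predictive model on k metrics at a time and checking how well it generalizes. A hedged sketch with synthetic data, assuming a simple linear model and cross-validated R² (the paper's models and evaluation details may differ):

```python
import itertools
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
n, n_metrics = 200, 7
X = rng.normal(size=(n, n_metrics))  # 7 metric scores per explanation
y = rng.normal(size=n)               # human rating (synthetic and unrelated here,
                                     # to mimic a weak metric-rating link)

for k in range(1, n_metrics + 1):
    # Best cross-validated R^2 over every subset of k metrics
    best = max(
        cross_val_score(LinearRegression(), X[:, list(cols)], y, cv=5).mean()
        for cols in itertools.combinations(range(n_metrics), k)
    )
    print(f"{k} metric(s): best CV R^2 = {best:.3f}")
```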

The Metaphor: The GPS vs. The Driver

Imagine you are driving a car.

  • The AI Metrics are like the GPS saying, "You are 5 feet from your destination." It gives you a precise number.
  • The Human Perception is the driver saying, "I can't see the road, the sign is confusing, and I feel unsafe."

The paper shows that the GPS (the AI metrics) is terrible at predicting whether the driver (the human) feels safe or understands the route. You can have a mathematically perfect route that feels terrifying to the driver.

Why Does This Matter?

Right now, many AI researchers and companies use these "robot scores" to say, "Our AI is explainable and trustworthy!"

This paper says: Stop.
If you build an AI system that is "perfect" according to the math, but the humans using it are confused or distrustful, the system has failed.

The Takeaway

We cannot rely on cold, hard math to measure how humans feel about AI.

  • Current State: We are judging explanations with a ruler, but what matters to humans is how the explanation feels.
  • Future Need: We need to stop trying to force human feelings into math equations. Instead, we need to build AI evaluation systems that actually ask humans, "Do you understand this?" and take their answers seriously.

In short: Just because the math says an explanation is "good" doesn't mean a human will think it's good. We need to listen to people, not just the calculator.
