Assessing Model-Agnostic XAI Methods against EU AI Act Explainability Requirements

This paper addresses the gap between existing model-agnostic Explainable AI methods and the EU AI Act's regulatory requirements by proposing a qualitative-to-quantitative scoring framework that helps practitioners assess compliance and identify areas needing further research.

Original authors: Francesco Sovrano, Giulia Vilone, Michael Lognoul

Published 2026-04-14

This is an AI-generated explanation of the paper below. It is not written or endorsed by the authors. For technical accuracy, refer to the original paper.

Imagine you have a magic black box that makes important decisions about your life—like whether you get a loan, a job, or medical treatment. You know the box gives an answer, but you have no idea how it decided that. It's like a chef who hands you a delicious meal but refuses to tell you the recipe or even what ingredients are in it.

In the past, if you asked, "Why did you do this?", the chef might say, "Because the magic box said so." But the EU AI Act (a new set of strict rules for Europe) is saying: "No more magic tricks. You must explain your recipe."

This paper is like a guidebook for chefs and inspectors trying to figure out which tools can help them explain the black box's secrets without breaking the law.

Here is the breakdown of what the authors did, using simple analogies:

1. The Problem: The "Translation Gap"

There is a huge misunderstanding between lawyers and computer scientists.

  • Lawyers want explanations that are fair, clear, and help people understand why a decision was made so they can fight it if it's wrong.
  • Computer Scientists have built many tools (called XAI methods) to explain AI, but they often speak a different language. They focus on math and code, not on human understanding or legal rights.

The authors realized there was a "gap" where companies didn't know which tool to use to stay legal. They asked: "Which of these computer tools actually satisfies the law's requirements?"

2. The Solution: A "Compliance Scorecard"

To fix this, the authors built a scoring system. Think of it like a restaurant health inspection, but instead of checking for dirty floors, they check if the AI explanation tool is "legal."

They looked at three main qualities (properties) that any good explanation needs:

  • Faithfulness (The "Truth" Test): Does the explanation actually reflect how the AI reached its decision, or is it just making up a plausible story? (e.g., if the AI rejected your loan because of your age, the explanation shouldn't say it was because of your hair color).
  • Robustness (The "Stability" Test): If you change the input slightly (like changing your age by one day), does the explanation stay the same? If the explanation flips wildly with tiny changes, it's not reliable.
  • Complexity (The "Simplicity" Test): Is the explanation too complicated for a human to understand? A 100-page math proof isn't a good explanation for a regular person.
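
To make these three tests a little more concrete, here is a minimal Python sketch of how each property could be turned into a number for a tool that assigns an importance score to every input feature. The formulas and function names below are illustrative toys, not the actual metrics defined in the paper.

```python
import numpy as np

def faithfulness_check(model_predict, x, attributions, baseline=0.0):
    # Toy faithfulness test: removing the feature the explanation calls most
    # important should change the prediction more than removing the one it
    # calls least important. Assumes model_predict(x) returns a single score.
    most = np.argmax(np.abs(attributions))
    least = np.argmin(np.abs(attributions))
    x_most, x_least = x.copy(), x.copy()
    x_most[most] = baseline    # "remove" the top-ranked feature
    x_least[least] = baseline  # "remove" the bottom-ranked feature
    base = model_predict(x)
    return abs(base - model_predict(x_most)) >= abs(base - model_predict(x_least))

def robustness_score(explain, x, epsilon=0.01):
    # Toy robustness test: nudge the input a tiny bit and compare the two
    # explanations. Values near 1.0 mean the explanation barely moved.
    a1 = explain(x)
    a2 = explain(x + epsilon * np.random.randn(*x.shape))
    return 1.0 - np.linalg.norm(a1 - a2) / (np.linalg.norm(a1) + 1e-12)

def complexity_score(attributions, threshold=0.05):
    # Toy complexity test: count how many features carry non-negligible weight.
    # Fewer active features means a simpler explanation for a human to read.
    cutoff = threshold * np.abs(attributions).max()
    return int(np.sum(np.abs(attributions) > cutoff))
```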

3. The Process: How They Scored the Tools

The authors took a list of popular AI explanation tools (like SHAP, LIME, Decision Trees, and Counterfactuals) and gave them grades from 1 to 5 on those three qualities.

  • SHAP was like the honest accountant: Very accurate and truthful, but sometimes a bit slow and complex.
  • Decision Trees were like the clear flowchart: Very easy to read, but if you change a few rows of the training data, the whole chart might be redrawn completely differently (unstable).
  • LIME was like a quick sketch: Good for a quick guess, but if you ask it twice, you might get two different sketches (unstable).
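
Here is a minimal sketch of what such a scorecard could look like in code, and how a company might weight the three qualities for its own use case. The 1-to-5 numbers are made-up placeholders for illustration, not the grades reported in the paper, and "simplicity" is just the complexity property flipped so that higher is better.

```python
# Hypothetical 1-to-5 grades; placeholders, not the paper's actual scores.
SCORECARD = {
    "SHAP":           {"faithfulness": 5, "robustness": 4, "simplicity": 2},
    "LIME":           {"faithfulness": 3, "robustness": 2, "simplicity": 4},
    "Decision Trees": {"faithfulness": 3, "robustness": 2, "simplicity": 5},
}

def best_tool(weights, scorecard=SCORECARD):
    # Pick the tool with the highest weighted total for a given use case.
    def total(grades):
        return sum(weights.get(quality, 0) * grade for quality, grade in grades.items())
    return max(scorecard, key=lambda tool: total(scorecard[tool]))

# Example: an individual right-to-explanation request might weight truthfulness highest.
print(best_tool({"faithfulness": 0.6, "robustness": 0.3, "simplicity": 0.1}))
```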

4. The Result: Matching Tools to Rules

The EU AI Act has different rules for different situations. The authors created a map to show which tool fits which rule:

  • Scenario A: "Why was my specific loan rejected?" (Art. 86)
    • The Law wants: A clear, specific reason for you.
    • The Best Tool: SHAP or RuleSHAP. They are the most honest about exactly which factors mattered for your specific case.
  • Scenario B: "How does this system work in general?" (Art. 11 & 13)
    • The Law wants: A big-picture view for the company's internal files.
    • The Best Tool: Decision Trees or RuleFit. They provide a simple, readable list of rules that humans can easily digest.
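
To show how the two scenarios differ in practice, here is a hedged, illustrative sketch that pairs a local SHAP explanation for one applicant (Scenario A) with a global decision-tree surrogate that summarizes overall behavior (Scenario B). The synthetic loan data, feature names, and models are stand-ins, it assumes the shap and scikit-learn packages are installed, and it is not the experimental setup from the paper.

```python
import numpy as np
import shap
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic "loan" data: purely illustrative stand-in features and labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
y = (X[:, 0] - X[:, 1] + 0.5 * X[:, 3] > 0).astype(int)
feature_names = ["income", "debt", "age", "credit_history"]

black_box = RandomForestClassifier(random_state=0).fit(X, y)

# Scenario A (Art. 86): a local, case-specific explanation for one applicant.
explainer = shap.TreeExplainer(black_box)
local_attributions = explainer.shap_values(X[:1])  # per-feature contributions for this one case

# Scenario B (Art. 11 & 13): a small global surrogate that documents overall behavior.
surrogate = DecisionTreeClassifier(max_depth=3, random_state=0)
surrogate.fit(X, black_box.predict(X))  # mimic the black box with human-readable rules
print(export_text(surrogate, feature_names=feature_names))
```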

5. The Big Takeaway

The paper concludes that there is no single "perfect" tool. It's like asking, "What is the best vehicle?"

  • If you need to race, you need a Formula 1 car (SHAP: accurate but complex).
  • If you need to drive a family to the grocery store, you need a minivan (Decision Trees: simple and clear).

The authors' framework helps companies pick the right "vehicle" for the specific legal "road" they are driving on.

Why This Matters

Before this paper, companies were guessing which tools to use to comply with the new laws. This paper gives them a checklist and a score.

  • If a company uses a tool that scores low on "Truth" (Faithfulness), they might get fined for lying about how their AI works.
  • If they use a tool that scores low on "Simplicity" (Complexity), they might get fined for confusing the customer.

In short: This paper bridges the gap between the legal world (which wants fairness and clarity) and the tech world (which builds complex math). It tells engineers, "Here is the math tool that will keep you out of trouble with the law."
