Towards Attributions of Input Variables in a Coalition

This paper addresses the challenge of partitioning input variables in Shapley value-based Explainable AI. It analyzes attribution conflicts caused by AND-OR interactions, proposes a new attribution metric for variable coalitions, and introduces three faithfulness evaluation metrics validated across diverse domains.

Xinhao Zheng, Huiqi Deng, Quanshi Zhang

Published 2026-02-25

Imagine you are trying to figure out why a specific team won a soccer match.

You have a list of players: Striker, Midfielder, Defender, and Goalkeeper.

  • Old Method: You ask, "How much did the Striker contribute?" Then you ask, "How much did the Midfielder contribute?" You add them up, and you get a total score.
  • The Problem: What if the Striker and Midfielder are best friends who always pass the ball to each other? If you look at them separately, you might miss the magic that happens when they work together. Or worse, if you group them as a "Forward Duo" and ask, "How much did the Duo contribute?", the math might get weird. The "Duo's" score might not equal the sum of the two individuals' scores.
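The "old method" here is the Shapley value: average each player's marginal contribution over every possible lineup. A minimal sketch, using a made-up match-value function of our own (the +2 "duo synergy" bonus is purely illustrative, not from the paper):

```python
from itertools import combinations
from math import factorial

players = ["striker", "midfielder", "defender", "goalkeeper"]

# Toy match-value function: each player adds 1 on their own, and the
# striker-midfielder duo adds a +2 synergy bonus when both play.
def v(coalition):
    score = len(coalition)
    if {"striker", "midfielder"} <= set(coalition):
        score += 2
    return score

def shapley(player):
    """Average the player's marginal contribution over all coalitions."""
    others = [p for p in players if p != player]
    n = len(players)
    total = 0.0
    for k in range(len(others) + 1):
        for sub in combinations(others, k):
            w = factorial(k) * factorial(n - k - 1) / factorial(n)
            total += w * (v(set(sub) | {player}) - v(set(sub)))
    return total

phi = {p: shapley(p) for p in players}
print(phi)  # the duo members split their +2 synergy evenly

# Efficiency: the individual scores add up to the full team's value.
assert abs(sum(phi.values()) - v(players)) < 1e-9
```

Note how the synergy is silently split 50/50 between the striker and the midfielder: the per-player scores sum correctly, but nothing in them tells you the bonus came from the duo acting as a unit.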

This paper is about fixing that math so we can understand AI models (like the ones that recognize cats in photos or write essays) without getting confused about how to group their "ingredients."

Here is the breakdown in simple terms:

1. The Core Problem: The "Grouping" Confusion

In AI, we want to know which parts of the input (like words in a sentence or pixels in an image) are responsible for the decision.

  • The Conflict: If you look at the word "raining" alone, it might seem important. If you look at "cats" alone, it might seem important. But the phrase "raining cats and dogs" is a specific idiom meaning "heavy rain."
  • The Issue: If you treat "raining," "cats," and "dogs" as separate players, you might miss the joke. But if you group them into a "Coalition" (a team), the math used to calculate their importance often breaks. The score for the group doesn't match the sum of the individuals. It's like saying the whole pie is worth $10, but the slices add up to $15. That doesn't make sense!

2. The Solution: The "Recipe" Analogy

The authors realized that AI models work like a complex recipe. They have two types of interactions:

  • The "AND" Interaction (The Secret Sauce): This happens when everything in a group must be present for a specific effect.
    • Example: To make a "Heavy Rain" prediction, the AI needs "raining" AND "cats" AND "dogs" all at once. If you remove one, the "Heavy Rain" effect disappears.
  • The "OR" Interaction (The Backup Plan): This happens if any one of a group is present.
    • Example: To predict "Negative Mood," the AI might react if it sees "boring" OR "sad" OR "terrible." You only need one of them to trigger the effect.
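The "AND" effect can be measured with the Harsanyi dividend: an alternating sum over subsets that isolates the part of the output triggered only when every member of a group is present. A small sketch, with a toy model output of our own (a +1 "heavy rain" score that fires only on the full idiom):

```python
from itertools import combinations

words = ["raining", "cats", "dogs"]

# Hypothetical model output: +1 "heavy rain" only when the full idiom appears.
def v(S):
    return 1.0 if set(words) <= set(S) else 0.0

def and_interaction(T):
    """Harsanyi dividend of T: the effect that needs ALL of T at once."""
    total = 0.0
    for k in range(len(T) + 1):
        for S in combinations(T, k):
            total += (-1) ** (len(T) - k) * v(S)
    return total

# (An OR interaction is the mirror image: it can be computed as an
# AND interaction on the variables' absences instead of their presences.)
print(and_interaction(("raining", "cats", "dogs")))  # 1.0: one AND unit
print(and_interaction(("raining", "cats")))          # 0.0: no partial effect
```

The full triple carries the entire effect, and every partial subset gets exactly zero, which is what "everything must be present" means in this algebra.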

The Big Discovery:
The authors proved that the confusion (the math conflict) happens because of the "Partial Groups."

  • Imagine a group of friends: Alice, Bob, and Charlie.
  • The AI has a rule: "Alice + Bob + Charlie = Great Party."
  • But it also has a rule: "Alice + Bob = Good Party."
  • If you try to calculate the value of the "Alice + Bob" team, the math gets messy because the AI is counting them in two different ways (once as a pair, once as part of a trio).
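You can watch the mess happen with a tiny numerical check. Below is a toy game of our own design (the +2 and +5 party bonuses are illustrative): compare the sum of Alice's and Bob's individual Shapley scores against the score of an "Alice+Bob" super-player in the regrouped game.

```python
from itertools import combinations
from math import factorial

def shapley(players, v, target):
    """Exact Shapley value of `target` in the game `v` over `players`."""
    others = [p for p in players if p != target]
    n = len(players)
    total = 0.0
    for k in range(len(others) + 1):
        for sub in combinations(others, k):
            w = factorial(k) * factorial(n - k - 1) / factorial(n)
            total += w * (v(set(sub) | {target}) - v(set(sub)))
    return total

# "Alice + Bob = Good Party" (+2) and "Alice + Bob + Charlie = Great Party" (+5).
def v(S):
    return 2 * ({"Alice", "Bob"} <= S) + 5 * ({"Alice", "Bob", "Charlie"} <= S)

# Sum of the two individual scores in the 3-player game:
indiv = sum(shapley(["Alice", "Bob", "Charlie"], v, p) for p in ["Alice", "Bob"])

# Score of an "Alice+Bob" super-player in the coarsened 2-player game:
def v_grouped(S):
    expanded = set()
    for p in S:
        expanded |= {"Alice", "Bob"} if p == "AB" else {p}
    return v(expanded)

team = shapley(["AB", "Charlie"], v_grouped, "AB")

print(indiv, team)  # the pair's score depends on how you group the players
```

The individual route gives the pair 16/3 ≈ 5.33, while the super-player route gives 4.5: the three-way "Great Party" effect is shared 2/3-to-1/3 in one game but 1/2-to-1/2 in the other. That partial overlap is exactly the source of the conflict the paper identifies.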

3. The New Tool: "Coalition Attribution"

The paper proposes a new way to calculate importance that respects these groups.

  • Old Way: Just add up individual scores. (Result: Confusion).
  • New Way: Look at the "Recipe." If a group of variables (like "raining cats and dogs") acts as a single unit in the AI's brain, the new math gives them a single, fair score. It acknowledges that sometimes, the whole is indeed different from the sum of its parts.

They created three "Trust Scores" to check if a group is a real team or just a random bunch of people standing together:

  1. Is this group a "True Team"? (Do they always work together in the AI's logic?)
  2. Is this group "Fake"? (Did we just randomly grab these words together, even though the AI doesn't see them as a unit?)
  3. Is this group "Mixed"? (Are they sometimes a team, but sometimes just individuals?)

4. Real-World Tests

The authors tested this on:

  • Language: They checked if the AI understood phrases like "raining cats and dogs" as a single unit. It did!
  • Images: They checked if the AI saw a "horse's head" as a single object made of pixels, rather than just random pixels. It did!
  • Go (The Board Game): This was the coolest part. They used their method to explain why a professional Go AI (KataGo) made a move.
    • Human players memorize "shapes" (patterns of stones).
    • The AI found patterns that humans didn't know existed! The new math helped humans understand why the AI liked a specific shape, revealing new strategies that even experts hadn't seen before.

The Takeaway

This paper is like a translator for AI.
Previously, if you asked an AI, "Why did you do that?" and tried to group the reasons, the answer was often mathematically broken.
Now, the authors have given us a new set of glasses. These glasses allow us to see groups of inputs (like phrases, image patches, or game patterns) as legitimate "teams" with their own value, without breaking the math. This helps us trust AI more and even learn new things from it (like new Go strategies).
