Exact Functional ANOVA Decomposition for Categorical Input Models

This paper presents a computationally efficient, closed-form functional ANOVA decomposition for models with categorical inputs. It resolves the limitations of sampling-based approximations by extending discrete Fourier analysis to arbitrary dependence structures and non-rectangular supports, and it naturally generalizes SHAP values to the general categorical setting.

Baptiste Ferrere, Nicolas Bousquet, Fabrice Gamboa, Jean-Michel Loubes, Joseph Muré

Published 2026-03-04

Imagine you have a complex machine, like a high-tech coffee maker, that takes in various ingredients (water, beans, milk, sugar) and produces a perfect cup of coffee. You want to know: Exactly how much did each ingredient contribute to the taste? And more importantly, did the combination of "beans + milk" create a special flavor that neither had on its own?

In the world of Artificial Intelligence (AI), this is called Explainability. We want to understand why a model made a specific prediction.

For a long time, scientists had a great tool for independent ingredients (where the amount of water doesn't change the type of bean). This tool is called Functional ANOVA. It breaks a prediction down into a sum of simple parts:

  • Main Effects: How much did the water contribute alone?
  • Interactions: How much did the "water + beans" combo contribute?
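For independent inputs, this decomposition can be written down directly. Here is a minimal Python sketch using the coffee analogy, with made-up taste scores and uniform marginals (all names and numbers below are hypothetical illustrations, not values from the paper):

```python
import itertools

# Hypothetical model: f(water, beans) -> taste score on a small categorical grid.
water_levels = ["soft", "hard"]
bean_levels = ["arabica", "robusta", "blend"]
f = {("soft", "arabica"): 8.0, ("soft", "robusta"): 5.0, ("soft", "blend"): 6.5,
     ("hard", "arabica"): 7.0, ("hard", "robusta"): 4.0, ("hard", "blend"): 6.0}

# Independent marginals (assumed uniform here for simplicity).
p_water = {w: 1 / len(water_levels) for w in water_levels}
p_beans = {b: 1 / len(bean_levels) for b in bean_levels}

# Grand mean: f0 = E[f].
f0 = sum(p_water[w] * p_beans[b] * f[w, b]
         for w, b in itertools.product(water_levels, bean_levels))

# Main effects: f_w(w) = E[f | water = w] - f0, and likewise for beans.
f_w = {w: sum(p_beans[b] * f[w, b] for b in bean_levels) - f0 for w in water_levels}
f_b = {b: sum(p_water[w] * f[w, b] for w in water_levels) - f0 for b in bean_levels}

# Interaction: whatever the grand mean and main effects leave unexplained.
f_wb = {(w, b): f[w, b] - f0 - f_w[w] - f_b[b] for w, b in f}

# The pieces reconstruct every prediction exactly.
for (w, b), val in f.items():
    assert abs(f0 + f_w[w] + f_b[b] + f_wb[w, b] - val) < 1e-12
```

Each main effect averages to zero under its marginal, which is what makes the pieces add up cleanly in the independent case.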

The Problem:
Real-world data is messy. Ingredients are often linked. For example, if you buy "organic" beans, you might also buy "fair-trade" sugar. They are dependent. When ingredients are dependent, the old math breaks down. The formulas become so complex that computers can't solve them exactly; they have to guess using slow, expensive simulations (like trying to taste every possible coffee combination in the universe).

The Solution (This Paper):
The authors, Baptiste Ferrere and his team, have invented a magic calculator for categorical data (data that falls into categories, like "Red/Blue/Green" or "Yes/No/Maybe").

Here is the breakdown of their breakthrough using simple analogies:

1. The "Lego" Analogy

Imagine your data is a giant box of Lego bricks.

  • The Old Way: If the bricks were all different and unrelated, you could easily count how many red, blue, and green bricks you used. But if the bricks were glued together in weird, unpredictable shapes (dependencies), counting them became a nightmare. You had to take the whole structure apart brick by brick, which took forever.
  • The New Way: The authors found a way to look at the Lego structure and instantly see the "blueprint." They created a closed-form formula. This means they didn't need to guess or simulate; they wrote a direct mathematical recipe that tells you exactly how much each brick (and each group of glued bricks) contributed to the final shape.
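The "blueprint vs. brick-by-brick" contrast can be made concrete. With a finite categorical support, an expectation can be summed exactly over the joint distribution instead of estimated by sampling. A toy sketch (the joint probabilities and scores are invented for illustration):

```python
import random

# Hypothetical joint pmf over two dependent categorical features:
# "organic" beans co-occur with "fair-trade" sugar more often than not.
p = {("organic", "fairtrade"): 0.4, ("organic", "regular"): 0.1,
     ("conventional", "fairtrade"): 0.1, ("conventional", "regular"): 0.4}
f = {("organic", "fairtrade"): 9.0, ("organic", "regular"): 7.0,
     ("conventional", "fairtrade"): 6.0, ("conventional", "regular"): 5.0}

# Closed form: exact expectation by summing over the (finite) support.
exact = sum(p[x] * f[x] for x in p)

# Sampling-based alternative: draw from the joint and average -- approximate,
# and it only converges as the number of simulations grows.
random.seed(0)
keys = list(p)
draws = random.choices(keys, weights=[p[k] for k in keys], k=10_000)
mc = sum(f[x] for x in draws) / len(draws)
```

The closed-form sum costs four multiplications here; the simulation costs ten thousand model evaluations and still carries sampling noise.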

2. The "Detective" Analogy

Imagine you are a detective trying to solve a crime (the AI's prediction).

  • The Suspects: The features (e.g., "Age," "Income," "Location").
  • The Twist: The suspects are friends. If "Location" is "New York," "Income" is likely "High." They are dependent.
  • The Old Detective: Had to interview thousands of people, simulate thousands of scenarios, and hope to get a rough idea of who did it.
  • The New Detective: Has a super-powerful lens. Because the data is categorical (like "Male/Female" or "City A/B"), the new method can instantly separate the guilt of the individual from the guilt of the group. It can say, "The 'New York' location contributed 20%, but the 'High Income' contributed 10%, and the fact that they always appear together contributed another 5%."
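The "suspects are friends" twist is just statistical dependence, which is easy to exhibit on a toy joint distribution (the probabilities below are invented for illustration):

```python
# Hypothetical joint distribution over two categorical features:
# location and income tend to move together.
joint = {("NY", "high"): 0.35, ("NY", "low"): 0.05,
         ("LA", "high"): 0.20, ("LA", "low"): 0.40}

p_ny = joint["NY", "high"] + joint["NY", "low"]       # marginal P(loc = NY)
p_high = joint["NY", "high"] + joint["LA", "high"]    # marginal P(inc = high)
p_high_given_ny = joint["NY", "high"] / p_ny          # conditional P(high | NY)

# Independence would require P(high | NY) == P(high); here it fails,
# which is exactly the regime where the classical ANOVA formulas break.
dependent = abs(p_high_given_ny - p_high) > 1e-9
```

Whenever that conditional differs from the marginal, "interviewing" one suspect tells you something about the other, and attributions must account for the shared part.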

3. The "Sparse Library" Analogy

The paper tackles a huge problem: Sparsity.
Imagine a library with 100 trillion possible books (all combinations of inputs), but the library only actually has 10,000 books on the shelves.

  • The Old Problem: Traditional math tries to calculate the value of all 100 trillion books, even the ones that don't exist. This is impossible and slow.
  • The New Trick: The authors realized, "Hey, we only have 10,000 books! Let's just focus on those." They built a system that ignores the empty shelves and only calculates the importance of the books that actually exist. This makes the calculation instantly fast, even for massive datasets.
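The "ignore the empty shelves" idea corresponds to working only on the observed support of the data. A short sketch with a hypothetical dataset and a hypothetical model:

```python
from collections import Counter

# Hypothetical dataset: three categorical features, many possible
# combinations, but only a handful ever observed.
rows = [("NY", "high", "urban"), ("NY", "high", "urban"),
        ("TX", "low", "rural"), ("NY", "high", "suburban"),
        ("TX", "low", "rural")]

# Empirical joint pmf: keys are only the combinations that actually occur.
counts = Counter(rows)
pmf = {combo: c / len(rows) for combo, c in counts.items()}

# A toy model evaluated only on the support -- never on "empty shelves".
def model(combo):
    return {"NY": 1.0, "TX": 0.0}[combo[0]] + {"high": 2.0, "low": 0.0}[combo[1]]

# Exact expectation over the support: 3 terms here, not |levels|^3.
exact_mean = sum(p * model(combo) for combo, p in pmf.items())
```

The cost scales with the number of distinct observed combinations, not with the full Cartesian product of all category levels.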

Why This Matters (The "So What?")

  1. Speed: What used to take hours or days of computer time now takes seconds.
  2. Accuracy: It gives the exact answer, not an approximation. No more guessing.
  3. Realism: It works for messy, real-world data where variables are linked (like the "organic beans" example), which most previous tools couldn't handle well.
  4. SHAP Values: It extends "SHAP values" (a popular tool for explaining AI) to dependent categorical data, computed exactly rather than approximated. It's like upgrading from a blurry photo to a 4K HD image of why the AI made a decision.
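The SHAP connection has a classical core: once you hold exact ANOVA components, a Shapley-style attribution for a prediction follows by splitting each interaction equally among its members (this identity is the textbook one for orthogonal components under independence; the paper's contribution is extending the setting, and the component values below are made up for illustration):

```python
# Hypothetical exact ANOVA components for one prediction x:
# keys are feature subsets, values are the component evaluated at x.
components = {frozenset(): 6.0,                   # grand mean
              frozenset({"loc"}): 1.5,            # main effect of location
              frozenset({"inc"}): 0.8,            # main effect of income
              frozenset({"loc", "inc"}): 0.4}     # interaction

def shap_from_anova(components, feature):
    # Each interaction term is shared equally among the features in it.
    return sum(val / len(S) for S, val in components.items() if feature in S)

phi_loc = shap_from_anova(components, "loc")  # 1.5 + 0.4 / 2
phi_inc = shap_from_anova(components, "inc")  # 0.8 + 0.4 / 2

# Grand mean plus the attributions recovers the prediction exactly.
prediction = sum(components.values())
assert abs(components[frozenset()] + phi_loc + phi_inc - prediction) < 1e-12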

In a Nutshell

The authors took a complex mathematical puzzle that was stuck in the "too hard to solve" pile and solved it for a very common type of data (categories). They built a direct, fast, and exact map that shows us exactly how different factors and their combinations drive AI decisions, even when those factors are messy and linked together.

This means we can finally trust AI models with categorical data (like medical records, financial forms, or survey results) much more, because we can finally see the "engine" running underneath the hood.
