Exact Functional ANOVA Decomposition for Categorical Input Models

This paper presents a computationally efficient, closed-form functional ANOVA decomposition for models with categorical inputs. It resolves the limitations of sampling-based approximations by extending discrete Fourier analysis to arbitrary dependence structures and non-rectangular supports, and it naturally generalizes SHAP values to the general categorical setting.

Baptiste Ferrere, Nicolas Bousquet, Fabrice Gamboa, Jean-Michel Loubes, Joseph Muré

Published 2026-03-04

Imagine you have a complex machine, like a high-tech coffee maker, that takes in various ingredients (water, beans, milk, sugar) and produces a perfect cup of coffee. You want to know: Exactly how much did each ingredient contribute to the taste? And more importantly, did the combination of "beans + milk" create a special flavor that neither had on its own?

In the world of Artificial Intelligence (AI), this is called Explainability. We want to understand why a model made a specific prediction.

For a long time, scientists had a great tool for independent ingredients (where the amount of water doesn't change the type of bean). This tool is called Functional ANOVA. It breaks a prediction down into a sum of simple parts:

  • Main Effects: How much did the water contribute alone?
  • Interactions: How much did the "water + beans" combo contribute?
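For independent inputs, this decomposition can be written down directly. Here is a minimal Python sketch using the coffee analogy, with made-up taste scores and uniform marginals (all names and numbers below are hypothetical illustrations, not values from the paper):

```python
import itertools

# Hypothetical model: f(water, beans) -> taste score on a small categorical grid.
water_levels = ["soft", "hard"]
bean_levels = ["arabica", "robusta", "blend"]
f = {("soft", "arabica"): 8.0, ("soft", "robusta"): 5.0, ("soft", "blend"): 6.5,
     ("hard", "arabica"): 7.0, ("hard", "robusta"): 4.0, ("hard", "blend"): 6.0}

# Independent marginals (assumed uniform here for simplicity).
p_water = {w: 1 / len(water_levels) for w in water_levels}
p_beans = {b: 1 / len(bean_levels) for b in bean_levels}

# Grand mean: f0 = E[f].
f0 = sum(p_water[w] * p_beans[b] * f[w, b]
         for w, b in itertools.product(water_levels, bean_levels))

# Main effects: f_w(w) = E[f | water = w] - f0, and likewise for beans.
f_w = {w: sum(p_beans[b] * f[w, b] for b in bean_levels) - f0 for w in water_levels}
f_b = {b: sum(p_water[w] * f[w, b] for w in water_levels) - f0 for b in bean_levels}

# Interaction: whatever the grand mean and main effects leave unexplained.
f_wb = {(w, b): f[w, b] - f0 - f_w[w] - f_b[b] for w, b in f}

# The pieces reconstruct every prediction exactly.
for (w, b), val in f.items():
    assert abs(f0 + f_w[w] + f_b[b] + f_wb[w, b] - val) < 1e-12
```

Each main effect averages to zero under its marginal, which is what makes the pieces add up cleanly in the independent case.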

The Problem:
Real-world data is messy. Ingredients are often linked. For example, if you buy "organic" beans, you might also buy "fair-trade" sugar. They are dependent. When ingredients are dependent, the old math breaks down. The formulas become so complex that computers can't solve them exactly; they have to guess using slow, expensive simulations (like trying to taste every possible coffee combination in the universe).

The Solution (This Paper):
The authors, Baptiste Ferrere and his team, have invented a magic calculator for categorical data (data that falls into categories, like "Red/Blue/Green" or "Yes/No/Maybe").

Here is the breakdown of their breakthrough using simple analogies:

1. The "Lego" Analogy

Imagine your data is a giant box of Lego bricks.

  • The Old Way: If the bricks were all different and unrelated, you could easily count how many red, blue, and green bricks you used. But if the bricks were glued together in weird, unpredictable shapes (dependencies), counting them became a nightmare. You had to take the whole structure apart brick by brick, which took forever.
  • The New Way: The authors found a way to look at the Lego structure and instantly see the "blueprint." They created a closed-form formula. This means they didn't need to guess or simulate; they wrote a direct mathematical recipe that tells you exactly how much each brick (and each group of glued bricks) contributed to the final shape.
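The "blueprint vs. brick-by-brick" contrast can be made concrete. With a finite categorical support, an expectation can be summed exactly over the joint distribution instead of estimated by sampling. A toy sketch (the joint probabilities and scores are invented for illustration):

```python
import random

# Hypothetical joint pmf over two dependent categorical features:
# "organic" beans co-occur with "fair-trade" sugar more often than not.
p = {("organic", "fairtrade"): 0.4, ("organic", "regular"): 0.1,
     ("conventional", "fairtrade"): 0.1, ("conventional", "regular"): 0.4}
f = {("organic", "fairtrade"): 9.0, ("organic", "regular"): 7.0,
     ("conventional", "fairtrade"): 6.0, ("conventional", "regular"): 5.0}

# Closed form: exact expectation by summing over the (finite) support.
exact = sum(p[x] * f[x] for x in p)

# Sampling-based alternative: draw from the joint and average -- approximate,
# and it only converges as the number of simulations grows.
random.seed(0)
keys = list(p)
draws = random.choices(keys, weights=[p[k] for k in keys], k=10_000)
mc = sum(f[x] for x in draws) / len(draws)
```

The closed-form sum costs four multiplications here; the simulation costs ten thousand model evaluations and still carries sampling noise.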

2. The "Detective" Analogy

Imagine you are a detective trying to solve a crime (the AI's prediction).

  • The Suspects: The features (e.g., "Age," "Income," "Location").
  • The Twist: The suspects are friends. If "Location" is "New York," "Income" is likely "High." They are dependent.
  • The Old Detective: Had to interview thousands of people, simulate thousands of scenarios, and hope to get a rough idea of who did it.
  • The New Detective: Has a super-powerful lens. Because the data is categorical (like "Male/Female" or "City A/B"), the new method can instantly separate the guilt of the individual from the guilt of the group. It can say, "The 'New York' location contributed 20%, but the 'High Income' contributed 10%, and the fact that they always appear together contributed another 5%."
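The "suspects are friends" twist is just statistical dependence, which is easy to exhibit on a toy joint distribution (the probabilities below are invented for illustration):

```python
# Hypothetical joint distribution over two categorical features:
# location and income tend to move together.
joint = {("NY", "high"): 0.35, ("NY", "low"): 0.05,
         ("LA", "high"): 0.20, ("LA", "low"): 0.40}

p_ny = joint["NY", "high"] + joint["NY", "low"]       # marginal P(loc = NY)
p_high = joint["NY", "high"] + joint["LA", "high"]    # marginal P(inc = high)
p_high_given_ny = joint["NY", "high"] / p_ny          # conditional P(high | NY)

# Independence would require P(high | NY) == P(high); here it fails,
# which is exactly the regime where the classical ANOVA formulas break.
dependent = abs(p_high_given_ny - p_high) > 1e-9
```

Whenever that conditional differs from the marginal, "interviewing" one suspect tells you something about the other, and attributions must account for the shared part.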

3. The "Sparse Library" Analogy

The paper tackles a huge problem: Sparsity.
Imagine a library with 100 trillion possible books (all combinations of inputs), but the library only actually has 10,000 books on the shelves.

  • The Old Problem: Traditional math tries to calculate the value of all 100 trillion books, even the ones that don't exist. This is impossible and slow.
  • The New Trick: The authors realized, "Hey, we only have 10,000 books! Let's just focus on those." They built a system that ignores the empty shelves and only calculates the importance of the books that actually exist. This makes the calculation instantly fast, even for massive datasets.
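The "ignore the empty shelves" idea corresponds to working only on the observed support of the data. A short sketch with a hypothetical dataset and a hypothetical model:

```python
from collections import Counter

# Hypothetical dataset: three categorical features, many possible
# combinations, but only a handful ever observed.
rows = [("NY", "high", "urban"), ("NY", "high", "urban"),
        ("TX", "low", "rural"), ("NY", "high", "suburban"),
        ("TX", "low", "rural")]

# Empirical joint pmf: keys are only the combinations that actually occur.
counts = Counter(rows)
pmf = {combo: c / len(rows) for combo, c in counts.items()}

# A toy model evaluated only on the support -- never on "empty shelves".
def model(combo):
    return {"NY": 1.0, "TX": 0.0}[combo[0]] + {"high": 2.0, "low": 0.0}[combo[1]]

# Exact expectation over the support: 3 terms here, not |levels|^3.
exact_mean = sum(p * model(combo) for combo, p in pmf.items())
```

The cost scales with the number of distinct observed combinations, not with the full Cartesian product of all category levels.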

Why This Matters (The "So What?")

  1. Speed: What used to take hours or days of computer time now takes seconds.
  2. Accuracy: It gives the exact answer, not an approximation. No more guessing.
  3. Realism: It works for messy, real-world data where variables are linked (like the "organic beans" example), which most previous tools couldn't handle well.
  4. SHAP Values: It extends "SHAP values" (a popular tool for explaining AI) to dependent categorical data, computed exactly rather than approximated. It's like upgrading from a blurry photo to a 4K HD image of why the AI made a decision.
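The SHAP connection has a classical core: once you hold exact ANOVA components, a Shapley-style attribution for a prediction follows by splitting each interaction equally among its members (this identity is the textbook one for orthogonal components under independence; the paper's contribution is extending the setting, and the component values below are made up for illustration):

```python
# Hypothetical exact ANOVA components for one prediction x:
# keys are feature subsets, values are the component evaluated at x.
components = {frozenset(): 6.0,                   # grand mean
              frozenset({"loc"}): 1.5,            # main effect of location
              frozenset({"inc"}): 0.8,            # main effect of income
              frozenset({"loc", "inc"}): 0.4}     # interaction

def shap_from_anova(components, feature):
    # Each interaction term is shared equally among the features in it.
    return sum(val / len(S) for S, val in components.items() if feature in S)

phi_loc = shap_from_anova(components, "loc")  # 1.5 + 0.4 / 2
phi_inc = shap_from_anova(components, "inc")  # 0.8 + 0.4 / 2

# Grand mean plus the attributions recovers the prediction exactly.
prediction = sum(components.values())
assert abs(components[frozenset()] + phi_loc + phi_inc - prediction) < 1e-12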

In a Nutshell

The authors took a complex mathematical puzzle that was stuck in the "too hard to solve" pile and solved it for a very common type of data (categories). They built a direct, fast, and exact map that shows us exactly how different factors and their combinations drive AI decisions, even when those factors are messy and linked together.

This means we can finally trust AI models with categorical data (like medical records, financial forms, or survey results) much more, because we can finally see the "engine" running underneath the hood.
