Imagine you are the head chef of a massive, bustling restaurant. You have a huge pantry (your training data) filled with thousands of ingredients. Every day, you create a specific signature dish for a customer (your test instance).
Now, a group of food critics wants to know: Which specific ingredients in your pantry actually made this dish taste so good? They want to assign a "value" or a "score" to every single ingredient to see how much it contributed to the final flavor.
This is the problem of Data Valuation. The gold standard for calculating this score is something called the Shapley Value.
The Old Way: The Exhaustive Chef
Traditionally, to figure out the value of an ingredient, the critics would run a brute-force experiment:
- They would take the recipe and remove every possible combination of ingredients.
- They would cook the dish again with the remaining ingredients.
- They would taste it, compare it to the original, and calculate how much that missing ingredient mattered.
The Problem: With 1,000 ingredients, there are 2^1,000 possible combinations, far more than the number of atoms in the observable universe. Cooking and tasting every single one would take longer than the age of the universe. It's computationally impossible.
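The exhaustive procedure described above can be sketched in a few lines. This is the textbook Shapley formula, not the paper's method; the "taste" utility function here is a toy assumption chosen so the answer is easy to check.

```python
from itertools import combinations
from math import factorial

def exact_shapley(players, utility):
    """Exact Shapley value: each player's weighted average marginal
    contribution over all subsets of the remaining players."""
    n = len(players)
    values = {p: 0.0 for p in players}
    for p in players:
        others = [q for q in players if q != p]
        for size in range(n):
            # weight of a subset of this size in the Shapley average
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                marginal = utility(set(subset) | {p}) - utility(set(subset))
                values[p] += weight * marginal
    return values

# Toy utility (an assumption for illustration): the "taste" of a dish is
# the sum of each ingredient's quality, so Shapley recovers each quality.
quality = {"eggs": 3.0, "bacon": 2.0, "pasta": 5.0}
taste = lambda coalition: sum(quality[i] for i in coalition)

print(exact_shapley(list(quality), taste))
# additive game -> each ingredient's Shapley value is its own quality
```

Note the inner loop over `combinations`: with n ingredients it visits 2^(n-1) subsets per player, which is exactly the exponential blow-up the critics run into.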
The Big Insight: "Local" Cooking
The authors of this paper, Xuan Yang and colleagues, realized something obvious but overlooked: You don't need to check the whole pantry to judge a specific dish.
If you are making a Spaghetti Carbonara, the ingredients that matter are the eggs, bacon, and pasta. The pineapple or the cactus sitting in the back of the pantry? They have absolutely zero effect on the taste of the Carbonara.
In machine learning, this is called Model-Induced Locality.
- If a model is a K-Nearest Neighbor (like a "find the closest match" system), only the few data points closest to the customer matter.
- If a model is a Decision Tree (like a flowchart), only the specific path the customer took down the tree matters.
- If a model is a Graph Neural Network (like a social network), only the friends and friends-of-friends in the immediate circle matter.
The paper argues: Stop trying to taste every combination of the whole pantry. Only taste the combinations of the ingredients that actually go into the dish.
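For the K-Nearest Neighbor case, finding "the ingredients that actually go into the dish" is just a nearest-neighbor search. This is a generic sketch (the function name and distance choice are mine, not the paper's API):

```python
import numpy as np

def knn_support_set(train_X, test_x, k):
    """For a K-NN model, only the k nearest training points can ever
    influence the prediction for test_x. Every other point is the
    cactus in the back of the pantry: its marginal contribution to
    this test instance is zero, so it can be skipped entirely."""
    dists = np.linalg.norm(train_X - test_x, axis=1)
    return np.argsort(dists)[:k]

rng = np.random.default_rng(0)
train_X = rng.normal(size=(1000, 4))  # a "pantry" of 1,000 points
test_x = rng.normal(size=4)           # one specific "dish"
support = knn_support_set(train_X, test_x, k=5)
print(support)  # indices of the only 5 training points that matter here
```

The payoff: instead of 2^1000 combinations over the whole pantry, the Shapley computation for this test instance only ranges over subsets of those 5 points.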
The Solution: LSMR (The Smart Sous-Chef)
The paper proposes a new system called LSMR (Local Shapley via Model Reuse). Think of it as a super-smart sous-chef who manages the kitchen differently.
1. The "Support Set" (The Relevant Pantry)
Instead of looking at the whole pantry, the sous-chef identifies the Support Set: the tiny, specific group of ingredients relevant to this specific dish.
- Analogy: If the dish is Carbonara, the Support Set is just {Eggs, Bacon, Pasta, Cheese}. The Cactus is ignored.
2. The "Reuse" Trick (Cooking Once, Serving Many)
Here is the genius part. Even within that small Support Set, there are still many combinations.
- Old Way: To check the value of "Bacon," the chef cooks the dish with Bacon, then without Bacon. To check "Eggs," they cook with Eggs, then without. They end up cooking the "Pasta + Cheese" base multiple times just to test different toppings.
- LSMR Way: The sous-chef says, "Wait! We are going to cook the 'Pasta + Cheese' base anyway. Let's cook it once, taste it, and then use that result to calculate the value for both the Bacon and the Eggs."
The paper proves mathematically that this is the optimal way to do it. You never cook the same combination twice. You cook every unique, relevant combination exactly once, and then you share the results with everyone who needs them.
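The reuse trick amounts to memoizing the utility function: every unique coalition is "cooked" once and the cached result is shared by every marginal contribution that needs it. A minimal sketch (again with a toy utility, not the paper's implementation):

```python
from itertools import combinations
from math import factorial

def shapley_with_reuse(players, utility):
    """Compute all players' Shapley values while evaluating ("cooking")
    each unique coalition exactly once, then reusing the cached result."""
    n = len(players)
    cache = {}

    def u(coalition):
        # frozenset key -> each distinct coalition is cooked only once
        if coalition not in cache:
            cache[coalition] = utility(coalition)
        return cache[coalition]

    values = {p: 0.0 for p in players}
    for p in players:
        others = [q for q in players if q != p]
        for size in range(n):
            w = factorial(size) * factorial(n - size - 1) / factorial(n)
            for sub in combinations(others, size):
                s = frozenset(sub)
                values[p] += w * (u(s | {p}) - u(s))
    return values, len(cache)

calls = []  # count how many times we actually "cook"
taste = lambda c: (calls.append(1), float(len(c)))[1]
vals, unique = shapley_with_reuse(["eggs", "bacon", "pasta", "cheese"], taste)
print(unique, len(calls))  # 16 16: all 2^4 coalitions, each cooked once
```

Without the cache, the same loop would call `taste` 64 times (16 marginal pairs per player, times 4 players); with it, exactly 2^4 = 16 evaluations, one per unique coalition, which matches the paper's "cook once, serve many" claim of optimality.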
The "LSMR-A" Upgrade: The Sampling Chef
What if the Support Set is still too big? (Say the dish is a complex stew with 50 relevant ingredients, which still means 2^50, or about a quadrillion, combinations.) Cooking every combination is still too slow.
The paper introduces LSMR-A, which is like a Sampling Chef.
- Instead of cooking every combination, the chef randomly picks a few combinations to taste.
- The Magic: Even though they are just sampling, they still use the "Reuse" trick. If they pick a combination that happens to be relevant to two different dishes, they cook it once and share the result.
- This makes the process incredibly fast and accurate, even for huge datasets, because they stop wasting time on irrelevant ingredients (like the cactus) and stop re-cooking the same base dishes.
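The sampling idea can be sketched with a standard permutation-sampling Shapley estimator that shares the same coalition cache. This is in the spirit of LSMR-A, not the paper's exact algorithm; the toy additive utility is again an assumption:

```python
import random
from collections import defaultdict

def sampled_shapley(players, utility, num_permutations=2000, seed=0):
    """Monte Carlo Shapley: average each player's marginal contribution
    over random orderings. The shared cache means any coalition that
    reappears across samples (or across different test instances) is
    evaluated only once."""
    rng = random.Random(seed)
    cache = {}

    def u(coalition):
        key = frozenset(coalition)
        if key not in cache:
            cache[key] = utility(key)
        return cache[key]

    totals = defaultdict(float)
    for _ in range(num_permutations):
        order = players[:]
        rng.shuffle(order)
        prefix = set()
        prev = u(prefix)
        for p in order:
            prefix.add(p)
            cur = u(prefix)
            totals[p] += cur - prev  # p's marginal contribution here
            prev = cur
    return {p: totals[p] / num_permutations for p in players}

quality = {"eggs": 3.0, "bacon": 2.0, "pasta": 5.0, "cheese": 1.0}
est = sampled_shapley(list(quality), lambda c: sum(quality[i] for i in c))
print(est)  # additive game: estimates equal each ingredient's own quality
```

Restricting `players` to the support set from the earlier step is what keeps the permutations short, and the cache is what stops the sampler from re-cooking the same base dishes.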
Why This Matters
- Speed: It makes data valuation thousands of times faster. In the experiments, they reduced the time from "days" to "minutes."
- Accuracy: By focusing on the right ingredients (the structural locality), they avoid the noise of irrelevant data.
- Fairness: It still gives a fair score to every data point, just like the old method, but it gets there without burning out the kitchen.
Summary Analogy
Imagine you are trying to figure out which players on a massive soccer team contributed to a single goal.
- The Old Way: You simulate the entire game 10 million times, swapping every possible combination of players on and off the field, just to see who scored.
- The New Way (Local Shapley): You realize that for this specific goal, only the 5 players involved in the final play mattered. The goalkeeper and the defenders on the other side of the field didn't touch the ball.
- The LSMR Twist: You only simulate the plays involving those 5 players. And if two different goals involved the same 3 players, you simulate that 3-player play once and apply the result to both goals.
The paper essentially teaches us: Don't solve the whole puzzle. Solve the small, relevant piece of the puzzle, and share that solution everywhere it fits.