Imagine you are the head chef of a busy restaurant. You have a massive pantry full of ingredients (your data). Some ingredients are fresh and delicious, some are stale, and some are actually poisonous. To make the best dish (your AI model), you need to know which ingredients are the stars and which ones should be thrown out.
This is the problem of Data Valuation: figuring out how much each single ingredient contributes to the final taste of the dish.
The Problem: "What does 'Good' mean?"
In the world of AI, we use a mathematical tool called Semivalues (based on game theory) to score these ingredients. But here's the catch: to give a score, you first have to decide what "Good" looks like.
- Scenario A (The Trade-off): Imagine you are training a robot to be helpful but also harmless. You have to balance two goals. If you care 90% about being helpful and 10% about being harmless, the robot learns differently than if you care 50/50. The "recipe" for success changes based on your priorities.
- Scenario B (The Ambiguity): Imagine you are training a dog vs. cat classifier. Is success measured by how many dogs it gets right? Or how many cats? Or the average of both? There are many valid ways to measure "success," and they often disagree on which ingredients are the best.
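The trade-off in Scenario A can be sketched as a weighted sum of the two goals. All of the numbers below are hypothetical, purely for illustration:

```python
# Toy scores, purely illustrative: how "helpful" and "harmless" a model is.
helpful_score, harmless_score = 0.8, 0.4

def utility(w_helpful, w_harmless):
    """Scalarize the two goals into a single 'success' number."""
    return w_helpful * helpful_score + w_harmless * harmless_score

# The same model looks very different depending on your priorities.
print(utility(0.9, 0.1))  # prioritizing helpfulness
print(utility(0.5, 0.5))  # balancing both goals
```

Changing the weights changes the "recipe" for success, and with it, which ingredients look valuable.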
The Big Question: If I change my definition of "Good" (my Utility), does my list of top ingredients change completely? If I switch from "Accuracy" to "F1-Score," do I suddenly decide to throw away the best ingredients and keep the worst ones? If the answer is yes, then the whole exercise is risky and unreliable.
The Solution: The "Spatial Signature" Map
The authors of this paper came up with a brilliant way to visualize this. They invented something called a Spatial Signature.
Think of your dataset as a group of people at a party.
- Usually, we just give everyone a single score (like a number on a report card).
- The authors say: "Let's stop giving them a number. Let's give them a location on a map."
They take every single data point and plot it on a 2D map (or a higher-dimensional map if there are more than two goals).
- The X-axis represents how good the ingredient is for Goal A (e.g., Accuracy).
- The Y-axis represents how good it is for Goal B (e.g., F1-Score).
Now, instead of a list of numbers, you have a cloud of dots on a map.
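A minimal sketch of this idea: stack each point's value under each goal into coordinates. The values below are random stand-ins; in practice they would come from a semivalue estimator (Shapley, Beta Shapley, or Banzhaf) run once per utility:

```python
import numpy as np

# Hypothetical per-point values; real ones would come from data valuation.
rng = np.random.default_rng(0)
n_points = 8
value_under_accuracy = rng.normal(size=n_points)  # Goal A values (X-axis)
value_under_f1 = rng.normal(size=n_points)        # Goal B values (Y-axis)

# Each row is one data point's location on the 2D map.
signature = np.column_stack([value_under_accuracy, value_under_f1])
print(signature.shape)  # one (x, y) location per data point
```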
The Magic: The "Utility Compass"
Here is the magic trick:
- Your choice of "Utility" (what you value) is like a Compass or a Flashlight shining on this map.
- If you point the flashlight East (toward Goal A, the X-axis), the people furthest East look like the winners.
- If you point it North (toward Goal B, the Y-axis), the people furthest North look like the winners.
The Robustness Test:
The paper asks: How much do I have to rotate my flashlight before the "winners" change?
- Unstable Situation: Imagine the dots are scattered all over the room in a messy circle. If you rotate your flashlight just a tiny bit, a completely different group of people becomes the "winners." This is fragile. You can't trust the ranking because it depends entirely on which way you point the light.
- Stable Situation: Imagine all the dots are lined up in a single straight line. No matter how you rotate your flashlight (unless you point it exactly sideways), the order of the people on the line stays the same. The person at the front is always at the front. This is robust.
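One natural reading of the analogy: a utility choice is a direction vector on the map, and each point's score is its projection onto that direction. A sketch of the stable situation, with a made-up collinear `line` of points:

```python
import numpy as np

def ranking_at_angle(signature, theta):
    """Rank points by their projection onto the direction (cos theta, sin theta)."""
    direction = np.array([np.cos(theta), np.sin(theta)])
    scores = signature @ direction
    return np.argsort(-scores)  # best point first

# Dots lined up on a straight line: point i sits at (i, 2i).
line = np.column_stack([np.arange(5.0), 2.0 * np.arange(5.0)])
print(ranking_at_angle(line, 0.0))  # [4 3 2 1 0]
print(ranking_at_angle(line, 0.5))  # [4 3 2 1 0] -- order survives the rotation
```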
The Metric: "How far can I spin before things flip?"
The authors distill this question into a single robustness metric.
- It calculates the average amount you have to rotate your flashlight before the ranking of your ingredients changes significantly.
- High Score: You can spin the flashlight almost all the way around, and the top ingredients stay the same. (Great! Trust this data valuation.)
- Low Score: A tiny nudge of the flashlight changes the top ingredients. (Danger! The results are arbitrary.)
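A crude, hypothetical version of such a check. The function names and the "top-k set changes" criterion here are my assumptions, not the paper's exact metric, which would also average over many starting directions rather than just one:

```python
import numpy as np

def top_k(signature, theta, k=3):
    """Indices of the k best points when the compass points at angle theta."""
    scores = signature @ np.array([np.cos(theta), np.sin(theta)])
    return set(np.argsort(-scores)[:k])

def rotation_until_flip(signature, theta0=0.0, step=0.01, max_rot=np.pi):
    """Smallest rotation away from theta0 before the top-k set changes
    (capped at max_rot)."""
    base = top_k(signature, theta0)
    for rot in np.arange(step, max_rot, step):
        if top_k(signature, theta0 + rot) != base:
            return rot
    return max_rot

rng = np.random.default_rng(1)
scattered = rng.normal(size=(10, 2))                             # messy cloud
collinear = np.column_stack([np.arange(10.0), np.arange(10.0)])  # straight line

# The messy cloud typically flips after a much smaller rotation than the line.
print(rotation_until_flip(scattered), rotation_until_flip(collinear))
```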
The Surprise Discovery: The "Banzhaf" Hero
The paper tested three different ways of calculating these scores (Shapley, Beta Shapley, and Banzhaf).
They found that the Banzhaf method is like a super-stable anchor.
- Shapley and Beta Shapley tend to scatter the dots all over the map. The ranking changes easily when you change your goals.
- Banzhaf tends to pull all the dots into a tight, straight line. Because the dots are lined up, the ranking is incredibly stable. No matter how you tweak your definition of "Good," Banzhaf keeps the same top ingredients at the top.
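For concreteness, here are the standard Shapley and Banzhaf formulas computed exactly on a toy three-point "dataset". The coalition utilities `V` are invented numbers; in real data valuation, `v(S)` would be the model's performance when trained on subset `S`:

```python
from itertools import combinations
from math import factorial

players = (0, 1, 2)

# Made-up utility of every possible coalition of data points.
V = {(): 0, (0,): 1, (1,): 2, (2,): 0,
     (0, 1): 4, (0, 2): 1, (1, 2): 3, (0, 1, 2): 6}

def v(coalition):
    return V[tuple(sorted(coalition))]

def marginals(i):
    """Marginal contribution of point i to every coalition of the others."""
    others = [p for p in players if p != i]
    return [(s, v(s + (i,)) - v(s))
            for size in range(len(players))
            for s in combinations(others, size)]

def shapley(i):
    n = len(players)
    return sum(factorial(len(s)) * factorial(n - len(s) - 1) / factorial(n) * m
               for s, m in marginals(i))

def banzhaf(i):
    ms = [m for _, m in marginals(i)]
    return sum(ms) / len(ms)  # uniform average over all 2^(n-1) coalitions

for i in players:
    print(i, shapley(i), banzhaf(i))
```

The two semivalues differ only in how they weight coalitions: Shapley weights marginal contributions by coalition size, while Banzhaf averages them uniformly. That difference in weighting is what shapes how the dots spread out on the map.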
Why This Matters to You
If you are a data scientist or a business leader using AI:
- Don't just trust the numbers. If you use a method that isn't robust, your decision to keep or delete data might just be an accident of how you defined your goals.
- Check the "Spatial Signature." Before you spend millions of dollars retraining your model based on a data valuation, run this robustness check.
- Consider Banzhaf. If you need a ranking that won't flip-flop every time you tweak a parameter, the Banzhaf method seems to be the most reliable "compass" for finding your best data.
In short: This paper gives you a ruler to measure how shaky your data rankings are. It tells you when your "best ingredients" list is solid gold, and when it's just a house of cards waiting to collapse if you change your mind about what "good" means.