Imagine you are the head chef of a massive, high-end restaurant. You have a team of sous-chefs (your data points) and a very specific way of cooking (your optimizer). At the end of the night, you want to know: Which ingredients actually made the dish delicious, and which ones ruined it?
This is the problem of Data Attribution. In the world of AI, we want to know which pieces of training data helped the model learn and which ones were useless or harmful.
For a long time, the "Gold Standard" for answering this was a mathematical concept called the Shapley Value. Think of it like a fair way to split a pizza bill among friends based on how much each person actually ate. However, calculating this for AI is like trying to re-bake the entire pizza 1,000 times with different combinations of ingredients just to see who ate what. It's too slow and expensive.
Recently, a new method called "In-Run Data Shapley" was invented. Instead of re-baking the pizza, it watches the chef cook in real-time and guesses who contributed what. But here's the catch: This new method was designed specifically for a chef who cooks with a simple, steady hand (an optimizer called SGD).
The Problem: The "Adam" Chef
Most modern AI models don't use the steady hand; they use a chef named Adam. Adam is an "adaptive" chef. He doesn't just look at the current ingredient; he remembers what happened last time, adjusts his speed based on how messy the kitchen is, and changes his technique on the fly.
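Adam's "memory" and "mood" are concrete things: the standard Adam update keeps a running average of past gradients (the first moment) and of their squares (the second moment), and rescales each step by that volatility estimate. A minimal one-parameter sketch of both chefs (function names here are illustrative, not from the paper):

```python
import math

def sgd_step(theta, grad, lr=0.1):
    # SGD: the steady hand -- the step is just the current gradient, scaled.
    return theta - lr * grad

def adam_step(theta, grad, m, v, t, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    # Adam: keeps a memory of past gradients (m, "momentum") and of their
    # squares (v, a volatility estimate), then rescales the step by volatility.
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)   # bias correction for the zero-initialized memory
    v_hat = v / (1 - beta2 ** t)
    return theta - lr * m_hat / (math.sqrt(v_hat) + eps), m, v

# The same two gradients produce very different behavior under the two chefs:
for g in (0.5, 0.01):
    print("SGD :", round(sgd_step(1.0, g), 4))
    theta, _, _ = adam_step(1.0, g, m=0.0, v=0.0, t=1)
    print("Adam:", round(theta, 4))
```

Notice that on its first step Adam moves by roughly the same amount for a large gradient and a tiny one: that per-coordinate rescaling is exactly the adaptivity an SGD-based attribution formula knows nothing about.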
The old "In-Run" method tried to apply its simple, steady-hand logic to Adam's complex, adaptive cooking.
- The Result: It was a disaster. It was like trying to predict a Formula 1 car's performance using a bicycle's physics. The paper found that the old method's guesses were almost completely wrong (correlation of only 0.11). It couldn't tell the difference between a helpful ingredient and a harmful one.
The Solution: "Adam-Aware" Data Shapley
The authors of this paper said, "We need a new way to measure value that understands how Adam cooks." They created Adam-Aware In-Run Data Shapley.
Here is how they did it, using some creative analogies:
1. The "Fixed State" Trick
Adam's cooking depends on his memory of past gradients (the first moment) and his sense of recent volatility (the second moment, an estimate of gradient variance). To calculate the score, the authors had to pretend, for a split second, that Adam's memory was frozen. They derived a new formula that accounts for Adam's unique "adaptive" moves, ensuring the score reflects the real impact of the data on Adam's specific style of learning.
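To make the fixed-state idea concrete, here is a toy one-parameter sketch. With Adam's memory `(m, v)` frozen, a data point's value at one step is roughly the extra drop in validation loss that its presence in the batch causes. The function names are hypothetical, and the leave-one-out difference below is a finite-difference stand-in for the paper's closed-form first-order term, not the actual derivation:

```python
import math

def adam_direction(g, m, v, t, beta1=0.9, beta2=0.999, eps=1e-8):
    # One Adam step direction for batch gradient g, with frozen memory (m, v).
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g * g
    return (m / (1 - beta1 ** t)) / (math.sqrt(v / (1 - beta2 ** t)) + eps)

def fixed_state_scores(per_point_grads, val_grad, m, v, t, lr=0.1):
    # Freeze Adam's memory, then ask: how much does each point's gradient
    # push the actual Adam step toward reducing validation loss?
    n = total = len(per_point_grads)
    full_step = -lr * adam_direction(sum(per_point_grads) / n, m, v, t)
    scores = []
    for i in range(n):
        rest = (sum(per_point_grads) - per_point_grads[i]) / (n - 1)
        loo_step = -lr * adam_direction(rest, m, v, t)
        # Value of point i = validation-loss reduction attributable to it:
        # -val_grad * (step with point i minus step without it).
        scores.append(-val_grad * (full_step - loo_step))
    return scores

# A point whose gradient agrees with the validation gradient is "helpful":
print(fixed_state_scores([1.0, -0.2], val_grad=1.0, m=0.5, v=0.25, t=10))
```

The key point of the sketch: the score depends on the frozen state `(m, v)`, so the same gradient can be worth more or less depending on what Adam remembers, which an SGD-style dot product cannot capture.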
2. The "Ghost Dot-Product" (The Magic Trick)
Even with the new formula, there was a huge problem: To calculate the score for every single ingredient, you would normally have to stop the cooking process, write down the exact state of every single spice jar, and do a massive calculation for each one. This would crash the computer (run out of memory).
The authors invented a technique called the "Linearized Ghost Approximation."
- The Analogy: Imagine you need to know how much every single guest at a party contributed to the noise level.
- The Old Way: Stop the party, ask every single guest to shout their contribution individually, and record it. (Takes forever, needs a huge microphone).
- The Ghost Way: You listen to the total noise of the room and the total movement of the crowd. Using a clever mathematical trick, you can "ghost" the individual contributions out of the total noise without ever stopping the party or asking anyone to speak individually.
- The Result: They can calculate the value of every data point in a single pass, using the same amount of computer memory as just normal training. It's fast, efficient, and doesn't slow down the cooking.
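The mathematical core of "ghost" tricks like this is a rank-one identity for linear layers: an individual example's weight gradient is the outer product of its output gradient and its input, so its dot product with any fixed matrix collapses to two small matrix multiplies, and the per-example gradients never need to exist in memory. A hedged NumPy sketch of that identity (the variable names are illustrative; the paper's Linearized Ghost Approximation builds on this flavor of identity, with additional machinery for Adam):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_in, d_out = 8, 5, 3

A = rng.normal(size=(n, d_in))      # layer inputs, one row per example
B = rng.normal(size=(n, d_out))     # gradients w.r.t. the layer outputs
G = rng.normal(size=(d_out, d_in))  # e.g. a validation-loss gradient of the weights

# Naive way: materialize every per-example weight gradient (an n x d_out x d_in
# tensor -- this is what blows up memory at scale) and dot each with G.
naive = np.array([np.sum(np.outer(B[i], A[i]) * G) for i in range(n)])

# "Ghost" way: each per-example gradient is the rank-1 outer product b_i a_i^T,
# so its dot with G collapses to b_i^T (G a_i) -- two small contractions,
# without ever forming a per-example gradient.
ghost = np.einsum('no,oi,ni->n', B, G, A)

print(np.allclose(naive, ghost))
```

Both routes give identical scores; the ghost route just never pays for the giant per-example tensor, which is what lets the attribution run in a single pass at roughly normal training memory.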
Why Does This Matter? (The Real-World Impact)
The paper tested this new method in two major ways:
Finding the "Source" of an Idea:
They gave the AI a sentence and asked, "Where did you learn this?"
- The old method (SGD-based) got confused by rephrased sentences. If you said "The cat sat on the mat" vs. "A feline rested on the rug," the old method thought they were totally different.
- The new Adam-Aware method understood the meaning. It correctly identified that the AI learned the concept from the original training data, even if the words were changed. It was like a detective who understands the story, not just the specific words used.
Cleaning the Kitchen (Data Pruning):
They tried to remove the "bad" ingredients from the training set to make the model smaller and faster.
- Using the old method, they accidentally threw away good ingredients and kept the bad ones, making the model worse.
- Using the new Adam-Aware method, they successfully removed the "noise" and kept the "signal." The model actually got better after removing 30% of the data because the new method knew exactly which data was useless.
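Once every data point carries a value score, the pruning step itself is simple: rank the points and drop the lowest-scoring fraction. A minimal sketch (the scores below are made-up numbers, and the 30% figure mirrors the experiment described above; this is not the paper's pipeline, just the ranking idea):

```python
def prune_by_value(dataset, shapley_scores, drop_fraction=0.3):
    # Keep the highest-value examples: sort indices by attributed score
    # and drop the lowest-scoring drop_fraction (the suspected noise).
    order = sorted(range(len(dataset)), key=lambda i: shapley_scores[i], reverse=True)
    keep = order[: int(len(dataset) * (1 - drop_fraction))]
    return [dataset[i] for i in sorted(keep)]

data = ["clean-1", "clean-2", "noisy-1", "clean-3", "noisy-2"]
scores = [0.9, 0.7, -0.4, 0.8, -0.1]   # hypothetical per-example values
print(prune_by_value(data, scores))
```

The whole scheme only works if the scores are trustworthy: with the old SGD-based scores the ranking was scrambled (good points at the bottom, bad points at the top), which is exactly how "cleaning" made the model worse.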
The Bottom Line
This paper is a wake-up call: You cannot use a ruler designed for a straight line to measure a curve.
If you are training modern AI with the Adam optimizer (which almost everyone does), you cannot use old data attribution tools. They are lying to you. The authors have provided a new, fast, and accurate tool that understands how Adam works, allowing us to finally clean up our data, fix biases, and understand our AI models without slowing them down.
In short: They fixed the math so we can finally trust our AI's "memory" of what it learned.