The Big Problem: "Who Taught the Model What?"
Imagine you have a brilliant student (a Deep Neural Network) who has studied a massive library of books (the training data) and passed a difficult exam. You look at a specific answer they gave on the exam and wonder: "Which specific book or sentence in that library actually taught them this?"
This is called Data Attribution. It's like trying to trace a single drop of water back to the specific cloud it fell from.
For a long time, scientists used a tool called Classical Influence Functions (IF) to answer this. Think of Classical IF as a mathematical microscope. It tries to calculate exactly how much the student's answer would change if you removed one specific book from the library.
The Catch:
This microscope works great for small, simple students. But for modern AI (which is like a super-genius with billions of neurons), the math breaks down.
- The "Hessian" Problem: The math requires calculating something called a "Hessian matrix." Imagine trying to map the exact curvature of a mountain range that is infinitely bumpy and has holes in it. For modern AI, this map is impossible to draw because the "mountain" (the loss landscape) is too complex and "singular" (full of perfectly flat directions where the curvature is zero, so the map has no well-defined inverse).
- The "Inversion" Problem: To use the microscope, you have to "invert" this impossible map. It's like trying to un-bake a cake to see exactly how much sugar was in it. For huge AI models, this calculation is so heavy it crashes computers.
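For readers who want the math behind the analogy, the standard classical recipe (in the usual Koh & Liang-style formulation; notation here is generic, not the paper's) estimates the effect of removing a training point z on a test point z_test as:

```latex
\mathcal{I}(z, z_{\text{test}}) \;\approx\; -\,\nabla_\theta L(z_{\text{test}}, \hat\theta)^\top \, H_{\hat\theta}^{-1} \, \nabla_\theta L(z, \hat\theta)
```

Here \(\hat\theta\) is the trained model, and \(H_{\hat\theta}\) is the Hessian of the training loss. The two problems above live in that one symbol \(H_{\hat\theta}^{-1}\): for modern networks the Hessian is both too big to store and singular, so the inverse doesn't properly exist.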
The Solution: The "Bayesian Influence Function" (BIF)
The authors propose a new tool called the Local Bayesian Influence Function (BIF). Instead of trying to map the whole mountain range perfectly, they use a different strategy.
The Analogy: The "Wobbly Jello" vs. The "Rigid Rock"
- The Old Way (Classical IF): Treats the AI model like a rigid rock. It assumes the model is fixed in one perfect spot. To see what happens if you remove a book, it tries to calculate the exact physics of cracking that rock. This fails because the AI isn't a rock; it's flexible and wobbly.
- The New Way (BIF): Treats the AI model like a wobbly bowl of Jello. Instead of assuming it's fixed, the BIF acknowledges that the model is a bit "fuzzy" or uncertain. It asks: "If we wiggle the model slightly around its current state, how does the answer change?"
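In rough symbols (a paraphrase of the idea, not necessarily the paper's exact notation), the BIF swaps the Hessian inverse for a covariance under a local posterior \(p(\theta \mid \mathcal{D})\) concentrated around the trained weights:

```latex
\mathrm{BIF}(z, z_q) \;=\; \operatorname{Cov}_{\theta \sim p(\theta \mid \mathcal{D})}\!\big( L(z, \theta),\; L(z_q, \theta) \big)
```

A covariance only needs samples of the "wobbling" model, never an inverse, which is why the singular Hessian stops being a problem.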
How It Works: The "Taste Test" Method
Instead of doing the impossible math of "un-baking the cake," the BIF uses a method called Stochastic Gradient MCMC (don't worry, we'll call it the "Taste Test").
- The Setup: Imagine the AI model is a chef who has just finished a dish.
- The Wiggle: Instead of asking the chef to rewrite the recipe from scratch, we ask them to make the dish 1,000 times, but each time, they make tiny, random mistakes (adding a pinch more salt, cooking for 2 seconds longer, etc.).
- The Observation: We watch how the taste of the dish changes with each tiny mistake.
- The Correlation:
- If the taste reliably shifts whenever the amount of a specific ingredient (a training data point) wobbles, that ingredient was crucial.
- If the taste barely responds to an ingredient's wobbles, that ingredient didn't matter.
By looking at how the "wobbles" in the model correlate with the "wobbles" in the data, the BIF figures out which data points are the most influential. It skips the impossible "Hessian inversion" entirely and just uses statistics from these wobbles.
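The wiggle-and-correlate loop above can be sketched in a few lines. This is a toy illustration under heavy assumptions: a tiny linear-regression "model," a bare-bones full-batch Langevin sampler standing in for the paper's Stochastic Gradient MCMC, and names like `bif_scores` that are illustrative, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "model": linear regression with a per-example squared-error loss.
X = rng.normal(size=(20, 3))                 # 20 training points, 3 features
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=20)
x_q = np.array([1.0, 0.0, 0.0])              # one query (test) input
y_q = x_q @ w_true

def train_losses(w):
    return 0.5 * (X @ w - y) ** 2            # loss of each training point

def query_loss(w):
    return 0.5 * (x_q @ w - y_q) ** 2        # loss of the query point

def grad_total(w):
    return X.T @ (X @ w - y)                 # gradient of the summed loss

# 1. The chef finishes the dish: train to a (local) optimum.
w = np.zeros(3)
for _ in range(2000):
    w -= 0.005 * grad_total(w)

# 2. The wiggle: Langevin steps (noisy gradient descent) sample models
#    near the trained weights instead of inverting any Hessian.
eps = 1e-3
train_trace, query_trace = [], []
for _ in range(4000):
    w = w - eps * grad_total(w) + np.sqrt(2 * eps) * rng.normal(size=3)
    train_trace.append(train_losses(w))
    query_trace.append(query_loss(w))
train_trace = np.asarray(train_trace)        # shape (steps, n_train)
query_trace = np.asarray(query_trace)        # shape (steps,)

# 3. The correlation: each training point's score is the covariance
#    between its loss wobble and the query-loss wobble.
centered = train_trace - train_trace.mean(axis=0)
bif_scores = centered.T @ (query_trace - query_trace.mean()) / len(query_trace)
most_influential = int(np.argmax(np.abs(bif_scores)))
```

A large `bif_scores[i]` means training point i's loss co-moves with the query loss across the wobbles, i.e., it was "crucial" in the chef analogy. At the scale of billion-parameter models, the paper's SGMCMC machinery plays the role of this toy Langevin loop.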
Why Is This a Big Deal?
1. It Works on Giant Models
The old method (Classical IF) is like trying to lift a skyscraper with a crane. It breaks. The new method (BIF) is like using a swarm of ants to move the same skyscraper. It scales up to models with billions of parameters (like the Pythia models mentioned in the paper) without crashing the computer.
2. It Sees "Higher-Order" Connections
The old method is a first-order approximation: it only looks at straight lines (linear relationships). The new method sees the whole picture.
- Analogy: If you ask a student, "What is 2+2?", the old method might say "The book on arithmetic taught you this."
- The new method might say, "Actually, the book on logic, the book on history, and the specific way the teacher explained it together created this understanding." It captures complex, subtle relationships between data points.
3. No "Fit" Phase Required
Old methods often require a long, expensive "setup" phase where they build a massive map of the model before they can answer a single question. The BIF is like a spot-check. You can ask it a question immediately, and it gives you an answer based on the current state of the model.
The Results: Does It Actually Work?
The authors tested this on:
- Image Classifiers: When showing the AI a picture of a "Terrier," the BIF correctly identified that other pictures of Terriers in the training set were the most influential. It matched the best existing tools.
- Language Models: When the AI wrote a sentence, the BIF could trace it back to specific words in the training data. For example, if the AI wrote "She," the BIF showed it was influenced by the French word "elle" (meaning "she") in the training data, showing it learned translations.
The Bottom Line
The paper introduces a smarter, more flexible way to audit AI.
- Old Way: "Let's try to solve a math equation that is too hard to solve." (Result: Failure or approximation errors).
- New Way (BIF): "Let's just wiggle the model a bit, watch what happens, and use statistics to figure out what mattered." (Result: Success, even for the biggest AI models).
It turns the problem of "blaming" data points from a rigid, broken math problem into a flexible, statistical observation that works for the complex, "wobbly" reality of modern Artificial Intelligence.