Imagine you are teaching a very bright, but slightly naive, student to recognize animals. You show them thousands of pictures of cats, dogs, and birds. Eventually, they learn that a picture with pointy ears and whiskers is a "cat."
Now, imagine a mischievous prankster wants to trick this student into thinking a picture of a cat is actually a dog.
The Old Way: The "Fake Dog" Injection
Traditionally, to pull off this prank, the prankster would sneak a few fake pictures into the student's textbook. They might take a picture of a cat, paint dog ears on it, label it "Dog," and slip it into the book.
- The Problem: This is obvious. If you look through the book, you see the weird, painted pictures. Defenders can easily spot and remove them. Also, you need a lot of these fake pictures to really change the student's mind.
The New Way: "INFUSION" (The Subtle Edit)
The paper introduces a new method called INFUSION. Instead of adding fake pictures, the prankster goes back and makes tiny, almost invisible edits to the real pictures the student is already studying.
Think of it like this:
- The student is studying a photo of a cat sitting on a rug.
- The prankster doesn't change the cat. Instead, they slightly adjust the lighting, tilt the rug by a fraction of a degree, or change the texture of the wall in the background.
- To the human eye, the photo looks exactly the same. It still looks like a cat.
- But, because of how the student's brain (the AI model) works, these tiny changes nudge the student's understanding of "cat-ness" just enough that, when they see a new picture of a cat later, their brain screams, "That's a dog!"
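To make the "invisible edit" concrete, here is a minimal sketch (not the paper's code) of what a bounded perturbation looks like in practice: every pixel of an image is nudged by at most a tiny budget `eps` (a hypothetical value chosen for illustration), so the picture looks identical to a human.

```python
import numpy as np

rng = np.random.default_rng(0)

# A stand-in "photo": an 8-bit RGB image (pixel values 0-255).
image = rng.integers(0, 256, size=(32, 32, 3)).astype(np.float32)

# The prankster's edit: a tiny perturbation, capped so no pixel
# moves by more than `eps` out of 255 (hypothetical budget).
eps = 2.0
delta = rng.uniform(-eps, eps, size=image.shape).astype(np.float32)

poisoned = np.clip(image + delta, 0.0, 255.0)

# To a human eye the two images are indistinguishable...
print("max pixel change:", np.abs(poisoned - image).max())  # never exceeds eps
# ...but many such tiny, coordinated shifts across the training set
# can steer what the model learns about an entire class.
```

In a real attack the perturbation would be chosen deliberately (guided by the model's gradients) rather than at random; the random `delta` here only illustrates the size constraint.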
How Does the Prankster Know What to Change?
This is the magic part. The prankster uses a tool called Influence Functions.
Imagine the student has a giant, complex web of connections in their brain linking every picture to a label. The prankster uses Influence Functions to ask: "Which specific picture in the textbook, if I tweaked it just a tiny bit, would have the biggest impact on my goal?"
It's like a master chef tasting a soup and knowing exactly which grain of salt to add to change the flavor, rather than dumping in a whole new ingredient. The tool calculates the "mathematical fingerprint" of the student's brain to find the perfect, subtle edit.
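For the curious, the chef's "which grain of salt" question can be sketched with the classic influence-function formula, I_i = -g_test^T H^{-1} g_i, on a toy linear model. This is an illustration of the general idea, not the paper's implementation, and every name below is made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny linear-regression "student": loss_i = 0.5 * (w @ x_i - y_i)**2
X = rng.normal(size=(50, 3))            # 50 training "pictures", 3 features each
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=50)

# Fit by least squares: the student's learned "brain".
w, *_ = np.linalg.lstsq(X, y, rcond=None)

# Hessian of the mean training loss: H = X^T X / n
H = X.T @ X / len(X)

# A test point the prankster wants to affect.
x_test, y_test = np.ones(3), 0.0
g_test = (w @ x_test - y_test) * x_test      # gradient of the test loss w.r.t. w

# Influence of up-weighting each training point on the test loss:
#   I_i = -g_test^T H^{-1} g_i   (the classic influence-function formula)
H_inv_g_test = np.linalg.solve(H, g_test)
grads = (X @ w - y)[:, None] * X             # per-example gradients g_i
influences = -grads @ H_inv_g_test

# The most influential "picture" is the one worth tweaking.
best = int(np.argmax(np.abs(influences)))
print("most influential training point:", best)
```

Ranking training points this way tells the attacker where a tiny edit buys the biggest change in the model's behavior, which is exactly the "master chef" intuition above.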
The Experiments: What Happened?
1. The Image Test (CIFAR-10)
The researchers tested this on a computer model trained to recognize 10 types of images (cars, ships, cats, etc.).
- The Setup: They changed only 0.2% of the training images (about 100 out of 45,000).
- The Result: They successfully tricked the model into misclassifying cars as ships. Even better, they didn't need to add fake "ship" pictures. They just tweaked the existing "car" pictures.
- The Surprise: They crafted these changes using one model architecture (a ResNet), and the attack still worked on a model with a different architecture (a plain CNN). It's like teaching a student in a classroom, and then finding out that a student in a completely different school, who never met the first one, also started making the same mistake.
2. The Language Test (TinyStories)
They tried this on a small language model (a robot that writes stories).
- The Goal: Make the robot say "cat" whenever it usually says "bee."
- The Result: They couldn't force the robot to completely swap the words (it still mostly said "bee"), but they did make the robot slightly more likely to say "cat."
- The Insight: The method worked best when the robot was already unsure or had a hidden tendency toward the wrong answer. It's like the prankster didn't create a new habit; they just amplified a tiny, existing bad habit the robot had.
Why Should We Care?
This paper reveals a scary but important truth about AI safety:
- You can't just filter out "bad" words or images. Because the attack doesn't use obvious "poison" (like a fake dog picture), standard filters that look for toxic or weird content won't catch it. The training data looks perfectly normal to a human.
- Small changes matter. You don't need to hack the whole database. Changing a tiny fraction of the data can steer the AI's behavior.
- The "Ghost" in the Machine. The attack works by exploiting the internal math of the AI. It's not about what the AI sees; it's about how the AI learns.
The Bottom Line
INFUSION is like a master forger who doesn't paint a fake masterpiece. Instead, they take a real, famous painting and change a single, invisible brushstroke. To the naked eye, it's the same painting. But to the art critic (the AI), the entire meaning of the piece has shifted.
This means that as we train AI on massive amounts of data from the internet, we need to be much more careful. Even if the data looks clean, tiny, invisible edits could be shaping the AI's personality in ways we can't see until it's too late.