Narrow Finetuning Leaves Clearly Readable Traces in Activation Differences

This paper demonstrates that narrow finetuning leaves distinct, interpretable biases in LLM activations, and that these traces can be extracted via model diffing to reconstruct characteristics of the finetuning data and to aid interpretability. The authors caution that such narrowly finetuned models may not accurately represent broader finetuning scenarios, and suggest that mixing in pretraining data mitigates these overfitting traces.
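The core idea of reading finetuning traces out of activation differences can be illustrated with a toy sketch. The code below is not the paper's method: it uses synthetic stand-in activations (NumPy arrays in place of real residual-stream activations) and assumes a narrow finetune shifts activations along a roughly constant direction, then shows that the mean base-vs-finetuned activation difference recovers that direction.

```python
import numpy as np

def activation_difference(base_acts, ft_acts):
    """Mean per-dimension difference between finetuned and base
    activations over a batch of tokens (shape: (n_tokens, d_model))."""
    return (ft_acts - base_acts).mean(axis=0)

rng = np.random.default_rng(0)
d_model, n_tokens = 16, 100

# Toy stand-in for base-model activations on some generic text.
base = rng.normal(size=(n_tokens, d_model))

# Hypothetical "narrow finetuning" bias: a constant shift along one direction.
bias_dir = np.zeros(d_model)
bias_dir[3] = 1.0
ft = base + 0.5 * bias_dir

diff = activation_difference(base, ft)
dominant = int(np.argmax(np.abs(diff)))
print(dominant)  # the difference vector is dominated by the injected direction
```

In practice the analogous quantity would come from comparing hidden states of the base and finetuned model on the same prompts; the point of the sketch is only that a systematic finetuning shift shows up directly in the averaged activation difference.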

Julian Minder, Clément Dumas, Stewart Slocum, + 4 more · 2026-03-06 · cs