Causal Interpretation of Neural Network Computations with Contribution Decomposition
This paper introduces CODEC, a method that uses sparse autoencoders to decompose neural network computations into sparse, causal motifs of hidden-neuron contributions. By revealing how nonlinear computations evolve across layers, CODEC enables greater interpretability and control of both artificial and biological neural systems.
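To make the core idea concrete, the following is a minimal, hypothetical sketch (not the paper's actual CODEC implementation) of the general technique it builds on: training a sparse autoencoder on a layer's hidden activations, then reading off each dictionary feature's additive contribution to the reconstruction. All names, dimensions, and hyperparameters here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: activations recorded from one hidden layer.
d_hidden, d_dict, n = 16, 64, 512
acts = rng.normal(size=(n, d_hidden))

# Sparse autoencoder with an overcomplete dictionary (d_dict > d_hidden).
W_enc = rng.normal(scale=0.1, size=(d_hidden, d_dict))
b_enc = np.zeros(d_dict)
W_dec = W_enc.T.copy()

def encode(x):
    # ReLU keeps codes nonnegative and encourages sparsity with the L1 term.
    return np.maximum(x @ W_enc + b_enc, 0.0)

def decode(z):
    return z @ W_dec

lr, l1 = 1e-2, 1e-3
mse_before = np.mean((decode(encode(acts)) - acts) ** 2)

for step in range(200):
    z = encode(acts)
    err = decode(z) - acts                      # reconstruction error
    # Gradients of 0.5*||err||^2 + l1*||z||_1, gated through the ReLU.
    dz = np.where(z > 0, err @ W_dec.T + l1, 0.0)
    W_enc -= lr * (acts.T @ dz) / n
    b_enc -= lr * dz.mean(axis=0)
    W_dec -= lr * (z.T @ err) / n

z = encode(acts)
mse_after = np.mean((decode(z) - acts) ** 2)

# Contribution decomposition for one sample: each active feature k
# contributes z[k] * W_dec[k] additively, and these sum to the reconstruction.
contrib = z[0][:, None] * W_dec
```

Because the decoder is linear, the reconstruction splits exactly into per-feature contributions (`contrib.sum(axis=0)` equals `decode(z)[0]`); this additivity is what makes the decomposition amenable to causal, per-feature analysis.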