Causal Interpretation of Neural Network Computations with Contribution Decomposition

This paper introduces CODEC, a method that uses sparse autoencoders to decompose neural network computations into sparse, causal motifs of hidden-neuron contributions. This reveals how nonlinear processing evolves across layers and enables greater interpretability and control of both artificial and biological neural systems.

Joshua Brendan Melander, Zaki Alaoui, Shenghua Liu, Surya Ganguli, Stephen A. Baccus

Published Mon, 09 Ma

Imagine you have a giant, complex factory that turns raw materials (like a picture of a cat) into a finished product (the label "Cat"). Inside this factory, there are thousands of workers (neurons) passing notes, moving boxes, and shouting instructions to each other.

For a long time, scientists trying to understand these factories (neural networks) have only looked at who is busy. They'd say, "Look! Worker #42 is waving their arms!" But they didn't know why they were waving, or if that waving actually helped build the "Cat" product, or if it was just noise.

This paper introduces a new tool called CODEC (Contribution Decomposition). Instead of just watching who is busy, CODEC asks: "How much did each worker actually help build the final product?"

Here is a breakdown of how it works, using simple analogies:

1. The Problem: The "Busy Worker" Trap

Imagine a construction site. You see a guy named Bob hammering a nail.

  • Old Method (Activations): "Bob is hammering! He must be important!"
  • The Reality: Maybe Bob is hammering the wrong nail, or he's hammering a nail that actually weakens the wall. Or maybe he's just hammering because he's nervous, and it has nothing to do with the building.

In neural networks, we often see neurons "activating" (lighting up) when we show a picture of a dog. But just because a neuron lights up doesn't mean it's helping the computer recognize the dog. It might be lighting up because it sees a background tree, which is irrelevant.

2. The Solution: The "Scorecard" (Contribution)

CODEC changes the question. Instead of asking "Is Bob working?", it asks, "Did Bob's work make the building stronger or weaker?"

It calculates a Scorecard for every single worker:

  • Positive Score: "This worker helped build the 'Cat' label."
  • Negative Score: "This worker actually tried to stop the 'Cat' label (maybe they thought it was a dog)."

This is huge because it separates the helpers from the distractors.
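To make the "scorecard" idea concrete, here is a minimal sketch in numpy. One common way to score a hidden neuron is activation × weight-to-output: how busy the worker is, times how much their work feeds the final product. The toy numbers and names below are illustrative, not the paper's exact formulation.

```python
import numpy as np

def contributions(activations, weights_to_output):
    """Per-neuron contribution to a single output logit:
    activation * weight carrying it to that output."""
    return activations * weights_to_output

# Toy hidden layer of 4 neurons feeding the "cat" logit.
acts = np.array([2.0, 0.0, 1.5, 3.0])    # how "busy" each worker is
w_cat = np.array([0.5, 4.0, -1.0, 0.0])  # weight from each neuron to "cat"

scores = contributions(acts, w_cat)
# Neuron 0 helps (+1.0), neuron 2 hurts (-1.5), and neuron 3 is the
# "busy worker" trap: very active, yet contributing nothing at all.
```

Notice that the most active neuron (activation 3.0) gets a score of zero, while a quieter neuron can get a large negative score: activity and contribution come apart exactly as the analogy suggests.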

3. The Magic Trick: Finding the "Secret Teams" (Modes)

Once we have the scorecards for all 50,000 pictures, we have a massive mess of data. CODEC uses a smart sorting machine (a Sparse Autoencoder) to find Secret Teams.

Imagine you are trying to figure out how a band plays a song. You don't just listen to every instrument individually; you realize that the "Drum Section" and the "Guitar Section" work together as a unit.

CODEC finds these Contribution Modes:

  • Mode A: A specific group of workers who always team up to identify "furry textures."
  • Mode B: A different group that always teams up to identify "pointy ears."

It turns out that the network doesn't just use random workers; it uses these pre-planned teams to solve problems.
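The "secret teams" idea can be sketched in a few lines. A sparse autoencoder encodes each contribution vector with a linear map plus a ReLU, and decodes it as a nonnegative mix of a few dictionary atoms (the teams). To keep the demo deterministic we hand-pick an ideal dictionary; in CODEC the dictionary is learned from data, and the team labels ("furry texture", "pointy ears") are purely illustrative.

```python
import numpy as np

# Two hand-picked "teams" of hidden neurons (the decoder dictionary).
team_a = np.array([1., 1., 1., 0., 0., 0.])  # e.g. "furry texture" team
team_b = np.array([0., 0., 0., 1., 1., 1.])  # e.g. "pointy ears" team
D = np.stack([team_a, team_b])               # rows are the modes

# A contribution vector for one image: mostly team A, a little team B.
x = 0.9 * team_a + 0.2 * team_b

# SAE-style encoding: linear map + ReLU. Because these atoms are
# orthogonal, the least-squares encoder recovers each team's strength.
codes = np.maximum(D @ x / (D * D).sum(axis=1), 0.0)
x_hat = codes @ D                            # reconstruction from 2 teams
```

The point is compression with meaning: six neuron scores collapse into two team strengths (0.9 of team A, 0.2 of team B), and those two numbers are enough to rebuild the whole pattern.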

4. What They Discovered

When they applied this to image classifiers (like ResNet-50) and even models of the human eye (retina), they found some surprising things:

  • The "Specialist" Effect: As data moves deeper into the network, the workers become more specialized. Early layers are like general laborers (seeing edges and lines), but deep layers are like master architects (seeing whole concepts like "panda" or "violin").
  • The "Yin and Yang" Shift: In the early layers, workers who help and workers who hurt are often the same people doing opposite things. But as you go deeper, the network splits them up! The "Helpers" and the "Hinderers" become totally separate teams. This makes the system much more efficient and precise.
  • Controlling the Factory: Because CODEC knows exactly which "Secret Teams" build the "Cat" label, the researchers could surgically remove those teams.
    • The Experiment: They told the computer, "Ignore the 'Furry Texture' team."
    • The Result: The computer stopped recognizing cats but could still recognize cars perfectly. It's like removing the "cat-detecting" brain part while leaving the "car-detecting" part intact.
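The surgical-removal experiment can be mimicked in a toy model: project one team's direction out of the hidden activations and watch only the matching class suffer. The two-team network below is a made-up illustration of the ablation logic, not the paper's ResNet setup.

```python
import numpy as np

furry_team = np.array([1., 1., 1., 0., 0., 0.])  # drives "cat"
wheel_team = np.array([0., 0., 0., 1., 1., 1.])  # drives "car"

# Readout: the "cat" logit listens to the furry team, "car" to the wheels.
W_out = np.stack([furry_team, wheel_team])

def logits(acts, ablate=None):
    """Class logits, optionally projecting one mode out of the activations."""
    if ablate is not None:
        u = ablate / np.linalg.norm(ablate)
        acts = acts - (acts @ u) * u   # remove the team's direction
    return W_out @ acts

acts = 2.0 * furry_team + 1.0 * wheel_team  # an image with both features

before = logits(acts)                       # both classes respond
after = logits(acts, ablate=furry_team)     # "fire" the furry team
```

After the ablation the "cat" logit collapses to zero while the "car" logit is untouched, which is the toy version of "stopped recognizing cats but could still recognize cars perfectly."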

5. Why This Matters

  • For AI Safety: If we want to know if an AI is making a decision based on racism or bias, we can use CODEC to see exactly which "teams" are driving that decision.
  • For Biology: They used this on models of the human eye and found that the eye's neurons work in these same "Secret Teams" to process motion and light. It helps us understand how our own brains work.
  • For Better AI: Instead of building giant, messy factories, we might be able to build smaller, more efficient networks that are organized into these clear, understandable teams.

The Bottom Line

Before, we were trying to understand a symphony by counting how many times each musician moved their bow. Now, with CODEC, we can listen to the music and say, "Ah, the violins are playing the melody, and the cellos are providing the bass, and if we mute the trumpets, the song falls apart."

It turns the "Black Box" of artificial intelligence into a transparent, understandable machine where we can see exactly how the pieces fit together to create the result.