Causal Interpretation of Neural Network Computations with Contribution Decomposition

This paper introduces CODEC, a method that uses sparse autoencoders to decompose neural network computations into sparse, causal motifs of hidden-neuron contributions. This reveals how nonlinear processing evolves across layers and enables greater interpretability and control of both artificial and biological neural systems.

Joshua Brendan Melander, Zaki Alaoui, Shenghua Liu, Surya Ganguli, Stephen A. Baccus

Published Mon, 09 Ma

Imagine you have a giant, complex factory that turns raw materials (like a picture of a cat) into a finished product (the label "Cat"). Inside this factory, there are thousands of workers (neurons) passing notes, moving boxes, and shouting instructions to each other.

For a long time, scientists trying to understand these factories (neural networks) have only looked at who is busy. They'd say, "Look! Worker #42 is waving their arms!" But they didn't know why they were waving, or if that waving actually helped build the "Cat" product, or if it was just noise.

This paper introduces a new tool called CODEC (Contribution Decomposition). Instead of just watching who is busy, CODEC asks: "How much did each worker actually help build the final product?"

Here is a breakdown of how it works, using simple analogies:

1. The Problem: The "Busy Worker" Trap

Imagine a construction site. You see a guy named Bob hammering a nail.

  • Old Method (Activations): "Bob is hammering! He must be important!"
  • The Reality: Maybe Bob is hammering the wrong nail, or he's hammering a nail that actually weakens the wall. Or maybe he's just hammering because he's nervous, and it has nothing to do with the building.

In neural networks, we often see neurons "activating" (lighting up) when we show a picture of a dog. But just because a neuron lights up doesn't mean it's helping the computer recognize the dog. It might be lighting up because it sees a background tree, which is irrelevant.

2. The Solution: The "Scorecard" (Contribution)

CODEC changes the question. Instead of asking "Is Bob working?", it asks, "Did Bob's work make the building stronger or weaker?"

It calculates a Scorecard for every single worker:

  • Positive Score: "This worker helped build the 'Cat' label."
  • Negative Score: "This worker actually tried to stop the 'Cat' label (maybe they thought it was a dog)."

This is huge because it separates the helpers from the distractors.
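To make the "scorecard" idea concrete, here is a minimal sketch in numpy. One common way to score a hidden neuron is activation × weight-to-output: how busy the worker is, times how much their work feeds the final product. The toy numbers and names below are illustrative, not the paper's exact formulation.

```python
import numpy as np

def contributions(activations, weights_to_output):
    """Per-neuron contribution to a single output logit:
    activation * weight carrying it to that output."""
    return activations * weights_to_output

# Toy hidden layer of 4 neurons feeding the "cat" logit.
acts = np.array([2.0, 0.0, 1.5, 3.0])    # how "busy" each worker is
w_cat = np.array([0.5, 4.0, -1.0, 0.0])  # weight from each neuron to "cat"

scores = contributions(acts, w_cat)
# Neuron 0 helps (+1.0), neuron 2 hurts (-1.5), and neuron 3 is the
# "busy worker" trap: very active, yet contributing nothing at all.
```

Notice that the most active neuron (activation 3.0) gets a score of zero, while a quieter neuron can get a large negative score: activity and contribution come apart exactly as the analogy suggests.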

3. The Magic Trick: Finding the "Secret Teams" (Modes)

Once we have the scorecards for all 50,000 pictures, we have a massive mess of data. CODEC uses a smart sorting machine (a Sparse Autoencoder) to find Secret Teams.

Imagine you are trying to figure out how a band plays a song. You don't just listen to every instrument individually; you realize that the "Drum Section" and the "Guitar Section" work together as a unit.

CODEC finds these Contribution Modes:

  • Mode A: A specific group of workers who always team up to identify "furry textures."
  • Mode B: A different group that always teams up to identify "pointy ears."

It turns out that the network doesn't just use random workers; it uses these pre-planned teams to solve problems.
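The "secret teams" idea can be sketched in a few lines. A sparse autoencoder encodes each contribution vector with a linear map plus a ReLU, and decodes it as a nonnegative mix of a few dictionary atoms (the teams). To keep the demo deterministic we hand-pick an ideal dictionary; in CODEC the dictionary is learned from data, and the team labels ("furry texture", "pointy ears") are purely illustrative.

```python
import numpy as np

# Two hand-picked "teams" of hidden neurons (the decoder dictionary).
team_a = np.array([1., 1., 1., 0., 0., 0.])  # e.g. "furry texture" team
team_b = np.array([0., 0., 0., 1., 1., 1.])  # e.g. "pointy ears" team
D = np.stack([team_a, team_b])               # rows are the modes

# A contribution vector for one image: mostly team A, a little team B.
x = 0.9 * team_a + 0.2 * team_b

# SAE-style encoding: linear map + ReLU. Because these atoms are
# orthogonal, the least-squares encoder recovers each team's strength.
codes = np.maximum(D @ x / (D * D).sum(axis=1), 0.0)
x_hat = codes @ D                            # reconstruction from 2 teams
```

The point is compression with meaning: six neuron scores collapse into two team strengths (0.9 of team A, 0.2 of team B), and those two numbers are enough to rebuild the whole pattern.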

4. What They Discovered

When they applied this to image classifiers (like ResNet-50) and even models of the human eye (retina), they found some surprising things:

  • The "Specialist" Effect: As data moves deeper into the network, the workers become more specialized. Early layers are like general laborers (seeing edges and lines), but deep layers are like master architects (seeing whole concepts like "panda" or "violin").
  • The "Yin and Yang" Shift: In the early layers, workers who help and workers who hurt are often the same people doing opposite things. But as you go deeper, the network splits them up! The "Helpers" and the "Hinderers" become totally separate teams. This makes the system much more efficient and precise.
  • Controlling the Factory: Because CODEC knows exactly which "Secret Teams" build the "Cat" label, the researchers could surgically remove those teams.
    • The Experiment: They told the computer, "Ignore the 'Furry Texture' team."
    • The Result: The computer stopped recognizing cats but could still recognize cars perfectly. It's like removing the "cat-detecting" brain part while leaving the "car-detecting" part intact.
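The surgical-removal experiment can be mimicked in a toy model: project one team's direction out of the hidden activations and watch only the matching class suffer. The two-team network below is a made-up illustration of the ablation logic, not the paper's ResNet setup.

```python
import numpy as np

furry_team = np.array([1., 1., 1., 0., 0., 0.])  # drives "cat"
wheel_team = np.array([0., 0., 0., 1., 1., 1.])  # drives "car"

# Readout: the "cat" logit listens to the furry team, "car" to the wheels.
W_out = np.stack([furry_team, wheel_team])

def logits(acts, ablate=None):
    """Class logits, optionally projecting one mode out of the activations."""
    if ablate is not None:
        u = ablate / np.linalg.norm(ablate)
        acts = acts - (acts @ u) * u   # remove the team's direction
    return W_out @ acts

acts = 2.0 * furry_team + 1.0 * wheel_team  # an image with both features

before = logits(acts)                       # both classes respond
after = logits(acts, ablate=furry_team)     # "fire" the furry team
```

After the ablation the "cat" logit collapses to zero while the "car" logit is untouched, which is the toy version of "stopped recognizing cats but could still recognize cars perfectly."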

5. Why This Matters

  • For AI Safety: If we want to know if an AI is making a decision based on racism or bias, we can use CODEC to see exactly which "teams" are driving that decision.
  • For Biology: They used this on models of the human eye and found that the eye's neurons work in these same "Secret Teams" to process motion and light. It helps us understand how our own brains work.
  • For Better AI: Instead of building giant, messy factories, we might be able to build smaller, more efficient networks that are organized into these clear, understandable teams.

The Bottom Line

Before, we were trying to understand a symphony by counting how many times each musician moved their bow. Now, with CODEC, we can listen to the music and say, "Ah, the violins are playing the melody, and the cellos are providing the bass, and if we mute the trumpets, the song falls apart."

It turns the "Black Box" of artificial intelligence into a transparent, understandable machine where we can see exactly how the pieces fit together to create the result.