Imagine your brain's visual cortex as a massive, bustling library. Inside this library, millions of neurons (the librarians) are constantly reading the "books" of what you see. For decades, scientists have tried to figure out how these librarians organize the information. Do they sort books by color? By the author's name? Or by the story inside?
The problem is that the librarians are messy. One librarian might be talking about the shape of a cat, while another is shouting about the color of a car, and a third is whispering about the angle at which a face is turned. They are all shouting at once, making it hard to tell who is responsible for what.
This paper introduces a new tool called MIG-Vis (Mutual Information-Guided Diffusion for Visual cortex) to solve this mystery. Think of it as a "magic decoder ring" that lets scientists listen to specific groups of librarians and see exactly what kind of "story" they are telling.
Here is how it works, broken down into simple steps:
1. The Problem: The "Mixed Signal" Noise
Previously, scientists tried to decode brain activity by asking, "If we turn up the volume on this one neuron, what picture do we get?" But because neurons are so interconnected, turning up one volume knob often changes the whole picture in a blurry, confusing way. It's like trying to fix a radio station by twisting just one dial; you usually just get static or a mix of two songs.
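To see why twisting one dial can't work, here is a tiny, hypothetical sketch (invented for illustration, not the paper's model): each simulated neuron reads a random mixture of two underlying factors, so nudging a single neuron implies that both factors shifted at once.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two underlying factors the brain might care about: "rotation" and "category".
factors = np.array([0.5, -1.0])

# Each simulated neuron reads a random mixture of BOTH factors,
# so no single neuron corresponds to a single factor.
mixing = rng.normal(size=(5, 2))        # 5 neurons x 2 factors
neurons = mixing @ factors

# "Turn up the volume" on neuron 0, then ask which factor values would
# best explain the new activity (via the pseudo-inverse of the mixture):
neurons_tweaked = neurons.copy()
neurons_tweaked[0] += 1.0
implied = np.linalg.pinv(mixing) @ neurons_tweaked

print("original factors:       ", factors)
print("factors implied by tweak:", implied)  # both entries move
```

Because every neuron mixes both factors, the single-knob tweak smears across rotation *and* category, which is exactly the blurry, confusing change described above.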
2. The Solution: Grouping the Librarians
The researchers first used a smart computer program (a Variational Autoencoder) to organize the chaotic activity. They didn't just look at individual neurons; they grouped them into "teams" or Latent Groups.
- Team A might be the "Rotation Squad" (handling how things are turned).
- Team B might be the "Category Crew" (handling whether it's a cat or a car).
- Team C might be the "Texture Team" (handling details like fur or stripes).
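The grouping idea can be sketched in a few lines (the team names and sizes below are invented for illustration; the paper's actual VAE learns the groups from recorded neural data):

```python
import numpy as np

rng = np.random.default_rng(1)

# One latent vector summarizing the whole population's activity, split into
# named, non-overlapping "teams" (latent groups). Sizes here are made up.
groups = {
    "rotation": slice(0, 4),
    "category": slice(4, 12),
    "texture":  slice(12, 16),
}
latent = rng.normal(size=16)

def perturb_group(z, name, delta):
    """Nudge only one latent group, leaving the other teams untouched."""
    z = z.copy()
    z[groups[name]] += delta
    return z

z2 = perturb_group(latent, "rotation", 0.8)
changed = ~np.isclose(latent, z2)
print(changed)  # True only for the "rotation" slots
```

The point of the structure is exactly this: you can tweak one team's part of the code without touching the others, which is what makes the "what-if" experiments in the next step possible.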
3. The Magic Trick: The "What-If" Machine
Once they had these teams, they needed to see what each team actually did. They used a Diffusion Model (the same technology behind AI image generators like DALL-E or Midjourney) to act as a "What-If" machine.
Here is the clever part:
- Old Way: Scientists would just ask the computer to "draw a picture based on this brain signal." But the computer often just drew the average of everything, smoothing out the cool details.
- The New Way (MIG-Vis): Instead of just asking for a picture, they use a concept called Mutual Information. Imagine you have a secret code (the brain signal). You want to generate an image that perfectly matches that code.
- If the code says "Turn the object 90 degrees," the AI generates an image where the object is turned 90 degrees.
- If the code says "Change the object from a face to a strawberry," the AI actually morphs the face into a strawberry.
The "Mutual Information" part is like a strict teacher. It doesn't just say, "Close enough." It checks: "Does this new image truly contain the specific information we asked for?" If the image is blurry or wrong, the teacher says, "Try again," until the image perfectly reflects the brain signal.
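As a rough illustration of the "strict teacher," here is a toy, histogram-based mutual-information check (a simplified stand-in, not the paper's estimator, which works on continuous latents and real images): if the generated images faithfully reflect the requested code, the shared information is high; if they ignore the code, it collapses toward zero.

```python
import numpy as np

def mutual_information(x, y):
    """Plug-in mutual information estimate (in bits) between two
    discrete label arrays of non-negative integers."""
    joint = np.zeros((x.max() + 1, y.max() + 1))
    for xi, yi in zip(x, y):
        joint[xi, yi] += 1
    joint /= joint.sum()
    px = joint.sum(axis=1, keepdims=True)   # marginal of x
    py = joint.sum(axis=0, keepdims=True)   # marginal of y
    nz = joint > 0
    return float((joint[nz] * np.log2(joint[nz] / (px @ py)[nz])).sum())

# Hypothetical labels: the brain code asked for one of 4 rotation settings...
codes = np.array([0, 1, 2, 3] * 50)
# ...and the generated images actually show those rotations (faithful)...
faithful = codes.copy()
# ...versus images that ignore the code entirely (unfaithful).
unfaithful = np.random.default_rng(2).integers(0, 4, size=codes.size)

print(mutual_information(codes, faithful))    # 2.0 bits: full agreement
print(mutual_information(codes, unfaithful))  # near 0: "try again"
```

The guidance in MIG-Vis plays the same role during image generation: it keeps pushing the diffusion model until the generated image actually carries the information encoded in the latent group.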
4. What They Discovered
When they tested this on monkeys (who have very similar visual brains to humans), they found some fascinating things:
- The Rotation Team: They found a latent group that tracked how things were turned. When they tweaked this group, a face rotated clockwise while a car rotated counter-clockwise. It acted like a universal "turn" button, even though the resulting motion looked different for different objects.
- The Category Team: Another group controlled the type of object. They could turn a picture of a face into a strawberry just by adjusting that part of the signal. This suggests the brain has a specific "switch" for changing what an object is.
- The Detail Team: Some groups only worked for specific things. One group changed the texture of a strawberry but did nothing to a car. This showed that the brain doesn't have one giant "texture" button for everything; it has specialized buttons for different types of objects.
The Big Picture
Think of the brain's visual cortex not as a flat map, but as a complex, 3D landscape.
- Some parts of the landscape are like a smooth, round doughnut (torus). Moving in one direction always means "rotating," no matter where you are on the doughnut.
- Other parts are like a warped, crumpled piece of paper. Moving in one direction might mean "changing texture" for a strawberry, but "changing shape" for a car.
Why does this matter?
This paper gives us clear, visual evidence that our brains organize visual information into neat, specialized groups. It's like finally finding the filing cabinet in the library and seeing that the "Cat" files are in one drawer and the "Car" files are in another, rather than everything being thrown in a giant pile.
By using this "Magic Decoder Ring," scientists can now see exactly how the brain builds our reality, one semantic piece at a time.