Manifold geometry underlies a unified code for category… — Plain-Language Explanation

This is an AI-generated explanation of a preprint that has not been peer-reviewed. It is not medical advice. Do not make health decisions based on this content.

Imagine your brain is a super-smart librarian. Every time you see a picture of a dog, a car, or a tree, this librarian doesn't just file it away under "Dog" or "Car." It also remembers exactly where the dog is standing, how big it is, and which way it's facing.

For a long time, scientists wondered: How does the brain do both at once? Does it have two separate filing cabinets (one for "what" and one for "where"), or is there one magical, unified filing system that handles everything efficiently?

This paper, by Lorenzo Tiberi and Haim Sompolinsky, answers that question using a mix of computer simulations and advanced math. Here is the story of their discovery, explained simply.

1. The Problem: The "One-Size-Fits-All" Dilemma

Think of your brain's visual system as a series of conveyor belts (like in a factory).

Early belts see raw pixels (lines, colors).
Later belts (the "Inferior Temporal Cortex") recognize complex objects.

Scientists knew that as images move down these belts, it gets easier to tell what an object is. But it was a mystery whether it also gets easier to tell where it is or how big it is. Previous studies suggested it did, but the results were messy. It was like trying to read a book through a foggy window; you could see shapes, but the details were blurry.

The big question was: Is the brain actually using a single, perfect code to store both the object's identity and its position, or is it just a lucky accident that we can guess both?

2. The Experiment: Building a Digital Brain

To solve this, the authors didn't just look at monkey brains (which are hard to measure completely). They built a Convolutional Neural Network (CNN). Think of this as a "digital brain" trained on millions of images.

They created a special dataset of 265 different object categories (like "wild birds," "airplanes," "butterflies"). For every single image, they controlled exactly where the object was and how big it was.

They trained three types of digital brains:

The "Category-Only" Brain: Trained only to say "That's a bird!"
The "Regression-Only" Brain: Trained only to say "The bird is 5 inches wide and in the top-left corner."
The "Joint" Brain: Trained to do both at the same time.

The Result: The "Joint" brain was a superstar. It identified objects as accurately as the "Category-Only" brain and estimated size and position as accurately as the "Regression-Only" brain, using the exact same internal "filing system." This showed that a single code can do both jobs.

3. The Theory: The "Manifold" Library

Now, the authors asked: How does this work? What makes the "Joint" brain so good?

They used a concept called Manifold Geometry.

The Analogy: Imagine every "Dog" image is a point in a giant, multi-dimensional room. All the different pictures of dogs (big dogs, small dogs, dogs in the corner, dogs in the middle) form a cloud of points. This cloud is called a Manifold.
The Goal: To make it easy to read the data, the clouds for different animals (Dogs vs. Cats) need to be far apart (so you don't mix them up). But, within the "Dog" cloud, the points need to be arranged in a straight, orderly line so you can easily read the "size" or "position."

The authors discovered that the "Joint" brain organizes these clouds in a very specific way. They broke down the "reading error" (how wrong the brain is) into two parts:

Local Error (The "Inside" Problem): Is the information clear inside the "Dog" cloud? (e.g., Does a bigger dog always look bigger in the brain's code?)
The Global Gap (The "Outside" Problem): This is the big discovery. Even if the "Dog" cloud is clear, and the "Cat" cloud is clear, can you use one single rule to read the size for both dogs and cats?
- In a "Category-Only" brain, the "Dog" cloud might be tilted one way, and the "Cat" cloud tilted another way. You'd need two different rulers to measure them.
- In the "Joint" brain, the authors found that the clouds are aligned far more closely. The "Dog" cloud and the "Cat" cloud are tilted in nearly the same direction. This means you can use one single ruler to measure the size of any object, regardless of what it is.

4. The "Foggy Window" Effect (Experimental Constraints)

Here is the most practical part of the paper. The authors realized why previous experiments on real monkeys were "foggy."

When scientists record from a monkey's brain, they can only listen to a tiny handful of neurons (maybe 100 or 200) out of the millions that are actually there.

The Analogy: Imagine trying to understand the layout of a massive city by looking at it through a keyhole. You might see a few buildings, but you can't see the whole street grid.
The Finding: When the authors simulated this "keyhole" view (subsampling the neurons), the beautiful "Global Alignment" of the Joint Brain disappeared! The "Global Gap" got huge, and the Joint Brain looked just like the "Category-Only" brain.

The Takeaway: The brain may well be using this unified code, but because we can only record a tiny fraction of the neurons, we could miss the big picture. It's like trying to hear a symphony by listening to just one violin; you miss the harmony.

Summary: What Does This Mean for Us?

Unified Code Exists: A trained network (and plausibly the brain) can store "what something is" and "where it is" in the same neural code without them getting in each other's way.
The Secret Sauce: The magic isn't just in how the brain sees individual objects, but in how it aligns the view of all different objects so they can be read by the same simple rule.
Future Research: If we want to test this in real animals, we need to record from many more neurons at once. If we only look at a few, we risk falsely concluding that the brain doesn't have this unified code.

In short, the brain may be a master organizer that keeps its "What" and "Where" files aligned, but we need larger recordings (more neurons at once) to see the alignment clearly.

1. Problem Statement

In everyday vision, animals and machines must simultaneously extract object identity (category) and continuous, identity-independent variables (e.g., position, size, pose) from the same visual stimulus. While neural recordings in the ventral visual stream (V1 $\to$ V4 $\to$ IT) and artificial Convolutional Neural Networks (CNNs) show that linear decoding performance for both tasks improves along the hierarchy, a central question remains:

Can a single neural population code effectively support both linear classification (category) and linear regression (category-independent features)?
If so, what are the specific geometric properties of the neural representation (manifolds) that enable this joint coding?
How do experimental constraints (e.g., subsampling neural units, limited category sets) obscure the detection of such codes in biological data?

Previous theoretical frameworks focused on object manifolds to explain classification performance (e.g., storage capacity, few-shot learning) but lacked an analogous theory linking manifold geometry to the regression of category-independent features.

2. Methodology

A. Data Generation and Experimental Setup

Dataset: The authors constructed a large-scale dataset of 265 object categories (from Objects365), with 20,000 images per category (10k training, 10k test). Images were generated using a multi-stage pipeline involving Stable Diffusion XL (text-to-image), CerberusDet (object detection), and Stable Diffusion v1.5 (inpainting/outpainting). This ensured controlled, uniform distributions of bounding box coordinates ( $C_h, C_v, L_h, L_v$ ) while maintaining realistic object appearances.
Models: They utilized a ResNet-50 backbone (pretrained on ImageNet) aligned with ventral stream responses. Three network variants were trained:
1. Network C: Optimized only for object classification.
2. Network R: Optimized only for regression of bounding box parameters.
3. Network CR: Optimized jointly for both classification and regression (the "joint code" candidate).
Decoding Framework: Linear decoders (ridge regression/classification) were trained on the feature layer activations to predict categories and bounding box coordinates. Performance was measured using balanced accuracy and normalized Mean Squared Error (nMSE).
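
The decoding step above can be sketched in a few lines of NumPy. This is a minimal illustration on synthetic activations, not the authors' pipeline: the helper names (`ridge_fit`, `nmse`, `balanced_accuracy`), the ridge penalty, and the toy data generator are all assumptions made for the example, and classification is approximated here by ridge regression onto one-hot labels.

```python
import numpy as np

def ridge_fit(X, Y, lam=1e-2):
    """Closed-form ridge regression with an intercept; returns (W, b)."""
    x_mean, y_mean = X.mean(axis=0), Y.mean(axis=0)
    Xc, Yc = X - x_mean, Y - y_mean
    W = np.linalg.solve(Xc.T @ Xc + lam * np.eye(X.shape[1]), Xc.T @ Yc)
    return W, y_mean - x_mean @ W

def nmse(y_true, y_pred):
    """MSE divided by the target variance (1.0 means no better than the mean)."""
    return float(np.mean((y_true - y_pred) ** 2) / np.var(y_true))

def balanced_accuracy(labels, preds, n_classes):
    """Average of per-class recall."""
    return float(np.mean([np.mean(preds[labels == c] == c) for c in range(n_classes)]))

# Synthetic stand-in for feature-layer activations (hypothetical sizes).
rng = np.random.default_rng(0)
n_units, n_cat, n_per = 50, 5, 200
centroids = rng.normal(size=(n_cat, n_units))   # one centroid per category
w = rng.normal(size=n_units)
w *= 2.0 / np.linalg.norm(w)                    # shared encoding direction

def sample(n_per):
    labels = np.repeat(np.arange(n_cat), n_per)
    y = rng.uniform(-1, 1, size=labels.size)    # continuous feature, e.g. position
    X = centroids[labels] + np.outer(y, w) + 0.3 * rng.normal(size=(labels.size, n_units))
    return X, y, labels

X_tr, y_tr, c_tr = sample(n_per)
X_te, y_te, c_te = sample(n_per)

# Regression readout of the continuous feature.
W_reg, b_reg = ridge_fit(X_tr, y_tr)
test_nmse = nmse(y_te, X_te @ W_reg + b_reg)

# Classification readout: ridge onto one-hot labels, then argmax.
W_cls, b_cls = ridge_fit(X_tr, np.eye(n_cat)[c_tr])
preds = np.argmax(X_te @ W_cls + b_cls, axis=1)
test_bal_acc = balanced_accuracy(c_te, preds, n_cat)

print(f"nMSE = {test_nmse:.3f}, balanced accuracy = {test_bal_acc:.3f}")
```

Both readouts act on the same activation matrix, which is the sense in which a single population code is asked to support classification and regression at once.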

B. Theoretical Framework: Manifold Regression

The authors extended the theory of object manifolds to regression. They decomposed the global regression error ( $E$ ) into two additive components:
$E = E_{loc} + \Delta E$

Local Error ( $E_{loc}$ ): The error incurred when using a separate regressor for each category. It measures how well the feature is linearly encoded within a single category manifold.
Local-Global Gap ( $\Delta E$ ): The additional error incurred when forcing a single category-independent regressor to work across all categories. This gap captures the geometric misalignment of manifolds across the population.
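
To make the decomposition concrete, the sketch below computes $E$, $E_{loc}$, and $\Delta E$ on synthetic manifolds. It is an illustration under stated assumptions, not the authors' estimator: it uses ordinary least squares evaluated in-sample (so that $\Delta E \geq 0$ holds by construction), and the data generator `make_data`, with either a shared or a per-category encoding direction, is hypothetical.

```python
import numpy as np

def fit_predict(X, y):
    """Ordinary least squares with an intercept, evaluated in-sample."""
    A = np.hstack([X, np.ones((len(X), 1))])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return A @ coef

def decompose(X, y, labels):
    """Return (E, E_loc, gap), all normalized by the variance of y."""
    var_y = np.var(y)
    # Global error: one regressor shared by every category.
    e_glob = np.mean((y - fit_predict(X, y)) ** 2) / var_y
    # Local error: a separate regressor per category, residuals pooled.
    resid = np.empty_like(y)
    for c in np.unique(labels):
        m = labels == c
        resid[m] = y[m] - fit_predict(X[m], y[m])
    e_loc = np.mean(resid ** 2) / var_y
    return e_glob, e_loc, e_glob - e_loc

def make_data(rng, aligned, n_units=60, n_cat=8, n_per=500, noise=0.2):
    """Toy manifolds: centroid + feature * direction + isotropic noise."""
    shared = rng.normal(size=n_units)
    X, y, labels = [], [], []
    for c in range(n_cat):
        w = shared if aligned else rng.normal(size=n_units)
        w = w / np.linalg.norm(w)
        yc = rng.uniform(-1, 1, size=n_per)
        centroid = rng.normal(size=n_units)
        X.append(centroid + np.outer(yc, w) + noise * rng.normal(size=(n_per, n_units)))
        y.append(yc)
        labels.append(np.full(n_per, c))
    return np.vstack(X), np.concatenate(y), np.concatenate(labels)

rng = np.random.default_rng(0)
E_a, Eloc_a, gap_a = decompose(*make_data(rng, aligned=True))
E_m, Eloc_m, gap_m = decompose(*make_data(rng, aligned=False))
print(f"aligned:    E={E_a:.3f}  E_loc={Eloc_a:.3f}  gap={gap_a:.3f}")
print(f"misaligned: E={E_m:.3f}  E_loc={Eloc_m:.3f}  gap={gap_m:.3f}")
```

In this toy setting the local error is similar in both conditions, while the gap is much larger when each category encodes the feature along its own direction. That mirrors the role the summary assigns to $\Delta E$, though the numbers themselves carry no meaning beyond the example.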

They derived a theory predicting $\Delta E$ based on three geometric sources of error:

Centroid Error ( $E_c$ ): Mismatch in the mean positions (centroids) of category manifolds relative to the feature values.
Scale Error ( $E_s$ ): Variability in the "scale" (magnitude of the readout vector) of the feature encoding across different categories.
Orientation Error ( $E_o$ ): Misalignment of the local feature-encoding directions across categories. This is the dominant term, defined by the alignment index ( $a$ ) and the Signal-to-Noise Ratio (SNR) of the encoding direction relative to uninformative manifold variance.
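
The summary does not give the formula for the alignment index $a$, so the following sketch uses a stand-in: the mean pairwise cosine similarity between per-category least-squares readout vectors. The function names (`local_readout`, `alignment_proxy`, `simulate`) and the synthetic data are assumptions for illustration only and should not be read as the paper's definition.

```python
import numpy as np

def local_readout(X, y):
    """Least-squares readout weights for one category (intercept dropped)."""
    A = np.hstack([X, np.ones((len(X), 1))])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef[:-1]

def alignment_proxy(readouts):
    """Mean off-diagonal cosine similarity between readout vectors."""
    U = readouts / np.linalg.norm(readouts, axis=1, keepdims=True)
    cos = U @ U.T
    off_diag = cos[~np.eye(len(U), dtype=bool)]
    return float(off_diag.mean())

def simulate(rng, aligned, n_units=30, n_cat=8, n_per=500, noise=0.2):
    """Fit one local readout per toy category and score their alignment."""
    shared = rng.normal(size=n_units)
    readouts = []
    for _ in range(n_cat):
        w = shared if aligned else rng.normal(size=n_units)
        w = w / np.linalg.norm(w)
        y = rng.uniform(-1, 1, size=n_per)
        centroid = rng.normal(size=n_units)
        X = centroid + np.outer(y, w) + noise * rng.normal(size=(n_per, n_units))
        readouts.append(local_readout(X, y))
    return alignment_proxy(np.array(readouts))

rng = np.random.default_rng(0)
a_aligned = simulate(rng, aligned=True)
a_misaligned = simulate(rng, aligned=False)
print(f"alignment proxy, shared direction:       {a_aligned:.2f}")
print(f"alignment proxy, per-category direction: {a_misaligned:.2f}")
```

A proxy near 1 means one readout direction serves all categories; a proxy near 0 means each category needs its own, which is the situation the Orientation Error term penalizes.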

3. Key Contributions

Existence of Joint Codes: Demonstrated that CNNs can be trained to implement a single population code that supports optimal linear readout of both object category and continuous features, matching the performance of networks specialized for single tasks.
Theory of Manifold Regression: Developed a novel theoretical framework decomposing regression error into local and local-global components. This identifies manifold alignment and encoding scale consistency as the critical geometric properties for joint coding, distinct from the manifold radius/dimensionality properties critical for classification.
Optimization Strategy: Showed that optimizing for regression does not require reshaping the entire manifold (preserving classification geometry). Instead, the network achieves joint coding by re-encoding the feature along the manifold's dominant principal components (increasing SNR) and aligning these directions across categories, leaving the overall manifold shape and centroid separation largely unchanged.
Experimental Constraints Analysis: Quantified how neural unit subsampling and limited category sets (finite $P$ ) distort the local-global gap. Subsampling units inflates the gap of a joint code until it resembles that of a classification-only code, potentially leading to false negatives in biological studies, while small category sets artificially reduce the gap. The theory provides a method to extrapolate finite-sample results to the infinite-category limit.

4. Key Results

Performance: Network CR achieved classification accuracy equal to Network C and regression error equal to Network R. Crucially, the regression error of Network C was significantly higher than Network CR.
Error Decomposition:
- In Network C (classification-only), the local-global gap ( $\Delta E$ ) was large, driven primarily by Orientation Error ( $E_o$ ). The encoding directions for position/size were misaligned across categories.
- In Network CR, $\Delta E$ dropped by orders of magnitude. This was achieved by drastically reducing both Orientation Error (via better alignment $a$ and higher SNR) and Scale Error ( $E_s$ ).
- Centroid Error ( $E_c$ ) was negligible in all cases.
Geometric Invariance: Despite the massive reduction in regression error, the manifold shapes (eigenvalue spectra) and centroid separations in Network CR remained nearly identical to Network C. This indicates that regression-relevant geometry can be optimized largely independently of classification-relevant geometry.
Hierarchy Trends: In Network C, regression performance improved slightly with depth, but the local-global gap remained constant. In Network CR, the gap collapsed only in the final layers, coinciding with the rise in classification accuracy.
Subsampling Effects:
- Unit Subsampling: When the number of recorded units dropped below ~200, the local-global gap in Network CR increased to match Network C, making the two indistinguishable. This suggests that current biological recordings (often <200 units) may be too small to detect joint codes.
- Category Subsampling: Using a small number of categories ( $P$ ) leads to overfitting of the global regressor, artificially reducing $\Delta E$ . The authors' theory allows extrapolation to the $P \to \infty$ limit to correct for this bias.
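
The unit-subsampling analysis can be mimicked procedurally as follows. This is a sketch of the procedure only: the population is synthetic, the names (`ridge_predict`, `gap_on_units`, `make_split`) and all sizes are hypothetical, and the toy data are not expected to reproduce the ~200-unit threshold reported above. Errors are measured on held-out samples with ridge regression, so the estimated gap can come out slightly negative.

```python
import numpy as np

def ridge_predict(X_tr, y_tr, X_te, lam=1.0):
    """Fit ridge regression with an intercept on train, predict on test."""
    mu_x, mu_y = X_tr.mean(axis=0), y_tr.mean()
    Xc = X_tr - mu_x
    w = np.linalg.solve(Xc.T @ Xc + lam * np.eye(Xc.shape[1]), Xc.T @ (y_tr - mu_y))
    return (X_te - mu_x) @ w + mu_y

def gap_on_units(data, units):
    """Held-out (E_loc, gap) using only the selected unit indices."""
    X_tr, y_tr, c_tr, X_te, y_te, c_te = data
    X_tr, X_te = X_tr[:, units], X_te[:, units]
    var_y = np.var(y_te)
    e_glob = np.mean((y_te - ridge_predict(X_tr, y_tr, X_te)) ** 2) / var_y
    resid = np.empty_like(y_te)
    for c in np.unique(c_tr):
        tr, te = c_tr == c, c_te == c
        resid[te] = y_te[te] - ridge_predict(X_tr[tr], y_tr[tr], X_te[te])
    e_loc = np.mean(resid ** 2) / var_y
    return float(e_loc), float(e_glob - e_loc)

def make_split(rng, w, centroids, n_per, noise=1.0):
    """Toy population: every category encodes the feature along the same w."""
    X, y, labels = [], [], []
    for c, mu in enumerate(centroids):
        yc = rng.uniform(-1, 1, size=n_per)
        X.append(mu + np.outer(yc, w) + noise * rng.normal(size=(n_per, w.size)))
        y.append(yc)
        labels.append(np.full(n_per, c))
    return np.vstack(X), np.concatenate(y), np.concatenate(labels)

rng = np.random.default_rng(0)
n_units, n_cat = 200, 8
w = rng.normal(size=n_units)
centroids = rng.normal(size=(n_cat, n_units))
data = make_split(rng, w, centroids, 600) + make_split(rng, w, centroids, 300)

results = {}
for k in (10, 50, 200):
    units = rng.choice(n_units, size=k, replace=False)  # simulated "recorded" units
    results[k] = gap_on_units(data, units)
    print(f"{k:4d} units: E_loc={results[k][0]:.3f}  gap={results[k][1]:.3f}")
```

With real activations, the same loop would be repeated over many random draws of `units` and averaged, and run separately for each network variant so their gaps can be compared at matched population sizes.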

5. Significance and Implications

Neuroscience: The paper provides a principled explanation for why previous studies in the macaque ventral stream might have found limited regression performance. It suggests that the "joint code" hypothesis may be valid but obscured by limited recording populations and small category sets. Future experiments should aim for larger-scale recordings and analyze the local-global gap (rather than just global error) to test for joint coding.
Machine Learning: It offers a geometric understanding of how multi-task learning works in deep networks. It demonstrates that a network can learn to encode continuous variables without sacrificing categorical separability, provided the encoding directions are aligned across the manifold space.
Theoretical Advancement: By bridging the gap between manifold theory (traditionally used for classification) and regression, the authors provide a unified geometric language for understanding how the brain (and AI) represents complex, multi-dimensional stimuli.

In summary, the paper establishes that joint coding is geometrically feasible and identifies manifold alignment as the key mechanism, while cautioning that standard experimental constraints in neuroscience may currently mask these signatures.

Manifold geometry underlies a unified code for category and category-independent features