Imagine you are trying to figure out how many unique ingredients are in a giant, complex soup.
In the world of neuroscience and artificial intelligence, "neural representations" are like that soup. When a brain (or a computer) sees a picture of a cat, thousands of neurons fire at once. Each neuron adds a little bit of flavor. The "dimensionality" is simply a measure of how many distinct, independent flavors are actually contributing to that taste. Is it just "cat-ness"? Or is it a complex mix of "fur texture," "whisker shape," "eye color," and "tail movement"?
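For the mathematically inclined: a standard way to make "number of distinct flavors" precise is the participation ratio of the covariance eigenvalues. The summary above doesn't spell out the paper's exact definition, so take this as the most common formalization of effective dimensionality, not necessarily the authors' precise one.

```latex
% Participation ratio: an effective count of the eigenvalues
% \lambda_1, \lambda_2, \dots of the covariance matrix C of the responses.
\mathrm{PR} = \frac{\left(\sum_i \lambda_i\right)^{2}}{\sum_i \lambda_i^{2}}
            = \frac{\operatorname{tr}(C)^{2}}{\operatorname{tr}(C^{2})}
```

If exactly d flavors contribute with equal strength (d equal eigenvalues, the rest zero), PR comes out to exactly d; unequal contributions pull it below d.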
The Problem: The "Small Spoon" Mistake
For a long time, scientists tried to count these flavors by taking a spoonful of the soup (a small sample of data) and guessing the total number of ingredients based on that spoonful.
The paper points out a major flaw in this method: The size of your spoon matters too much.
- The Naive Approach: If you use a tiny spoon (few data points), you might only taste the salt and miss the pepper. You'll think the soup has only one flavor.
- The Big Spoon: If you use a giant ladle (lots of data), you taste everything and realize there are actually 50 flavors.
Existing methods were like a scale that reads low until you pile enough onto it: the number you got depended not just on the soup but on how big your sample was. If you only had a few neurons recorded or a few images shown, the math would systematically understate the brain's complexity, making it look much simpler than it really is. Statisticians call this kind of systematic error bias.
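To see the bias concretely, here is a minimal sketch (in Python/NumPy, not the authors' code) using the participation ratio defined above: we build data with exactly 50 equal "flavors" and watch the naive estimate collapse when the sample is small.

```python
# A minimal sketch (not the authors' code) of the "small spoon" problem:
# the naive participation ratio, computed from a sample covariance matrix,
# falls far below the truth when samples are scarce.
import numpy as np

rng = np.random.default_rng(0)

def naive_pr(X):
    """Naive participation ratio tr(C)^2 / tr(C^2) of the sample covariance."""
    C = np.cov(X, rowvar=False)
    return np.trace(C) ** 2 / np.trace(C @ C)

# 50 equal-strength "flavors" mixed into 200 neurons: true PR is exactly 50.
Q, _ = np.linalg.qr(rng.standard_normal((200, 50)))  # orthonormal mixing
for n_samples in [10, 50, 200, 5000]:
    X = rng.standard_normal((n_samples, 50)) @ Q.T   # shape (n_samples, 200)
    print(f"{n_samples:5d} samples -> naive PR = {naive_pr(X):.1f}")
# The sample covariance from n samples has rank at most n - 1, so with
# 10 samples the estimate cannot exceed 9; it only creeps toward the
# true value of 50 as the sample grows.
```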
The Solution: The "Bias-Corrected" Recipe
The authors of this paper (Chanwoo Chun, Abdulkadir Canatar, et al.) invented a new mathematical recipe to fix this. They realized that the error comes from the small sample accidentally tasting the same ingredient twice (pairing a data point with itself in the math) while missing other ingredients entirely.
They created a corrected estimator (a new way of doing the math) that:
- Ignores the spoon size: On average, it gives you the same answer whether you have 10 data points or 10,000.
- Filters out the noise: Real-world data is messy (like a soup with some dirt in it). Their method can tell the difference between a real flavor and a speck of dirt.
- Works with "Local" flavors: They can also zoom in to see how complex a specific part of the soup is, not just the whole pot.
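The paper's actual construction is more sophisticated, but the core idea of removing the bias can be illustrated with a simple trick: when estimating the two traces in the participation ratio, never let a data point be paired with itself. The half-split scheme below is an illustrative assumption (a minimal sketch, not the authors' estimator):

```python
# A hedged sketch of the debiasing idea (not the paper's exact estimator).
# Two independent halves of the data give independent covariance estimates
# S1 and S2, so E[tr(S1 S2)] = tr(C^2) and E[tr(S1) tr(S2)] = tr(C)^2
# exactly, at any sample size. The final ratio is only approximately unbiased.
import numpy as np

rng = np.random.default_rng(1)

def naive_pr(X):
    C = np.cov(X, rowvar=False)
    return np.trace(C) ** 2 / np.trace(C @ C)

def corrected_pr(X, n_splits=20):
    """Average tr(S1)tr(S2) and tr(S1 S2) over random half-splits of X."""
    n = X.shape[0]
    num = den = 0.0
    for _ in range(n_splits):
        perm = rng.permutation(n)
        S1 = np.cov(X[perm[: n // 2]], rowvar=False)
        S2 = np.cov(X[perm[n // 2 :]], rowvar=False)
        num += np.trace(S1) * np.trace(S2)
        den += np.trace(S1 @ S2)
    return num / den

# Same synthetic soup as before: 50 equal flavors in 200 neurons, true PR = 50.
Q, _ = np.linalg.qr(rng.standard_normal((200, 50)))
for n in [20, 100, 1000]:
    X = rng.standard_normal((n, 50)) @ Q.T
    print(f"n={n:5d}  naive={naive_pr(X):5.1f}  corrected={corrected_pr(X):5.1f}")
# Typical behavior: the naive column climbs slowly toward 50 as n grows,
# while the corrected column scatters around 50 even for small n.
```

This also shows the trade-off: debiasing makes the estimate hover around the right value, but it does not remove sampling noise, which is why the corrected numbers still wobble for tiny samples.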
How They Tested It
To prove their new recipe works, they did three things:
- Synthetic Soup (Fake Data): They made up a perfect soup where they knew the exact number of ingredients (e.g., exactly 50). The old "naive" method with small samples guessed far too low (e.g., 10 or 20). Their new method landed on 50, give or take small random fluctuations, no matter how small the sample was (the sketch above runs a version of this check).
- Real Brain Soup: They applied this to real data from:
  - Mouse brains (watching them look at images).
  - Monkey brains (recording electrical signals).
  - Human brains (using MRI scans).
  - Result: The old methods kept changing their minds as they added more data. The new method stayed steady, revealing the true complexity of the brain's activity.
- AI Soup (Large Language Models): They looked at how AI models (like the ones powering chatbots) "think." They found that as you go deeper into the AI's layers, the complexity of its thinking changes in a specific pattern. The old methods missed the fine details of this pattern, but the new method revealed it clearly.
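Here is what a layer-by-layer measurement could look like in practice. This is a hedged sketch, not the authors' pipeline: the model ("gpt2"), the three probe sentences, and the use of the naive estimator are all illustrative assumptions.

```python
# A sketch of layer-wise dimensionality in a language model (illustrative,
# not the paper's setup): collect token representations at every layer and
# compute the participation ratio of each layer's covariance.
import numpy as np
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModel.from_pretrained("gpt2", output_hidden_states=True).eval()

sentences = [
    "The cat sat on the mat.",
    "Neural representations can be surprisingly high dimensional.",
    "Soup tastes better with more ingredients.",
]

def naive_pr(X):
    C = np.cov(X, rowvar=False)
    return np.trace(C) ** 2 / np.trace(C @ C)

per_layer = None
with torch.no_grad():
    for s in sentences:
        out = model(**tok(s, return_tensors="pt"))
        # out.hidden_states: one (1, seq_len, hidden_dim) tensor per layer,
        # including the input embedding layer.
        if per_layer is None:
            per_layer = [[] for _ in out.hidden_states]
        for i, h in enumerate(out.hidden_states):
            per_layer[i].append(h[0].numpy())

for i, chunks in enumerate(per_layer):
    X = np.concatenate(chunks, axis=0)  # (total_tokens, hidden_dim)
    print(f"layer {i:2d}: naive PR = {naive_pr(X):.1f}")
# With only a few dozen tokens, these naive numbers are exactly the kind of
# sample-size-limited estimates the paper's corrected method is built to fix.
```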
The Big Picture: Why Does This Matter?
Think of dimensionality as a measure of how many independent pieces of information a system is holding at once.
- For Brain-Computer Interfaces (BCI): If you want to build a device that lets a paralyzed person control a robotic arm with their thoughts, you need to know exactly how many "control knobs" (dimensions) the brain has. If you guess wrong because of a small sample, your device won't work well. This new method ensures the device is tuned to the brain's true complexity.
- For AI Safety: If we want to understand what an AI is "thinking" about dangerous topics, we need to know how complex its internal representation is. This tool helps us peek inside the "black box" of AI more accurately.
- For Science: It stops researchers from drawing wrong conclusions just because they didn't have enough data. It levels the playing field so that a study with 50 participants can be compared fairly to one with 5,000.
In a Nutshell
The paper says: "Stop guessing the complexity of a system based on how much data you happened to collect. We have a new math trick that corrects for the size of your sample, so you can always find the true number of 'flavors' in the mix."
It's like having a magic spoon that tells you the true recipe of the soup, even if you only took a tiny sip.