Representational magnitude as a geometric signature of image and word memorability

This is an AI-generated explanation of a preprint that has not been peer-reviewed. It is not medical advice. Do not make health decisions based on this content.

The Big Question: Why Do Some Things Stick in Our Heads?

Have you ever walked into a room and instantly remembered a specific face, but forgot the name of the person you met five minutes ago? Or maybe you can't forget a weird song that got stuck in your head, while a thousand other songs fade away?

Scientists have long wondered: Is memory just about how hard you try to remember, or are some things simply "stickier" than others?

This paper says it's the latter. Some things are just naturally more memorable than others, and the authors found a mathematical "fingerprint" that helps predict how sticky a memory will be.

The Discovery: The "Volume Knob" of Memory

Imagine your brain (or a computer brain) as a giant orchestra. When you see a picture or hear a word, different musicians (neurons or features) start playing.

The Old Idea: We thought the direction of the music mattered most (which specific notes were played).
The New Discovery: The authors found that the volume matters too.

They call this "Representational Magnitude." Think of it like a flashlight.

A dim flashlight (low magnitude) barely illuminates the room. It's easy to miss.
A blindingly bright spotlight (high magnitude) floods the room with light. It's impossible to ignore.

The paper argues that if a stimulus (like an image or a word) turns on more features in your brain, and turns them on very loudly, it leaves a massive "footprint" in your memory. It's not that your brain tries harder to remember it; it's that the initial impression was just so loud and bright that it couldn't be forgotten.

The Experiment: Testing the "Flashlight" Theory

The researchers tested this idea in three different "worlds" to see if the rule applied everywhere.

1. The Visual World (Images) 🖼️

They looked at thousands of pictures (like a photo of a banana or a car). They used a computer model (a neural network) to measure how "bright" the image was in the computer's "mind."

Result: Images that lit up the computer's "brain" most brightly tended to be the ones humans remembered best, although the link was weak. This replicated a previous study and supports the "flashlight" theory for pictures.

2. The Word World (Language) 📖

This was the big test. Does the rule work for words? They took thousands of words (like "freedom," "apple," or "run") and measured how "loud" they were in a computer language model (Word2vec).

Result: Yes! Words that had a "louder" representation in the computer were the ones people remembered best.
The Catch: They checked whether this was just because of how common a word is, how emotional it is, or how long it is. It wasn't. The link held even after accounting for all three. If a word makes a big, strong impression in the language system, you are more likely to remember it.

3. The Sound World (Voices) 🗣️

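Finally, they tried it with human voices. They analyzed recordings of people speaking and measured how "loud" each voice was inside a computer audio model (Wav2vec). Here "loud" means the size of the model's internal response, not the actual volume of the recording.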

Result: No. The rule didn't work here. A "loud" voice in the computer didn't mean the human voice was memorable.
Why? The authors suggest that remembering a voice depends on different things (like the pitch or the accent) rather than the overall "volume" of the sound features. It's like trying to measure a song's quality by how loud the bass is. It just doesn't capture the whole picture.

The "Recall" vs. "Recognition" Twist

There was one more interesting finding.

Recognition: "Have you seen this before?" (Yes/No). The "flashlight" rule held up here, for both pictures and words.
Recall: "Tell me everything you remember." The rule mostly failed here.

The Analogy:
Imagine you are looking for a lost key.

Recognition is like someone showing you a pile of keys and asking, "Is this yours?" If the key is shiny and bright (high magnitude), you spot it instantly.
Recall is like being asked to describe the key from memory without seeing it. Even if the key was bright, you might still struggle to describe it if you weren't actively searching for it.

The "flashlight" helps you spot things, but it doesn't necessarily help you reconstruct them from scratch.

The Takeaway: The "Footprint" Theory

The main lesson of this paper is that much of what makes something memorable is set at the moment of encoding, when you first take it in.

Think of dropping a stone into a pond.

A tiny pebble (low magnitude) makes a tiny ripple that disappears in seconds.
A giant boulder (high magnitude) creates a massive wave that crashes against the shore and leaves a mark.

The authors suggest that the things we remember best aren't the ones we "try" to remember. They are the things that, by their very nature, hit our brains with the most force, activating the most features at once. Whether it's a picture, a word, or a concept, if it leaves a big footprint, it stays.

Summary in One Sentence

Some things are memorable because they are "louder" in our brain's processing system, leaving a bigger, brighter footprint that is harder to erase, a rule that works for pictures and words, but strangely not for voices.

1. Problem Statement

The central question addressed by the authors is: What intrinsic properties of a stimulus make it more memorable than others?
Previous research established that memorability is an inherent property of stimuli, independent of individual differences. In the visual domain, the magnitude of population responses (L2 norm) in both monkey inferotemporal cortex and Convolutional Neural Networks (CNNs) predicts image memorability (Jaegle et al., 2019). Two critical gaps remained:

Replicability: Can this effect be replicated in independent, large-scale image datasets?
Generalizability: Is this "representational magnitude effect" specific to visual perception and CNNs (which mimic the visual cortex), or is it a general principle of distributed representations applicable to other modalities (e.g., language, auditory)?

2. Methodology

The study employed a cross-domain approach, analyzing six large-scale datasets covering visual, lexical, and auditory stimuli.

A. Datasets

Visual: The THINGS dataset (Kramer et al., 2023), containing 26,107 naturalistic object images across 1,854 categories, rated by ~13,000 participants for recognition memorability.
Lexical (Word): Three independent datasets totaling over 8,500 memorability scores from ~800 participants:
- Aka et al. (2023): Recognition and recall scores for 576 words.
- Cox et al. (2018): Recognition and recall scores for 924 words.
- Dymarska et al. (2023): Recognition scores for 5,300+ words.
- (Note: Madan (2021) was used for recall analysis only).
Auditory (Voice): Two experiments from Revsine et al. (2025), containing ~600 voice memorability scores (d') from ~2,700 participants listening to speakers from the TIMIT corpus.

B. Computational Models & Feature Extraction

To quantify "representational magnitude," the authors used pre-trained neural networks to generate vector embeddings for each stimulus:

Images: Features extracted from AlexNet (a standard CNN). The L2 norm (Euclidean length) of the activation vectors was calculated for all layers.
Words: Features extracted from Word2vec (GoogleNews-vectors-negative300). The L2 norm of the 300-dimensional static word embeddings was calculated.
Voices: Features extracted from Wav2vec (a self-supervised model for raw audio). The L2 norm of layer activations was calculated.
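
As a rough sketch of the measure itself, the magnitude of a representation is the Euclidean length of its feature vector. The Python snippet below uses a small made-up array in place of real AlexNet, Word2vec, or Wav2vec features; the function name `representational_magnitude` and the toy values are illustrative assumptions, not the authors' code.

```python
import numpy as np


def representational_magnitude(activations: np.ndarray) -> np.ndarray:
    """Return the L2 norm of each stimulus representation.

    `activations` has one row per stimulus; any extra dimensions
    (e.g. channels x height x width for a conv layer) are flattened first.
    """
    flat = activations.reshape(activations.shape[0], -1)
    return np.linalg.norm(flat, axis=1)


# Toy stand-in for model features: 3 stimuli x 4 features (hypothetical values).
features = np.array([
    [3.0, 4.0, 0.0, 0.0],   # few features, strongly active
    [1.0, 1.0, 1.0, 1.0],   # many features, weakly active
    [0.0, 0.0, 0.0, 0.5],   # almost nothing active
])

magnitudes = representational_magnitude(features)
print(magnitudes)  # norms are 5.0, 2.0 and 0.5
```

Applied to real data, `features` would be replaced by the activations of one network layer (or the 300-dimensional word embeddings), with one row per stimulus.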

C. Statistical Analysis

Primary Analysis: Spearman correlation coefficients were computed between the L2 norm of the representation and the memorability score for each stimulus.
Robustness: 95% confidence intervals were generated via non-parametric bootstrapping (10,000 resamples).
Control Analyses: To rule out confounding variables, partial correlations were performed controlling for:
- Images: Object typicality (both concept-based and DNN-based).
- Words: Word frequency, valence, and word length (size).
Task Comparison: Separate analyses were conducted for Recognition vs. Free Recall memory tasks.
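
The primary analysis and its confidence interval can be sketched as follows. This is a minimal illustration on synthetic data, assuming a percentile bootstrap that resamples stimuli with replacement; the summary above does not say which bootstrap variant the authors used, and the function name `bootstrap_spearman` is hypothetical.

```python
import numpy as np
from scipy.stats import spearmanr


def bootstrap_spearman(x, y, n_resamples=10_000, seed=0):
    """Spearman correlation with a percentile-bootstrap 95% CI.

    Assumption: stimuli (paired x/y values) are resampled with replacement.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    rng = np.random.default_rng(seed)
    rho = spearmanr(x, y)[0]
    n = len(x)
    boot = np.empty(n_resamples)
    for i in range(n_resamples):
        idx = rng.integers(0, n, size=n)          # indices of one resample
        boot[i] = spearmanr(x[idx], y[idx])[0]
    lo, hi = np.percentile(boot, [2.5, 97.5])
    return float(rho), float(lo), float(hi)


# Synthetic stand-ins for per-stimulus L2 norms and memorability scores.
rng = np.random.default_rng(1)
l2_norm = rng.normal(size=200)
memorability = l2_norm + rng.normal(size=200)

# Fewer resamples than the 10,000 reported above, to keep the demo quick.
rho, ci_lo, ci_hi = bootstrap_spearman(l2_norm, memorability, n_resamples=2000)
print(f"rho = {rho:.2f}, 95% CI [{ci_lo:.2f}, {ci_hi:.2f}]")
```

Spearman correlation works on ranks, so it only assumes a monotonic relationship between magnitude and memorability, not a linear one.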

3. Key Results

A. Replication in Visual Domain

The study successfully replicated the findings of Jaegle et al. (2019).

In the THINGS dataset, the L2 norm of AlexNet activations showed no significant correlation with memorability in early layers.
However, a significant positive correlation emerged in later layers (convolutional and fully connected), peaking at Layer 7 ($r = 0.057$, $p < 0.001$).
This relationship remained significant even after controlling for image typicality.

B. Extension to Lexical Domain

The representational magnitude effect generalized to words.

Across all three word datasets, the L2 norm of Word2vec embeddings showed a consistent positive correlation with recognition memorability:
- Aka et al.: $r = 0.32$
- Cox et al.: $r = 0.22$
- Dymarska et al.: $r = 0.47$
Control: The effect persisted after statistically controlling for word frequency, valence, and size, indicating the effect is not merely a proxy for word frequency or emotional content.
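
To make the logic of the control analysis concrete, here is a hedged sketch of one common way to compute a rank-based partial correlation: rank every variable, regress the covariate ranks out of both variables of interest, and correlate the residuals. The summary does not specify the authors' exact procedure, so the function `partial_spearman`, the synthetic variables, and the effect sizes are all assumptions for illustration.

```python
import numpy as np
from scipy.stats import pearsonr, rankdata, spearmanr


def partial_spearman(x, y, covariates):
    """Rank-based partial correlation of x and y given covariates (n or n x k)."""
    cov = np.asarray(covariates, dtype=float)
    if cov.ndim == 1:
        cov = cov[:, None]                         # single covariate -> column
    rx, ry = rankdata(x), rankdata(y)
    ranked_cov = np.column_stack([rankdata(col) for col in cov.T])
    design = np.column_stack([np.ones(len(rx)), ranked_cov])  # intercept + covariates
    # Residuals after removing what the covariates explain (least squares).
    res_x = rx - design @ np.linalg.lstsq(design, rx, rcond=None)[0]
    res_y = ry - design @ np.linalg.lstsq(design, ry, rcond=None)[0]
    return float(pearsonr(res_x, res_y)[0])


# Synthetic example: "frequency" influences both the norm and memorability.
rng = np.random.default_rng(0)
n = 2000
frequency = rng.normal(size=n)
l2_norm = frequency + rng.normal(size=n)
mem_confounded = frequency + rng.normal(size=n)               # no direct link to norm
mem_direct = 0.5 * l2_norm + frequency + rng.normal(size=n)   # direct link to norm

raw = spearmanr(l2_norm, mem_confounded)[0]
partial_confounded = partial_spearman(l2_norm, mem_confounded, frequency)
partial_direct = partial_spearman(l2_norm, mem_direct, frequency)
print(f"raw={raw:.2f}  partial (confound only)={partial_confounded:.2f}  "
      f"partial (direct effect)={partial_direct:.2f}")
```

In the synthetic "confound only" case the raw correlation is clearly positive but shrinks toward zero once frequency is controlled, whereas a genuine direct effect survives the control. The result reported above corresponds to the second pattern.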

C. Failure in Auditory Domain

In the voice memorability datasets (Revsine et al.), no consistent significant relationship was found between the L2 norm of Wav2vec representations and voice memorability.
Correlations were generally near zero or negative, suggesting the effect does not generalize to auditory voice processing in the same way it does for vision and language.

D. Recognition vs. Recall

The effect was robust for Recognition memory across visual and lexical domains.
The effect did not consistently extend to Free Recall. In word datasets, correlations between L2 norm and recall scores were mostly non-significant (except in the Cox dataset, which used paired-associate learning). This suggests the mechanism is specific to item-strength recognition rather than strategic retrieval required for recall.

4. Key Contributions

Cross-Domain Generalization: The paper provides the first evidence that the "representational magnitude effect" (L2 norm predicting memorability) is not limited to the visual cortex or CNNs but applies to lexical representations in non-brain-inspired networks (Word2vec).
Theoretical Unification: It proposes that memorability is an inherent property of encoding strength. Stimuli that activate more features and do so more strongly (leaving a larger "geometric footprint" in representational space) form stronger memory traces.
Mechanistic Insight: The authors connect this geometric property to classical memory models (e.g., SAM, MINERVA). They argue that in dot-product-based similarity spaces, a larger vector magnitude automatically increases "self-similarity," generating a stronger recognition signal without needing auxiliary assumptions.
Boundary Conditions: The study delineates the limits of this effect, showing it does not apply to voice memorability (potentially due to different feature drivers like pitch/prosody vs. semantic features) or free recall tasks (which rely on different neural mechanisms than recognition).
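
The mechanistic argument rests on a simple identity: the dot product of a vector with itself equals its squared L2 norm, $x \cdot x = \lVert x \rVert^2$. The toy sketch below illustrates this with a single stored trace matched against an identical probe. It is not an implementation of SAM or MINERVA; the variable names and the 300-dimensional random vector are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
direction = rng.normal(size=300)
direction /= np.linalg.norm(direction)       # unit-length "content" of the item

scales = [1.0, 2.0, 3.0]                     # hypothetical representational magnitudes
familiarities, cosines = [], []
for scale in scales:
    trace = scale * direction                # trace stored at encoding
    probe = trace.copy()                     # same item shown again at test
    match = float(probe @ trace)             # dot-product similarity = scale ** 2
    cosine = match / (np.linalg.norm(probe) * np.linalg.norm(trace))
    familiarities.append(match)
    cosines.append(float(cosine))

for s, f, c in zip(scales, familiarities, cosines):
    print(f"magnitude={s:.1f}  dot-product match={f:.2f}  cosine={c:.2f}")
```

The dot-product match grows with the square of the magnitude, while cosine similarity stays at 1 because it discards magnitude. This is why the argument applies specifically to dot-product-based similarity spaces.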

5. Significance

Computational Neuroscience: The findings suggest that the alignment between biological and artificial systems extends beyond specific architectures (like CNNs for vision) to a fundamental geometric property of distributed representations.
Memory Theory: The results support strength-based accounts of memory (Signal Detection Theory), suggesting that the "strength" of a memory trace is determined at the moment of encoding by the magnitude of the neural representation, rather than by downstream consolidation processes.
AI and Human Cognition: The results imply that even models not explicitly trained to predict human memory (like Word2vec) capture intrinsic properties of stimuli that correlate with human memorability, bridging the gap between artificial feature spaces and human cognitive performance.

In conclusion, the paper argues that memorability is a geometric signature: the more a stimulus activates a distributed representation (whether in a brain or a neural network), the more likely it is to be remembered.