SEED: Towards More Accurate Semantic Evaluation for Visual Brain Decoding

The paper introduces SEED, a semantic evaluation metric for visual brain decoding that combines three complementary components to align much more closely with human judgments. Applied to current state-of-the-art models, it reveals critical limitations that older metrics hide, and the authors open-source their data to guide future work.

Juhyeon Park, Peter Yongho Kim, Jiook Cha, Shinjae Yoo, Taesup Moon

Published 2026-02-25
📖 5 min read · 🧠 Deep dive

Imagine you are an artist who can paint pictures straight out of someone's head. You ask a person to look at a photo of a teddy bear, then use a computer to translate their brain activity into a new image.

Now, imagine you show that new painting to a group of people and ask, "How close is this to the original teddy bear?"

  • The People say: "It looks like a cat. It's cute, but it's definitely not a bear."
  • The Old Computer Score says: "98% Perfect! Great job!"

This is the problem this paper, SEED, is trying to solve.

The Problem: The "Fake Perfect" Score

For a long time, scientists evaluating brain-decoding models have used a set of standard "rulers" (metrics) to grade how well the AI paints what the brain sees.

Think of these old rulers like a strict geometry teacher. They check if the lines are straight and the colors are in the right spots. If the AI draws a cat instead of a bear, but the cat is the same size and color as the bear, the geometry teacher gives it an A+.

But humans don't grade like that. We care about meaning. If you ask for a bear and get a cat, you failed, even if the cat is drawn perfectly. The old rulers were giving "A+" grades to paintings that were semantically wrong, making researchers think the technology was better than it actually was.

The Solution: SEED (The "Human-Like" Judge)

The authors created a new grading system called SEED (Semantic Evaluation for Visual Brain Decoding). Instead of just checking lines and pixels, SEED tries to grade the painting the way a human does. It uses three different "judges" to give a final score:

  1. The Object Detective (Object F1):

    • Analogy: Imagine a game of "I Spy."
    • How it works: This judge looks at the original photo and the AI's painting and asks, "Did the AI find the main things?" If the original has a dog and a ball, does the painting have a dog and a ball? If the AI swapped the dog for a cat, this judge gives a low score. It's like checking if the ingredients in a cake are actually flour and eggs, not just if the cake looks round.
  2. The Storyteller (Cap-Sim):

    • Analogy: Imagine two people describing a photo to a blind friend.
    • How it works: This judge asks an AI to write a sentence describing the original photo and another sentence describing the AI's painting. Then it compares how close the two stories are in meaning.
    • Example: If the original is "A man skiing on a snowy hill" and the painting is "A woman skiing on a sunny beach," the stories are very different. Even if the shapes look similar, the story is wrong. This catches details like gender, background, and actions that the Object Detective might miss.
  3. The Vibe Checker (EffNet):

    • Analogy: A quick gut feeling.
    • How it works: This is an existing metric built on a pretrained image-recognition network (EfficientNet). It checks the overall "feel" and structure of the image, acting as a safety net to make sure the picture isn't just a random mess. (A rough code sketch of all three judges follows this list.)
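
To make the three judges more concrete, here is a minimal Python sketch of how scores like these could be computed. This is an illustration under stated assumptions, not the paper's actual implementation: the function names are hypothetical, the object lists are assumed to come from some off-the-shelf object detector, and the caption embeddings and image features are assumed to be precomputed by a sentence encoder and an EfficientNet, respectively.

```python
from collections import Counter

import numpy as np


def object_f1(true_objects, decoded_objects):
    """Hypothetical "Object Detective": F1 overlap between the object labels
    a detector finds in the original image and in the reconstruction.
    Both inputs are lists of category names."""
    true_counts = Counter(true_objects)
    decoded_counts = Counter(decoded_objects)
    # Objects that appear in both images (duplicates counted).
    overlap = sum((true_counts & decoded_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(decoded_counts.values())
    recall = overlap / sum(true_counts.values())
    return 2 * precision * recall / (precision + recall)


def cap_sim(orig_caption_emb, recon_caption_emb):
    """Hypothetical "Storyteller": cosine similarity between sentence
    embeddings of two machine-written captions, one per image."""
    a = np.asarray(orig_caption_emb, dtype=float)
    b = np.asarray(recon_caption_emb, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def effnet_score(orig_features, recon_features):
    """Hypothetical "Vibe Checker": Pearson correlation between EfficientNet
    feature vectors of the two images; higher means a closer overall match."""
    a = np.asarray(orig_features, dtype=float)
    b = np.asarray(recon_features, dtype=float)
    a, b = a - a.mean(), b - b.mean()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
```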

The Final Score: SEED takes the average of the three judges' scores. If the AI gets a high score, it means it got the objects right, the story right, and the vibe right.
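
Continuing the same sketch: since the post describes the final score as a plain average of the three judges, combining them is a one-liner. The inputs below are made up purely for illustration.

```python
def seed_score(f1, cap, eff):
    """Toy combination: the average of the three judges' scores,
    each assumed to lie roughly in [0, 1]."""
    return (f1 + cap + eff) / 3.0


# Right objects, near-identical captions and features: every judge
# is happy, so the toy SEED score comes out close to 1.
print(seed_score(
    object_f1(["dog", "ball"], ["dog", "ball"]),        # 1.0
    cap_sim([0.1, 0.9, 0.2], [0.12, 0.88, 0.25]),       # ~1.0
    effnet_score([0.3, 0.5, 0.7], [0.32, 0.49, 0.71]),  # ~1.0
))
```

Swap the dog for a cat and the Object F1 term collapses, usually dragging the caption term down with it, so the average falls even when pixel-level metrics stay flattering.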

What Did They Find?

The authors tested this new system on the best brain-decoding models currently available (the "champions" of the field).

  • The Shocking Result: Even the "champion" models, which were getting near-perfect scores on the old rulers, were actually failing the SEED test.
  • The "Near-Miss" Problem: They found that models often confuse similar things. They might turn a dog into a cat, or a truck into a bus. To the old rulers, this was a small mistake. To SEED (and humans), it's a big failure.
  • The Missing Details: Sometimes the models got the main object right (a bird) but missed the details (the bird was facing the wrong way, or the background was a jungle instead of a forest).

Why Does This Matter?

Think of it like training a student.

  • If you only use the Old Rulers, you tell the student, "You're doing great!" even when they are drawing the wrong animal. The student stops trying to improve because they think they've already won.
  • With SEED, you tell the student, "You drew a cat, but I asked for a bear. You need to learn the difference."

The Takeaway

This paper is a wake-up call. It says, "Stop using the old, broken rulers that give fake perfect scores." By using SEED, researchers can finally see the real mistakes their AI is making. This will help them build better brain-decoding tools that don't just look "okay" to a computer, but actually make sense to human brains.

In short: SEED is the new, honest teacher that makes sure the AI is actually learning to see what we see, not just guessing the right shape.
