Comparison of Deep Learning Tools for Optic Nerve Axon Quantification Finds Limited Generalizability on Independent Validation

This study reveals that while deep learning tools for optic nerve axon quantification demonstrate strong performance in their original studies, they exhibit significant generalizability gaps and reduced accuracy when validated on independent datasets, highlighting the urgent need for standardized multi-center testing before widespread adoption.

Chuter, B., Emmert, N., Kim, M. Y., Dave, N., Herrin, J., Zhou, Z., Wall, G., Palmer, A., Chen, H., Hollingsworth, T. J., Jablonski, M. M.

Published 2026-03-13

This is an AI-generated explanation of a preprint that has not been peer-reviewed. It is not medical advice. Do not make health decisions based on this content.

Imagine you are trying to count the tiny threads (axons) inside a bundle of wires (the optic nerve) to see if a disease like glaucoma is damaging them. Doing this by hand is like trying to count every single grain of sand on a beach while wearing thick gloves—it takes forever, and different people will get different counts.

To solve this, scientists built "smart robots" (Deep Learning Models) that can look at microscope pictures and count these threads automatically. These robots were trained in specific labs and showed off amazing results, claiming they were almost perfect.

But here is the twist: This paper asks, "What happens when we take these robots out of their home labs and send them to a completely different lab to do the same job?"

The Story of the "Over-Confident Robots"

Think of these AI models as students who aced a practice test.

  • The Practice Test (Original Studies): In their original papers, these robots (named AxoNet, AxonDeep, and AxoNet 2.0) got 96% to 99% of the answers right. They looked like geniuses.
  • The Real Exam (This Study): The authors of this paper took those robots (with one substitution, explained below) and gave them a brand new test using pictures from a different lab, with slightly different lighting and different types of rats.

The Result? The robots didn't fail, but they definitely didn't ace the test anymore. Their scores dropped significantly.

The Three Main Takeaways

1. The "Home Court Advantage" Disappears

In sports, a team might win 90% of games at their home stadium but only 60% when they travel. These AI models behave the same way.

  • What happened: When the robots were tested on new data, their accuracy dropped. The correlation (how well they matched the human experts) fell from a near-perfect 0.97 down to a "good but not great" 0.79 to 0.89 (see the short sketch after this list for what those numbers mean).
  • The Analogy: It's like a chef who makes a perfect burger in their own kitchen with their specific oven and ingredients. If you ask them to make that same burger in a different kitchen with a different stove, it might still taste okay, but it won't be the exact same masterpiece.
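
What does a correlation of 0.97 versus 0.80 actually look like? Here is a minimal, self-contained Python sketch. Everything in it is invented for illustration: the axon counts are made up, and it assumes the agreement metric is a Pearson correlation between model counts and expert counts (the preprint may use a different statistic).

```python
# Minimal sketch: how a correlation between model counts and expert
# counts is computed. All numbers are invented for illustration; they
# are NOT data from the preprint.
from scipy.stats import pearsonr

# Hypothetical axon counts from a human expert on ten nerve images.
expert_counts = [820, 910, 1050, 760, 980, 1120, 690, 870, 1010, 940]

# Hypothetical counts from a model that tracks the expert closely
# (its "home lab" data) versus one that drifts on images from a new lab.
model_home    = [810, 925, 1040, 770, 990, 1100, 700, 860, 1025, 930]
model_new_lab = [700, 940,  880, 790, 850, 1150, 620, 900,  870, 990]

for name, counts in [("home lab", model_home), ("new lab", model_new_lab)]:
    r, _ = pearsonr(expert_counts, counts)
    print(f"{name}: r = {r:.2f}")
```

Roughly speaking, a correlation near 1.0 means the model's counts rise and fall almost perfectly in step with the expert's; around 0.8, the overall trend survives, but individual images can be counted noticeably wrong.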

2. The "Picky Eater" Problem

The study looked at how the robots counted. They found a funny pattern:

  • High Precision (The Picky Eater): When the robot said, "That is an axon," it was usually right. It rarely made mistakes about what it did see.
  • Low Recall (The Missed Opportunity): However, the robot missed a huge chunk of the axons it should have seen.
  • The Analogy: Imagine a very strict security guard at a club. Everyone they wave through has a valid ID (no fakes get in), but they are so slow and cautious that they only get through 20% of the line. They are "accurate" about the people they check, yet they miss most of the crowd. (The short code sketch after this list puts numbers on this.)
  • Why it matters: If you are just counting "how many," this might be okay. But if you need to measure the size of the axons (to see if they are shrinking), the robot is failing because it's ignoring the smaller, harder-to-see ones.
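
To put numbers on the "picky eater" pattern, here is a small Python sketch of how precision and recall are computed. The counts are hypothetical, chosen to mimic the behavior described above; they are not results from the preprint.

```python
# Illustrative precision/recall calculation. The counts are made up to
# mimic the "picky eater" pattern; they are not results from the preprint.

def precision(tp: int, fp: int) -> float:
    """Of everything the model labeled an axon, what fraction really was one?"""
    return tp / (tp + fp)

def recall(tp: int, fn: int) -> float:
    """Of all real axons in the image, what fraction did the model find?"""
    return tp / (tp + fn)

# Hypothetical: the model marks 500 axons, 490 of them correctly,
# but the image actually contains 1000 axons, so it misses 510.
tp, fp, fn = 490, 10, 510

print(f"precision = {precision(tp, fp):.2f}")  # 0.98: rarely wrong about what it flags
print(f"recall    = {recall(tp, fn):.2f}")     # 0.49: misses about half the axons
```

This is why a model can post an impressive precision score while still being unusable for size measurements: it can be right about everything it flags and still overlook the small, faint axons that matter most when you are looking for shrinkage.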

3. The "Black Box" Issue

One of the most famous robots, "AxonDeep," was so good in its original paper that everyone wanted to use it. But the authors couldn't test it because the code was hidden (like a secret recipe).

  • They tried a "cousin" robot called AxonDeepSeg instead.
  • The Surprise: The robot that had the lowest scores in its original paper (AxoNet 2.0) actually turned out to be the most reliable when tested on the new data.
  • The Lesson: Just because a model claims to be the "best" in its own study doesn't mean it will be the best for you.

Why Should You Care?

This paper is a reality check for the medical world.

  • The Good News: These tools are still useful. They are better than doing nothing, and they are faster than humans.
  • The Bad News: We cannot just download these tools and trust them blindly. If a lab in California uses a tool trained in New York, the results might be off.
  • The Solution: We need "Standardized Driving Tests." Before these robots are allowed on the road (used in real medical research), they need to be tested on the same standard set of data by different labs to prove they can handle the real world.

The Bottom Line

These AI models are like brilliant students who studied hard for a specific test but haven't learned how to adapt to new questions yet. They are promising tools, but scientists need to be careful, test them rigorously in new environments, and not trust the "perfect scores" from their original papers without a second look.
