Gradient-Based Severity Labeling for Biomarker Classification in OCT

This paper proposes a novel contrastive learning strategy for medical images that replaces arbitrary augmentations with disease severity labels derived from anomaly detection gradients to improve biomarker classification accuracy in OCT scans.

Kiran Kokilepersaud, Mohit Prabhushankar, Ghassan AlRegib, Stephanie Trejo Corona, Charles Wykoff

Published 2026-02-24

Imagine you are trying to teach a computer to spot tiny, specific problems in a patient's eye scans (called OCT scans), like finding a single drop of water in a massive ocean or a tiny crack in a windshield. These problems are called biomarkers.

The big challenge? Doctors are busy, and labeling thousands of these scans with "yes, there's a problem here" or "no, it's clean" takes a lot of time and money. So, we have a huge pile of unlabeled scans (we don't know what's in them) and a tiny pile of labeled scans (we know exactly what's in them).

This paper proposes a clever way to use that huge pile of unlabeled scans to help the computer learn better, without needing a doctor to label every single one. Here is how they did it, using some simple analogies:

1. The Problem with "Random" Learning

Usually, when computers learn from unlabeled data, they use a trick called Contrastive Learning. Think of this like a game of "Find the Match."

  • The Old Way: You take one photo, apply random filters (like blurring it or changing the colors), and tell the computer, "These two look the same." Then you show it a totally different photo and say, "This one is different."
  • The Medical Problem: In medical images, those random filters are dangerous. If you blur an eye scan, you might accidentally blur out the tiny biomarker you are trying to find! It's like trying to find a specific crack in a windshield by smearing the glass with Vaseline. You might miss the crack entirely.
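The windshield analogy is easy to demonstrate with a toy example. The sketch below (illustrative numpy code, not from the paper) plants a tiny bright "biomarker" in a noisy image and applies a box blur of the kind a random augmentation pipeline might use; the blur dilutes the biomarker's peak almost down to the noise floor.

```python
import numpy as np

# Toy 32x32 "scan": low-amplitude noise plus a tiny 2x2 bright "biomarker".
rng = np.random.default_rng(0)
scan = rng.normal(0.0, 0.05, size=(32, 32))
scan[15:17, 15:17] += 1.0  # the feature we actually care about

def box_blur(img, k=5):
    """Naive k x k box blur (a plain loop; fine for a toy example)."""
    pad = k // 2
    padded = np.pad(img, pad, mode="edge")
    out = np.zeros_like(img)
    for i in range(img.shape[0]):
        for j in range(img.shape[1]):
            out[i, j] = padded[i:i + k, j:j + k].mean()
    return out

blurred = box_blur(scan)
# The 2x2 biomarker is averaged over a 5x5 window, so its peak intensity
# collapses from ~1.0 to roughly 4/25 of that — barely above the noise.
print(scan.max(), blurred.max())
```

A "blurred view" of the scan is supposed to be a harmless alternate view of the same image, but here it has effectively erased the one structure the classifier needs to find.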

2. The New Idea: Grouping by "Sickness Level"

Instead of random filters, the authors asked: What if we group scans based on how "sick" they look?

Imagine a hospital waiting room.

  • Healthy people are sitting in one corner.
  • People with a slight cold are in another.
  • People with a severe flu are in a third.
  • People with a critical condition are in the ICU.

Even if we don't know exactly what disease each person has, we can tell who looks "sicker" than the others. The authors realized that scans with similar "sickness levels" (severity) likely share similar structural features, making them perfect "matches" for the computer to learn from.

3. The Magic Tool: The "Gradient Score"

But how do we know who is "sicker" without a doctor? We can't just ask the computer to guess.

The authors used a clever mathematical trick involving Gradients.

  • The Analogy: Imagine the computer has a "muscle memory" of what a healthy eye looks like.
  • When it looks at a healthy eye, it barely needs to adjust its thinking. It's like walking on a flat path; your muscles stay relaxed.
  • When it looks at a sick eye, it has to "stretch" and "strain" to understand what it's seeing. It has to make a big mental adjustment.

The authors measured exactly how much the computer had to "strain" (the gradient) to understand each scan.

  • Low Strain = Healthy (Low Severity Score).
  • High Strain = Sick (High Severity Score).

This gave them a "Severity Score" for every single unlabeled scan, effectively sorting the unlabeled pile into buckets from "Very Healthy" to "Very Sick."
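To make the "strain" idea concrete, here is a minimal sketch, assuming a linear autoencoder stands in for the paper's anomaly-detection network (the real model is a trained neural network; this toy "model" simply reconstructs a fixed healthy subspace perfectly). The severity score is the norm of the gradient of the reconstruction loss with respect to the model's weights: near zero on a healthy scan, large on a sick one.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 16
# Hypothetical stand-in for a network trained on healthy scans: a linear
# autoencoder x_hat = W @ x whose weights W project onto a 4-dim "healthy"
# subspace, so healthy inputs are reconstructed perfectly.
B = np.linalg.qr(rng.normal(size=(d, 4)))[0]   # orthonormal healthy basis
W = B @ B.T                                    # projector onto that subspace

def severity_score(x, W):
    """Gradient norm of the reconstruction loss L = ||x - W x||^2 wrt W.
    dL/dW = -2 (x - W x) x^T, and the Frobenius norm of an outer product
    factors into the two vector norms: 2 * ||residual|| * ||x||."""
    residual = x - W @ x
    return 2.0 * np.linalg.norm(residual) * np.linalg.norm(x)

healthy = B @ rng.normal(size=4)               # lies in the healthy subspace
sick = healthy + 0.8 * rng.normal(size=d)      # off-subspace "pathology"
print(severity_score(healthy, W), severity_score(sick, W))  # tiny vs large
```

Sorting unlabeled scans by this score and slicing the sorted list into quantiles is one simple way to form the "Very Healthy" to "Very Sick" buckets.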

4. The Training Process

Once they had these "Severity Buckets," they taught the computer in two steps:

  1. Step 1 (The Grouping): They told the computer, "All the scans in the 'Medium Sickness' bucket are similar to each other. All the scans in the 'Severe Sickness' bucket are similar to each other." They used this to build a strong mental map of eye structures.
  2. Step 2 (The Fine-Tuning): After the computer learned the general map, they took the small pile of actually labeled data (where doctors said "This is IRF," "This is DME," etc.) and gave the computer a quick final exam to learn the names of the specific biomarkers.
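Step 1 amounts to a supervised-contrastive objective in which the severity buckets play the role of labels. The sketch below is a simplified numpy version (not the paper's exact loss): embeddings in the same bucket are pulled together, embeddings in different buckets pushed apart.

```python
import numpy as np

def supcon_loss(z, buckets, temp=0.1):
    """Supervised contrastive loss with severity buckets as pseudo-labels.
    z: (n, d) L2-normalized embeddings; buckets: (n,) integer bucket ids."""
    n = len(buckets)
    sim = z @ z.T / temp                        # pairwise cosine similarities
    np.fill_diagonal(sim, -np.inf)              # a scan is not its own positive
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    same = (buckets[:, None] == buckets[None, :]) & ~np.eye(n, dtype=bool)
    per_anchor = [log_prob[i][same[i]].mean() for i in range(n) if same[i].any()]
    return -np.mean(per_anchor)

# Toy check: the same four embeddings, scored with correct buckets and
# with scrambled buckets. Correct grouping should give a much lower loss.
z = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
good = supcon_loss(z, np.array([0, 0, 1, 1]))   # positives coincide: low loss
bad = supcon_loss(z, np.array([0, 1, 0, 1]))    # positives far apart: high loss
print(good, bad)
```

Step 2 is then ordinary supervised fine-tuning of a small classification head on the doctor-labeled pile, with the contrastively pretrained encoder providing the "mental map."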

The Result

By using this "Severity Sorting" method instead of random blurring, the computer got much better at spotting the tiny biomarkers.

  • It improved accuracy by up to 6% compared to other methods.
  • It proved that you don't need to know the exact disease name to learn from data; you just need to know how "abnormal" the image looks compared to a healthy one.

In a Nutshell

Instead of guessing what's in the dark, the authors built a "sickness meter" using math. They sorted thousands of unlabeled eye scans from "Healthy" to "Sick" based on how hard the computer had to work to understand them. This allowed the computer to learn the patterns of disease much faster and more accurately, helping doctors detect eye diseases earlier.
