Weakly Supervised Patch Annotation for Improved Screening of Diabetic Retinopathy

This paper introduces SAFE, a two-stage framework that leverages weak supervision, contrastive learning, and feature-space ensemble methods to systematically expand sparse expert annotations of diabetic retinopathy lesions, thereby significantly improving both patch-level detection accuracy and downstream disease classification performance.

Shramana Dey, Abhirup Banerjee, B. Uma Shankar, Ramachandran Rajalakshmi, Sushmita Mitra

Published 2026-03-05

The Big Problem: The "Blind Spot" in Medical Scans

Imagine a doctor looking at a high-resolution photo of a patient's retina (the back of the eye) to check for Diabetic Retinopathy (DR). DR is a disease that damages the eye and can cause blindness if not caught early.

The problem is that the early signs of this disease are tiny, subtle spots (lesions) that look very similar to the normal background. To train a computer (AI) to find these spots, you need to show it thousands of photos where a human expert has drawn a box around every single bad spot.

But here's the catch: Drawing those boxes is incredibly hard, slow, and expensive. Most existing datasets only have a few boxes drawn, or the boxes are drawn loosely, covering healthy tissue along with the bad spots. It's like trying to teach a child to identify "bad apples" in a basket, but you only point to a few apples and sometimes accidentally point to the basket itself. The AI gets confused and learns poorly.

The Solution: SAFE (The Smart Detective)

The authors propose a new system called SAFE (Similarity-based Annotation via Feature-space Ensemble). Think of SAFE as a super-smart detective that can fill in the missing pieces of a puzzle using logic and pattern recognition, rather than needing a human to point at every single piece.

SAFE works in two main stages:

Stage 1: The "Training Camp" (Learning the Vibe)

Imagine you have a small group of students (the AI models) and a few textbooks with some highlighted sentences (the partially labeled images).

  • The students study these examples.
  • Instead of just memorizing the words, they learn the "vibe" or the feeling of a "sick" patch versus a "healthy" patch.
  • They use a technique called Contrastive Learning. Think of this as a game of "Spot the Difference." The AI is forced to learn that two patches that look similar (both healthy) should feel the same, while two patches that look different (one healthy, one sick) should feel very distinct.
  • They build a mental map (an embedding space) where healthy patches are grouped together in one corner of the room, and sick patches are in another.
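The "spot the difference" game above can be sketched as a classic margin-based contrastive loss (in the style of Hadsell et al.): same-class patch embeddings are pulled together, different-class embeddings are pushed at least a margin apart. This is a simplified illustration, not the paper's exact objective; the function name and the margin value are assumptions.

```python
import numpy as np

def contrastive_pair_loss(z1, z2, same_class, margin=1.0):
    """Margin-based contrastive loss for one pair of patch embeddings.

    If the patches share a class, the loss grows with their distance
    (pulling them together). If they differ, the loss is zero once they
    are at least `margin` apart (pushing them to distinct regions).
    A minimal sketch; the margin of 1.0 is illustrative.
    """
    d = np.linalg.norm(z1 - z2)
    if same_class:
        return float(d ** 2)
    return float(max(0.0, margin - d) ** 2)
```

Training with a loss like this is what carves the embedding space into the "healthy corner" and "sick corner" described above: pairs the model gets wrong produce a large loss, and gradient updates move their embeddings accordingly.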

Stage 2: The "Group Detective Work" (Filling in the Blanks)

Now, the students face a huge room full of unmarked photos (the unlabeled data).

  • The Ensemble: Instead of relying on just one student's opinion, SAFE uses a team of three independent detectives (an ensemble).
  • The Search: For every unknown photo, the detectives look at their mental map. They ask, "Who does this photo look most like?" They find the top 25 closest neighbors in the map.
  • The Vote: If 20 out of 25 neighbors are "Sick," the photo is likely "Sick." If 20 are "Healthy," it's "Healthy."
  • The Safety Net (Abstention): This is the clever part. If the neighbors are split (e.g., 12 say Sick, 13 say Healthy), or if the photo is an outlier that doesn't sit close to any labeled examples, the team refuses to guess and marks it as "Undecided."
    • Analogy: It's better to say "I don't know" than to guess wrong and send a healthy patient for unnecessary treatment. This keeps the data clean and trustworthy.
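The search-vote-abstain loop above can be sketched in a few lines. The function names, the 25-neighbor count, the 0.8 agreement threshold, and the rule that all three ensemble members must agree are illustrative assumptions, not the paper's exact settings:

```python
from collections import Counter

def vote_with_abstention(neighbor_labels, min_agreement=0.8):
    """Majority vote over the labels of the nearest labeled neighbors.

    Returns the winning label only if it reaches the agreement
    threshold (e.g., 20 of 25 neighbors); otherwise abstains.
    """
    counts = Counter(neighbor_labels)
    label, n = counts.most_common(1)[0]
    if n / len(neighbor_labels) >= min_agreement:
        return label
    return "Undecided"

def ensemble_vote(votes_per_model):
    """Keep a pseudo-label only when every 'detective' in the
    ensemble independently reached the same confident answer."""
    first = votes_per_model[0]
    if first != "Undecided" and all(v == first for v in votes_per_model):
        return first
    return "Undecided"
```

For example, `vote_with_abstention(["Sick"] * 20 + ["Healthy"] * 5)` clears the threshold, while a 13-to-12 split does not: the split case abstains rather than guessing, which is exactly the "I don't know" safety net.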

Why Is This a Big Deal?

  1. It Saves Time: You don't need a human expert to draw a box around every single tiny lesion. SAFE can take a partially labeled image and automatically label the rest of the tiny spots with high accuracy.
  2. It Catches the Tiny Stuff: Because SAFE looks at small "patches" (tiny squares of the image) rather than processing the entire photo at once, it can spot minute lesions that other AI models miss.
  3. It's Honest: By having an "Undecided" category, it avoids making up false alarms.
  4. It Makes Other AI Better: When the researchers used the new labels created by SAFE to train other AI models, those models got much better at diagnosing the disease. In some tests, the ability to detect the disease improved by over 50% (measured by a specific score called AUPRC).
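The AUPRC score mentioned above is the area under the precision-recall curve, which rewards a model for ranking truly diseased cases ahead of healthy ones. As a rough sketch, it can be computed with the standard step-wise average-precision formula (this is the general metric definition, not code from the paper):

```python
import numpy as np

def average_precision(y_true, scores):
    """Average precision, a standard estimate of the area under the
    precision-recall curve: the mean of precision@k taken at each
    rank k where a true positive appears.
    """
    order = np.argsort(-np.asarray(scores, dtype=float))  # best-scored first
    y = np.asarray(y_true)[order]
    tp = np.cumsum(y)                                     # true positives so far
    precision = tp / np.arange(1, len(y) + 1)             # precision at each rank
    return float(np.sum(precision[y == 1]) / y.sum())
```

A perfect ranking (all sick patients scored above all healthy ones) gives 1.0, and a weak ranking drifts toward the disease prevalence, which is why a 50%+ relative jump in this score is a meaningful improvement.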

The Results in Plain English

The researchers tested SAFE on four different medical datasets.

  • The Score: It achieved 98.8% accuracy in separating healthy patches from diseased ones.
  • The Impact: When they used SAFE's "auto-filled" labels to train a standard AI, the AI became significantly better at finding sick patients.
  • The Validation: Real eye doctors (ophthalmologists) looked at the results and confirmed that SAFE was focusing on the right medical signs, not just random noise.

Summary Analogy

Imagine you are trying to sort a massive pile of mixed-up LEGO bricks (retina images) into "Red" (Sick) and "Blue" (Healthy) buckets.

  • Old Way: You only have a few red and blue bricks to start with. You try to guess the rest, but you often put red bricks in the blue bucket because you aren't sure.
  • The SAFE Way: You build a "feel" for the bricks. You look at a mystery brick and ask, "Does this feel like the red ones I know, or the blue ones?" If it feels exactly like a red one, you put it in the red bucket. If it feels like a blue one, you put it in the blue bucket. If it feels weird or like a mix of both, you put it in a "Maybe" box.
  • The Result: You end up with a much cleaner, more accurate sorting job, and you can teach other robots to do the job even faster using your new, perfectly sorted piles.

In short: SAFE is a smart, cautious, team-based AI system that teaches itself to find tiny eye diseases by learning from imperfect examples, filling in the missing details, and refusing to guess when it's unsure. This makes screening for blindness-causing diseases faster, cheaper, and more accurate.