Ordinal Diffusion Models for Color Fundus Images

This paper proposes an ordinal latent diffusion model that leverages the continuous, ordered nature of diabetic retinopathy severity to generate more realistic and clinically consistent color fundus images compared to standard categorical conditional diffusion models.

Gustav Schmidt, Philipp Berens, Sarah Müller

Published 2026-03-02

Imagine you are an artist trying to teach a robot how to paint pictures of the human eye, specifically to show how a disease called Diabetic Retinopathy gets worse over time.

In the real world, this disease doesn't just jump from "healthy" to "sick" in giant leaps. It's a slow, continuous slide. A healthy eye slowly develops tiny spots, then more spots, then bleeding, and finally, new, messy blood vessels.

However, doctors usually label these eyes with simple, separate categories, like steps on a ladder:

  • Step 0: Healthy
  • Step 1: Mild
  • Step 2: Moderate
  • Step 3: Severe
  • Step 4: Proliferative (Very bad)

The Problem with Old AI

Previous AI models treated these steps like completely different languages. If you asked the AI to draw a "Step 1" eye, it learned that as a totally separate concept from a "Step 2" eye. It didn't understand that Step 2 is just Step 1 with a little bit more damage. It was like teaching a child that "one apple" and "two apples" are unrelated concepts, rather than just adding one more apple to the pile.

Because of this, when these old AIs tried to draw the disease getting worse, the changes were often jumpy, unrealistic, or just plain wrong.

The New Solution: The "Ordinal" Diffusion Model

The researchers in this paper built a smarter AI called an Ordinal Diffusion Model. Here is how they did it, using some simple analogies:

1. The Volume Knob instead of Buttons

Instead of giving the AI a set of buttons labeled "0, 1, 2, 3, 4," they gave it a volume knob (a slider).

  • They told the AI: "Turn the knob to 0 for a healthy eye. Turn it to 1 for mild, 2 for moderate, and so on."
  • Because the knob moves smoothly, the AI learned that moving from 1 to 2 is just a small adjustment, not a total reboot. This allows the AI to generate images that show a smooth transition of the disease, just like it happens in real life.
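The "volume knob" idea can be sketched in a few lines. This is a hypothetical illustration, not the paper's actual code: the names `categorical_embed` and `ordinal_embed` are made up, and the point is only the contrast between a lookup table of unrelated vectors and a smooth function of one scalar.

```python
import torch
import torch.nn as nn

# "Buttons": each grade gets its own learned vector, with no built-in
# relationship between grade 1 and grade 2.
categorical_embed = nn.Embedding(num_embeddings=5, embedding_dim=64)

# "Volume knob": the grade enters as a single continuous number, so
# nearby grades produce nearby conditioning vectors by construction.
ordinal_embed = nn.Sequential(
    nn.Linear(1, 64),
    nn.SiLU(),
    nn.Linear(64, 64),
)

grade = torch.tensor([[2.0]])      # "moderate"
halfway = torch.tensor([[1.5]])    # between mild and moderate
cond_a = ordinal_embed(grade)      # shape (1, 64)
cond_b = ordinal_embed(halfway)    # a small, smooth step away from cond_a
```

Because `ordinal_embed` is a continuous function, it can be evaluated at in-between values like 1.5, which a 5-button embedding table simply cannot represent.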

2. The "Skeleton" and the "Skin"

The researchers realized that to make a realistic eye, you need two things:

  • The Skeleton: The unique shape of the blood vessels and the optic disc (the bright spot where the optic nerve exits the eye). This stays the same for a given person, even if they get sick.
  • The Skin: The disease symptoms (the spots, bleeding, etc.).

They taught the AI to separate these. Imagine a mannequin (the skeleton) that stays exactly the same, while a makeup artist (the disease generator) slowly adds more and more "bruises" and "rashes" as you turn the volume knob up. This ensures that the AI doesn't accidentally change the person's eye shape while trying to make them look sicker.
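The mannequin-and-makeup split can be sketched as a generator that takes two separate inputs: a structure code that is held fixed for one eye, and a severity scalar that is free to change. This is a minimal toy sketch under that assumption; `DiseaseRenderer` and its dimensions are illustrative, not the paper's architecture.

```python
import torch
import torch.nn as nn

class DiseaseRenderer(nn.Module):
    """Toy conditional generator: fixed anatomy code + severity knob."""

    def __init__(self, structure_dim=128, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(structure_dim + 1, hidden),
            nn.SiLU(),
            nn.Linear(hidden, structure_dim),
        )

    def forward(self, structure_code, severity):
        # The anatomy code and the disease knob are concatenated, so
        # turning the knob never has to rewrite the anatomy input.
        return self.net(torch.cat([structure_code, severity], dim=-1))

renderer = DiseaseRenderer()
anatomy = torch.randn(1, 128)                       # one patient's "skeleton"
mild = renderer(anatomy, torch.tensor([[1.0]]))     # light "makeup"
severe = renderer(anatomy, torch.tensor([[3.0]]))   # same eye, more disease
```

The design choice is that identity lives in one input and disease in the other, so sweeping the severity value cannot accidentally swap in a different patient's vessels.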

What Did They Find?

They tested this new AI on thousands of real eye photos. Here is what happened:

  • Better Art: The pictures the AI made looked much more realistic than before. If you looked at them, you could clearly see the disease getting worse step-by-step.
  • Smarter Transitions: When they asked the AI to draw an eye that was "halfway" between Mild and Moderate, it didn't just pick one or the other. It drew an eye with a mix of symptoms, perfectly capturing the messy reality of disease progression.
  • The "Time Travel" Effect: They took a photo of a healthy eye and asked the AI to "turn up the disease knob" to show what that specific person's eye would look like if they got worse. The AI kept the person's unique eye structure but added the correct amount of damage. It was like a "what if" simulator for disease.
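The "time travel" trick above amounts to holding one eye's structure fixed and sweeping the severity knob. Here is a deliberately tiny stand-in, with `generate` as a placeholder for the model's conditional sampler (not the real API):

```python
def generate(structure_code, severity):
    # Placeholder for the conditional sampler: adds a disease signal
    # whose strength grows with the severity knob.
    return [v + 0.1 * severity for v in structure_code]

anatomy = [0.2, -0.5, 1.1]   # toy "structure code" of one healthy eye
progression = [generate(anatomy, s) for s in range(5)]  # knob 0 -> 4
```

Every frame in `progression` comes from the same `anatomy`, so the sweep shows one person's eye getting sicker rather than five different eyes.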

Why Does This Matter?

In medicine, we often don't have enough photos of people with severe diseases or from certain ethnic groups to train our AI doctors. This new model acts like a photocopier for reality. It can create thousands of new, realistic, and medically accurate eye images to fill in the gaps.

By understanding that disease is a spectrum (a continuous slide) rather than a list of separate boxes, this AI helps doctors train better diagnostic tools, potentially leading to earlier detection and better care for patients with diabetic retinopathy.

In short: They taught the AI to stop thinking in "steps" and start thinking in "slides," resulting in a much smarter, more realistic medical artist.
