Multiscale Softmax Cross Entropy for Fovea Localization on Color Fundus Photography

The Big Picture: Finding the "Center of the Eye"

Imagine you are looking at a high-resolution photo of the back of someone's eye (called a fundus image). In the middle of this colorful, vein-filled landscape, there is a tiny, critical spot called the fovea. This is the "bullseye" of your vision—the place where you see things most clearly.

Doctors want computers to find this spot automatically, because its location helps in diagnosing diseases like glaucoma or macular degeneration. The problem? It's a needle in a haystack, and the computer needs to guess the exact X and Y coordinates of that needle.

The Old Way vs. The New Way

To teach a computer to find this spot, researchers usually use one of two methods. The authors of this paper decided to mix them up to get the best of both worlds.

1. The "Ruler" Method (Regression/MSE)

The Analogy: Imagine you are playing a game of "Hot and Cold." You ask the computer, "How far off is your guess?"
How it works: The penalty grows with distance. If the computer's guess is 10 pixels off, it gets a small penalty; if it is 1 pixel off, it gets a tiny one.
The Flaw: The penalty is too gentle. The computer thinks, "Oh, being 5 pixels off is almost as good as being 1 pixel off." It doesn't feel enough pressure to be perfect.

2. The "Multiple Choice" Method (Classification/Softmax)

The Analogy: Imagine a giant multiple-choice test where every single pixel on the screen is an answer option. The computer has to pick exactly one button to press.
How it works: If the computer picks the wrong button, it gets a massive penalty, no matter how close that button was to the right one.
The Flaw: It's too harsh. If the computer picks a button that is right next to the correct one, it gets punished just as hard as if it picked a button on the opposite side of the screen. It doesn't learn the nuance of "getting close."

The Solution: The "Zoom Lens" Approach (MSCE)

The authors, Yuli Wu and colleagues, created a new method called Multiscale Softmax Cross Entropy (MSCE).

Think of this like looking at a map through a set of zoom lenses:

Zoomed Out (Wide Angle): You look at the whole map. You can tell the fovea is in the "North" section. This is a coarse guess.
Zoomed In (Medium): You look closer. Now you know it's in the "North-East" corner.
Zoomed In (Close-up): You see the specific street.
Zoomed In (Macro): You see the exact house number.

How MSCE works:
Instead of just asking the computer to guess the final answer, the MSCE method asks the computer to make guesses at all these different zoom levels simultaneously.

It checks the "wide angle" guess.
It checks the "medium" guess.
It checks the "close-up" guess.

It then combines the penalties from all these levels. This teaches the computer:

"You were too far off in the wide-angle view (big penalty)."
"You were closer in the medium view (medium penalty)."
"You were almost right in the close-up view (small penalty)."

This creates a "smooth path" for the computer to follow, guiding it gently but firmly toward the exact center, rather than just punishing it randomly.

The Experiment: The "Eye" Test

The team tested this on the REFUGE2 dataset, which provides 1,200 eye images for training and 400 for testing. They compared their new "Zoom Lens" method against the old "Ruler" method and the old "Multiple Choice" method.

The Results:

The Ruler method (MSE) was okay, but not great.
The Multiple Choice method (Softmax) on its own was inconsistent and still struggled with precision.
The Zoom Lens method (MSCE) was the winner. It found the fovea more accurately than both previous methods.

Why Does This Matter?

Better Diagnosis: If a computer can find the center of the eye more accurately, it can better measure how diseases are spreading or how much damage has been done.
A New Tool for AI: This paper suggests that we can use "classification" tricks (usually used for sorting things into buckets) to solve "regression" problems (finding exact numbers/coordinates). It's like using a hammer to drive a screw because you found a special adapter that makes it work perfectly.

The "Oops" Moment

The paper admits it's not perfect yet. Sometimes, if the fovea is hidden in a dark corner or looks very similar to another part of the eye (like the optic disc), the computer still gets confused. The authors suggest that tweaking the "weights" of their zoom lenses (the math behind the scenes) may help reduce these errors.

Summary

The authors built a smarter way for computers to find the center of the eye. Instead of just guessing a number or picking a single button, they taught the computer to look at the image through multiple zoom levels at once. This helps the computer understand how close it is to the right answer, leading to more accurate fovea localization, which in turn can support medical diagnosis.

1. Problem Statement

The paper addresses the task of fovea localization in color fundus photography, a critical step in the computer-aided diagnosis of retinal diseases. The fovea (fovea centralis) is the anatomical center of the macula lutea.

The Challenge: Traditional approaches treat coordinate prediction as a regression problem, typically using Mean Squared Error (MSE) or Mean Absolute Error (MAE) losses.
The Limitation: Regression losses (MSE/MAE) punish incorrect predictions that are close to the ground truth less severely than those far away. In contrast, standard classification losses (like Softmax Cross Entropy) treat all incorrect predictions equally, regardless of their proximity to the true label. This creates a "functional gap" where regression lacks the discriminative power of classification, and classification lacks the nuance of spatial proximity.
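The contrast between the two loss families can be illustrated with a small sketch. It is not taken from the paper: the function names, the 256-class axis, the ground-truth index of 128, and the "peaked" logit vector (all zeros except one confident guess) are assumptions chosen only to make the behaviour visible.

```python
import math

def mse_penalty(pred_index, true_index):
    """Squared distance between predicted and true coordinate."""
    return float((pred_index - true_index) ** 2)

def sce_penalty(logits, true_index):
    """Softmax cross entropy for a one-hot target, via a stable log-sum-exp."""
    peak = max(logits)
    log_z = peak + math.log(sum(math.exp(s - peak) for s in logits))
    return log_z - logits[true_index]

def peaked_logits(num_classes, peak_index, peak_value=5.0):
    """Hypothetical network output: one confident guess, zeros elsewhere."""
    logits = [0.0] * num_classes
    logits[peak_index] = peak_value
    return logits

NUM_CLASSES = 256   # assumed number of positions along one axis
TRUE_INDEX = 128    # assumed ground-truth position

for guess in (129, 133, 250):
    mse = mse_penalty(guess, TRUE_INDEX)
    sce = sce_penalty(peaked_logits(NUM_CLASSES, guess), TRUE_INDEX)
    print(f"guess={guess:3d}  MSE={mse:8.1f}  SCE={sce:.4f}")
```

Under these assumptions the MSE column grows with the offset, while the SCE column stays the same for every wrong guess, which is the gap the section describes.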

2. Methodology

The authors propose a novel approach that reframes the coordinate regression problem as a classification task and introduces a new loss function to bridge the gap between regression and probabilistic losses.

A. Problem Reformulation

Instead of directly regressing continuous $x$ and $y$ coordinates, the authors treat the $x$-axis and $y$-axis coordinates as separate classification targets. The image space is discretized into classes (e.g., a 256-dimensional vector representing possible coordinate positions).
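A minimal sketch of this discretization follows. The helper names, the 1024-pixel source width, and the floor-based binning rule are illustrative assumptions, not details confirmed by the paper.

```python
def coordinate_to_class(coord, image_size, num_classes):
    """Map a pixel coordinate in [0, image_size) to a class index (assumed floor binning)."""
    index = int(coord * num_classes / image_size)
    # Clamp so that coordinates on the image border stay in range.
    return min(max(index, 0), num_classes - 1)

def one_hot(index, num_classes):
    """Build the classification target for one axis."""
    target = [0.0] * num_classes
    target[index] = 1.0
    return target

# Hypothetical fovea annotation on a 1024-pixel-wide image, mapped to 256 classes.
x_class = coordinate_to_class(512.7, image_size=1024, num_classes=256)
x_target = one_hot(x_class, 256)
print(x_class, sum(x_target))
```

The same two helpers would be applied independently to the $y$ coordinate, giving one target vector per axis.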

B. Network Architecture

Backbone: The model utilizes a modified U-Net architecture (specifically the Cellpose network), which includes residual connections within convolutional blocks and a style vector fused into the upsampling pathway.
Input: Color fundus images resized to $256 \times 256$.
Feature Extraction: The network outputs a feature map of the same size as the input. This map is pooled multiple times to generate multiscale branches.
Reduction: Each branch is reduced per axis (via summation) to produce logit vectors for the $x$ and $y$ coordinates.
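The pooling and per-axis reduction steps above can be sketched with NumPy as follows. This is a hedged illustration only: the function names, the 2x2 max-pooling window, the choice to include the unpooled map as the first branch, and the random stand-in feature map are all assumptions.

```python
import numpy as np

def max_pool_2x2(feature_map):
    """Halve height and width by taking the max of each 2x2 block."""
    height, width = feature_map.shape
    blocks = feature_map.reshape(height // 2, 2, width // 2, 2)
    return blocks.max(axis=(1, 3))

def axis_logits(feature_map):
    """Sum-reduce a 2-D map into one logit vector per axis."""
    x_logits = feature_map.sum(axis=0)  # collapse rows: one value per column
    y_logits = feature_map.sum(axis=1)  # collapse columns: one value per row
    return x_logits, y_logits

def multiscale_logits(feature_map, num_scales):
    """Collect (x_logits, y_logits) at successively pooled resolutions."""
    branches = []
    current = feature_map
    for _ in range(num_scales):
        branches.append(axis_logits(current))
        current = max_pool_2x2(current)  # assumes power-of-two side lengths
    return branches

# Stand-in for the network's 256x256 output feature map.
rng = np.random.default_rng(0)
feature_map = rng.standard_normal((256, 256))
branches = multiscale_logits(feature_map, num_scales=8)
print([len(x) for x, _ in branches])
```

Each branch then yields a pair of logit vectors whose length matches the number of classes at that scale.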

C. The Core Innovation: Multiscale Softmax Cross Entropy (MSCE)

The paper introduces MSCE to combine the benefits of regression (distance sensitivity) and classification (probabilistic confidence).

Mechanism: MSCE calculates a weighted summation of Softmax Cross Entropy (SCE) losses across multiple scales (downsampled feature maps).
Formula:
$\mathrm{MSCE} = \sum_{m=1}^{M} \lambda_m \left( - \sum_{i=1}^{C_m} t_i^{(m)} \log \frac{e^{s_i^{(m)}}}{\sum_{j=1}^{C_m} e^{s_j^{(m)}}} \right)$
Where $M$ is the number of scales, $C_m$ is the number of classes at scale $m$, $\lambda_m$ are the per-scale weights (set to 1 in this work), $s^{(m)}$ are the predicted logits at scale $m$, and $t^{(m)}$ are the corresponding ground-truth labels.
Rationale:
- MSE gradually attracts wrong predictions to the ground truth.
- Standard SCE rejects all wrong predictions equally.
- MSCE aims to balance these traits: it distinguishes predictions in a "stepwise regressive manner" (via the multiscale hierarchy, a prediction close to the ground truth tends to agree with it at more of the coarse scales and so accumulates less penalty) while strongly encouraging convergence to the single ground truth. Theoretically, setting the number of scales $M$ to the maximum (e.g., $M=8$ for 256 classes, since $256 = 2^8$) best approximates the desired behavior.
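The weighted sum of per-scale losses can be sketched in plain Python for a single axis. Several things here are assumptions made for illustration: the function names, the rule that the target class at a coarser scale is the fine index integer-divided by the downsampling factor, the toy sizes (8, 4, 2), and the hypothetical `peaked_scales` helper that fakes a confident network output.

```python
import math

def softmax_cross_entropy(logits, true_index):
    """SCE for a one-hot target, computed with a stable log-sum-exp."""
    peak = max(logits)
    log_z = peak + math.log(sum(math.exp(s - peak) for s in logits))
    return log_z - logits[true_index]

def msce(logits_per_scale, true_index, weights=None):
    """Weighted sum of SCE over scales; weights default to 1 as in the text."""
    if weights is None:
        weights = [1.0] * len(logits_per_scale)
    full_size = len(logits_per_scale[0])
    total = 0.0
    for weight, logits in zip(weights, logits_per_scale):
        factor = full_size // len(logits)  # downsampling factor of this scale
        total += weight * softmax_cross_entropy(logits, true_index // factor)
    return total

def peaked_scales(guess, sizes=(8, 4, 2), peak_value=5.0):
    """Hypothetical logits: one confident guess, projected onto every scale."""
    scales = []
    for size in sizes:
        logits = [0.0] * size
        logits[guess // (sizes[0] // size)] = peak_value
        scales.append(logits)
    return scales

TRUE_INDEX = 5
for guess in (5, 4, 0):
    scales = peaked_scales(guess)
    single = softmax_cross_entropy(scales[0], TRUE_INDEX)
    print(f"guess={guess}  SCE={single:.3f}  MSCE={msce(scales, TRUE_INDEX):.3f}")
```

In this toy setup the single-scale SCE is identical for the near miss (4) and the far miss (0), whereas MSCE is lower for the near miss because it already matches the target at the two coarser scales. With this binning rule, two neighbouring positions that straddle a bin boundary would not get that benefit, so the sketch should be read as showing the general tendency only.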

3. Key Contributions

Reframing Localization: Successfully treating coordinate regression as a classification problem using Softmax Cross Entropy.
MSCE Loss Function: Proposing a Multiscale Softmax Cross Entropy loss that leverages feature maps at different resolutions to provide a more robust gradient signal than vanilla SCE or standard MSE.
Empirical Validation: Demonstrating that probabilistic losses can outperform traditional regression losses in coordinate localization tasks when combined with multiscale feature aggregation.

4. Experimental Results

Dataset: REFUGE2, containing 1200 training images and 400 test images.
Metric: Reciprocal of the Average Euclidean Distance (R-AED), defined as $1 / (d(p, q) + 0.1)$, where $d$ is the Euclidean distance between ground truth $p$ and prediction $q$. Higher is better.
Key Findings (Ablation Study):
- Pooling Strategy: Using MaxPooling with sum reduction significantly outperformed AveragePooling with mean reduction.
- Loss Comparison:
  - MSE (Baseline): Achieved an R-AED of ~5.18–5.69 depending on settings.
  - Vanilla SCE: Scored an R-AED of ~3.45–4.99, which on this higher-is-better metric is below the MSE range, and was unstable in some configurations.
  - MSCE (Proposed): Achieved the best performance in specific configurations, reaching an R-AED of 6.12 (with MaxPooling/sum, batch size 8).
- Visual Analysis: Visualizations showed that MSCE predictions (white crosses) had smaller offsets from the ground truth compared to MSE (blue) and vanilla SCE (green). MSCE was particularly effective in avoiding large offsets, though it still struggled when the fovea was far from the center and blended into dark marginal areas.
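For reference, the R-AED metric used in these results can be computed as in the sketch below. The function name and the sample coordinates are hypothetical, and averaging the distance over all test images before taking the reciprocal is an assumption based on the metric's name.

```python
import math

def r_aed(ground_truth, predictions, offset=0.1):
    """Reciprocal of the average Euclidean distance, plus a small offset."""
    distances = [
        math.hypot(px - qx, py - qy)
        for (px, py), (qx, qy) in zip(ground_truth, predictions)
    ]
    average_distance = sum(distances) / len(distances)
    return 1.0 / (average_distance + offset)

# Hypothetical ground-truth and predicted fovea positions for two images.
truth = [(100.0, 120.0), (90.0, 130.0)]
preds = [(103.0, 124.0), (90.0, 130.0)]
print(round(r_aed(truth, preds), 4))
```

The offset of 0.1 keeps the score finite for a perfect prediction, in which case the value would be 10.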

5. Significance and Future Work

Novel Approach: The paper provides a novel alternative for coordinate regression tasks (e.g., bounding box detection, keypoint localization) by utilizing classification-based losses with multiscale features.
Clinical Relevance: Accurate fovea localization is vital for diagnosing retinal diseases. The method offers a more robust tool for Computer-Aided Diagnosis (CAD).
Future Directions:
- Hyperparameter Tuning: Fine-tuning the weights ($\lambda_m$) in the MSCE formula to stabilize predictions.
- Anatomical Fusion: Integrating optic disc segmentation to leverage the known geometric relationship between the optic disc and the fovea.
- Multi-task Learning: Combining fovea localization with other ophthalmic tasks like vessel segmentation, optic cup segmentation, and disease grading (e.g., glaucoma).

In conclusion, the authors demonstrate that by treating coordinate localization as a multiscale classification problem, they can achieve superior accuracy compared to traditional regression methods, offering a promising new direction for medical image analysis.