Data-Centric Benchmark for Label Noise Estimation and Ranking in Remote Sensing Image Segmentation

This paper introduces a data-centric benchmark, a new public dataset, and two techniques that use model uncertainty, prediction consistency, and representation analysis to identify, quantify, and rank label noise in remote sensing image segmentation, outperforming existing baselines.

Keiller Nogueira, Codrut-Andrei Diaconu, Dávid Kerekes, Jakob Gawlikowski, Cédric Léonard, Nassim Ait Ali Braham, June Moh Goo, Zichao Zeng, Zhipeng Liu, Pallavi Jain, Andrea Nascetti, Ronny Hänsch

Published 2026-03-03

The Big Picture: The "Bad Recipe" Problem

Imagine you are a chef trying to teach a robot how to cook the perfect lasagna. You give the robot a cookbook (the dataset) and tell it, "Follow these steps exactly."

But there's a problem: The cookbook has typos.

  • Sometimes it says "add salt" when it should say "add sugar."
  • Sometimes the picture of the lasagna is blurry, and the robot can't tell where the cheese ends and the sauce begins.
  • Sometimes whole pages are missing.

In the world of remote sensing (taking pictures of Earth from space), this "cookbook" is a map of buildings and roads. Humans draw these maps to teach computers what a "building" looks like. But humans get tired, make mistakes, or use automated tools that aren't perfect. These mistakes are called Label Noise.

If you teach a robot using a messy cookbook, the robot will learn bad habits. It might think a tree is a house, or it might miss a whole street.

The Paper's Solution: The "Quality Inspector"

This paper introduces a new way to fix the problem. Instead of trying to fix the robot's brain (the AI model), the authors decided to fix the cookbook.

They created a Benchmark (a standardized test) and a new Dataset (a giant pile of images) to see which method is best at finding the "bad pages" in the cookbook.

Think of it like a Quality Control Inspector for a factory. Their job isn't to build the car; their job is to look at the pile of parts and say, "These 50 parts are perfect, but these 100 parts are bent and rusty. Let's throw the rusty ones away and only use the good ones."

How They Did It: The "Noise Injection" Game

To test their inspectors, the authors had to create a controlled disaster.

  1. The Clean Base: They started with a perfect, high-quality map of buildings in Louisiana and Germany (from the SpaceNet8 dataset).
  2. The Sabotage: They deliberately messed up the maps using seven different "tricks," including:
    • Shrinking/Expanding: Making buildings look too small or too big.
    • Rotating: Turning buildings sideways.
    • Deleting: Erasing parts of a building.
    • Fake Additions: Gluing a piece of a different building onto the map.
  3. The Challenge: They gave these "sabotaged" maps to different AI methods and asked: "Can you look at this messy map and tell me how bad the mistakes are? Rank them from 'Slightly Messy' to 'Total Disaster'."
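To make the sabotage concrete, here is a minimal sketch of what "controlled" label corruption can look like on a tiny binary building mask. This is illustrative only: the paper's actual seven corruption functions are not reproduced here, and the helper names (`dilate`, `erode`, `delete_region`) are invented for this example. It uses plain NumPy shifts instead of a proper morphology library.

```python
import numpy as np

def dilate(mask):
    """Grow buildings by one pixel ("expanding" noise) via a 4-neighbour OR."""
    out = mask.copy()
    out[1:, :] |= mask[:-1, :]; out[:-1, :] |= mask[1:, :]
    out[:, 1:] |= mask[:, :-1]; out[:, :-1] |= mask[:, 1:]
    return out

def erode(mask):
    """Shrink buildings by one pixel ("shrinking" noise) via a 4-neighbour AND."""
    out = mask.copy()
    out[1:, :] &= mask[:-1, :]; out[:-1, :] &= mask[1:, :]
    out[:, 1:] &= mask[:, :-1]; out[:, :-1] &= mask[:, 1:]
    return out

def delete_region(mask, y0, y1, x0, x1):
    """Erase part of the map ("deleting" noise)."""
    out = mask.copy()
    out[y0:y1, x0:x1] = False
    return out

# A toy 8x8 "clean" map with one 2x2 building
clean = np.zeros((8, 8), dtype=bool)
clean[3:5, 3:5] = True

expanded = dilate(clean)                      # building looks too big
shrunk   = erode(clean)                       # building looks too small (here it vanishes)
rotated  = np.rot90(clean)                    # map turned sideways
erased   = delete_region(clean, 3, 5, 4, 5)   # half the building missing
```

Each corrupted mask can then be compared against `clean` to compute a known, ground-truth noise level, which is what makes the ranking challenge possible.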

The Two Winning Strategies

Two teams (or methods) won the challenge. Here is how they thought:

1. The "Committee of Experts" (Augmented Ensemble)
Imagine you have 10 different art critics looking at a painting.

  • They all look at the image from slightly different angles (using data augmentation).
  • They all try to guess what the building should look like.
  • If the 10 critics all agree with the messy map, the map is probably okay.
  • If the critics are confused and disagree with the map, the map is probably full of errors.
  • The Score: The more the experts disagree with the map, the "noisier" (worse) the map is rated.
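The "committee of experts" idea can be sketched in a few lines: run the model on several augmented views of the image, average the predictions, and score the label by how much that average disagrees with it. This is a simplified sketch, not the paper's exact implementation: the function name `tta_noise_score`, the choice of flip augmentations, and the mean-absolute-disagreement score are all assumptions for illustration.

```python
import numpy as np

def tta_noise_score(predict, image, noisy_label):
    """Average predictions over flip augmentations, then measure disagreement
    with the given label. `predict(image)` should return per-pixel building
    probabilities in [0, 1]."""
    augs = [
        (lambda x: x,             lambda p: p),              # identity
        (lambda x: x[:, ::-1],    lambda p: p[:, ::-1]),     # horizontal flip
        (lambda x: x[::-1, :],    lambda p: p[::-1, :]),     # vertical flip
        (lambda x: x[::-1, ::-1], lambda p: p[::-1, ::-1]),  # both flips
    ]
    # Predict on each augmented view, undo the augmentation, then average
    probs = np.mean([undo(predict(aug(image))) for aug, undo in augs], axis=0)
    # 0 = the committee agrees with the map, 1 = total conflict
    return np.abs(probs - noisy_label).mean()

# Toy demo: treat the image itself as a perfect model's probability map
true_mask = np.zeros((8, 8)); true_mask[2:6, 2:6] = 1.0
perfect_model = lambda img: img  # stands in for a trained segmentation network

clean_score = tta_noise_score(perfect_model, true_mask, true_mask)
bad_label = true_mask.copy(); bad_label[2:6, 2:4] = 0.0  # half the building erased
noisy_score = tta_noise_score(perfect_model, true_mask, bad_label)
```

In the toy demo, a label identical to the prediction scores 0, while the corrupted label scores higher, which is exactly the ranking behaviour the benchmark asks for.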

2. The "Stress Test" (Regularized Variance)
This method is like a stress test for a bridge.

  • They train a group of 8 different AI models.
  • They ask all 8 models to predict what the building looks like.
  • If the models are confident and agree with each other, but the map says something different, the map is bad.
  • If the models are confused (high variance) and the map is also wrong, it's a double confirmation that the data is noisy.
  • The Score: They combine how wrong the prediction is with how confused the models are to give a final "Noise Score."

The Results: Why Less is More

The most surprising and useful finding of this paper is what happened when they actually used the robots to learn.

  • The Old Way: Train the robot on all the data, even the messy parts.
  • The New Way: Use the "Quality Inspector" to find the top 50% of the cleanest maps, throw away the bottom 50% (the noisy ones), and only train the robot on the good stuff.

The Result? The robot trained on less data actually performed better than the robot trained on all the data.

It's like studying for a test. If you read a textbook that has 50% typos and try to memorize everything, you will fail. But if you find a friend to help you identify the correct pages and you only study those, you will get an A, even though you studied fewer pages.
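The "new way" above reduces to a very small recipe: rank every training tile by its estimated noise score and keep only the cleanest half. The tile names and scores below are made up for illustration.

```python
def keep_cleanest(samples, noise_scores, keep_fraction=0.5):
    """Rank samples by estimated noise (lower = cleaner) and keep
    only the cleanest fraction for training."""
    order = sorted(range(len(samples)), key=lambda i: noise_scores[i])
    n_keep = int(len(samples) * keep_fraction)
    return [samples[i] for i in order[:n_keep]]

# Hypothetical per-tile noise scores from a "Quality Inspector"
tiles  = ["tile_a", "tile_b", "tile_c", "tile_d"]
scores = [0.10, 0.90, 0.05, 0.70]
train_set = keep_cleanest(tiles, scores)  # -> ["tile_c", "tile_a"]
```

The model is then trained only on `train_set`; the paper's finding is that this smaller, cleaner set beats training on everything.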

Why This Matters for the Real World

  1. Saving Money: Annotating (drawing) maps is expensive. If we can automatically find the bad drawings, we don't need to pay humans to re-draw them. We can just throw them away.
  2. Better AI: By filtering out the "garbage" data, the AI learns faster and makes fewer mistakes when looking at real satellite images.
  3. A New Standard: This paper provides a public "playground" (a dataset and a benchmark) so other scientists can test their own "Quality Inspectors" to see if they can do even better.

In a Nutshell

This paper says: "Don't just try to build a smarter robot; build a better filter for the data you feed it." By ranking images based on how "noisy" they are, we can throw away the bad data and build smarter, more reliable AI systems for mapping our world.