Data-Centric Benchmark for Label Noise Estimation and Ranking in Remote Sensing Image Segmentation

This paper introduces a data-centric benchmark, a new public dataset, and two techniques that use model uncertainty, prediction consistency, and representation analysis to identify, quantify, and rank label noise in remote sensing image segmentation, outperforming existing baselines.

Keiller Nogueira, Codrut-Andrei Diaconu, Dávid Kerekes, Jakob Gawlikowski, Cédric Léonard, Nassim Ait Ali Braham, June Moh Goo, Zichao Zeng, Zhipeng Liu, Pallavi Jain, Andrea Nascetti, Ronny Hänsch

Published 2026-03-03

The Big Picture: The "Bad Recipe" Problem

Imagine you are a chef trying to teach a robot how to cook the perfect lasagna. You give the robot a cookbook (the dataset) and tell it, "Follow these steps exactly."

But there's a problem: The cookbook has typos.

  • Sometimes it says "add salt" when it should say "add sugar."
  • Sometimes the picture of the lasagna is blurry, and the robot can't tell where the cheese ends and the sauce begins.
  • Sometimes whole pages are missing.

In the world of remote sensing (taking pictures of Earth from space), this "cookbook" is a map of buildings and roads. Humans draw these maps to teach computers what a "building" looks like. But humans get tired, make mistakes, or use automated tools that aren't perfect. These mistakes are called Label Noise.

If you teach a robot using a messy cookbook, the robot will learn bad habits. It might think a tree is a house, or it might miss a whole street.

The Paper's Solution: The "Quality Inspector"

This paper introduces a new way to fix the problem. Instead of trying to fix the robot's brain (the AI model), the authors decided to fix the cookbook.

They created a Benchmark (a standardized test) and a new Dataset (a giant pile of images) to see which method is best at finding the "bad pages" in the cookbook.

Think of it like a Quality Control Inspector for a factory. Their job isn't to build the car; their job is to look at the pile of parts and say, "These 50 parts are perfect, but these 100 parts are bent and rusty. Let's throw the rusty ones away and only use the good ones."

How They Did It: The "Noise Injection" Game

To test their inspectors, the authors had to create a controlled disaster.

  1. The Clean Base: They started with a perfect, high-quality map of buildings in Louisiana and Germany (from the SpaceNet8 dataset).
  2. The Sabotage: They deliberately messed up the maps using seven different "tricks," including:
    • Shrinking/Expanding: Making buildings look too small or too big.
    • Rotating: Turning buildings sideways.
    • Deleting: Erasing parts of a building.
    • Fake Additions: Gluing a piece of a different building onto the map.
  3. The Challenge: They gave these "sabotaged" maps to different AI methods and asked: "Can you look at this messy map and tell me how bad the mistakes are? Rank them from 'Slightly Messy' to 'Total Disaster'."
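To make the sabotage concrete, here is a minimal sketch of what "controlled" label corruption can look like on a tiny binary building mask. This is illustrative only: the paper's actual seven corruption functions are not reproduced here, and the helper names (`dilate`, `erode`, `delete_region`) are invented for this example. It uses plain NumPy shifts instead of a proper morphology library.

```python
import numpy as np

def dilate(mask):
    """Grow buildings by one pixel ("expanding" noise) via a 4-neighbour OR."""
    out = mask.copy()
    out[1:, :] |= mask[:-1, :]; out[:-1, :] |= mask[1:, :]
    out[:, 1:] |= mask[:, :-1]; out[:, :-1] |= mask[:, 1:]
    return out

def erode(mask):
    """Shrink buildings by one pixel ("shrinking" noise) via a 4-neighbour AND."""
    out = mask.copy()
    out[1:, :] &= mask[:-1, :]; out[:-1, :] &= mask[1:, :]
    out[:, 1:] &= mask[:, :-1]; out[:, :-1] &= mask[:, 1:]
    return out

def delete_region(mask, y0, y1, x0, x1):
    """Erase part of the map ("deleting" noise)."""
    out = mask.copy()
    out[y0:y1, x0:x1] = False
    return out

# A toy 8x8 "clean" map with one 2x2 building
clean = np.zeros((8, 8), dtype=bool)
clean[3:5, 3:5] = True

expanded = dilate(clean)                      # building looks too big
shrunk   = erode(clean)                       # building looks too small (here it vanishes)
rotated  = np.rot90(clean)                    # map turned sideways
erased   = delete_region(clean, 3, 5, 4, 5)   # half the building missing
```

Each corrupted mask can then be compared against `clean` to compute a known, ground-truth noise level, which is what makes the ranking challenge possible.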

The Two Winning Strategies

Two teams (or methods) won the challenge. Here is how they thought:

1. The "Committee of Experts" (Augmented Ensemble)
Imagine you have 10 different art critics looking at a painting.

  • They all look at the image from slightly different angles (using data augmentation).
  • They all try to guess what the building should look like.
  • If the 10 critics all agree with the messy map, the map is probably okay.
  • If the critics are confused and disagree with the map, the map is probably full of errors.
  • The Score: The more the experts disagree with the map, the "noisier" (worse) the map is rated.
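The "committee of experts" idea can be sketched in a few lines: run the model on several augmented views of the image, average the predictions, and score the label by how much that average disagrees with it. This is a simplified sketch, not the paper's exact implementation: the function name `tta_noise_score`, the choice of flip augmentations, and the mean-absolute-disagreement score are all assumptions for illustration.

```python
import numpy as np

def tta_noise_score(predict, image, noisy_label):
    """Average predictions over flip augmentations, then measure disagreement
    with the given label. `predict(image)` should return per-pixel building
    probabilities in [0, 1]."""
    augs = [
        (lambda x: x,             lambda p: p),              # identity
        (lambda x: x[:, ::-1],    lambda p: p[:, ::-1]),     # horizontal flip
        (lambda x: x[::-1, :],    lambda p: p[::-1, :]),     # vertical flip
        (lambda x: x[::-1, ::-1], lambda p: p[::-1, ::-1]),  # both flips
    ]
    # Predict on each augmented view, undo the augmentation, then average
    probs = np.mean([undo(predict(aug(image))) for aug, undo in augs], axis=0)
    # 0 = the committee agrees with the map, 1 = total conflict
    return np.abs(probs - noisy_label).mean()

# Toy demo: treat the image itself as a perfect model's probability map
true_mask = np.zeros((8, 8)); true_mask[2:6, 2:6] = 1.0
perfect_model = lambda img: img  # stands in for a trained segmentation network

clean_score = tta_noise_score(perfect_model, true_mask, true_mask)
bad_label = true_mask.copy(); bad_label[2:6, 2:4] = 0.0  # half the building erased
noisy_score = tta_noise_score(perfect_model, true_mask, bad_label)
```

In the toy demo, a label identical to the prediction scores 0, while the corrupted label scores higher, which is exactly the ranking behaviour the benchmark asks for.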

2. The "Stress Test" (Regularized Variance)
This method is like a stress test for a bridge.

  • They train a group of 8 different AI models.
  • They ask all 8 models to predict what the building looks like.
  • If the models are confident and agree with each other, but the map says something different, the map is bad.
  • If the models are confused (high variance) and the map is also wrong, it's a double confirmation that the data is noisy.
  • The Score: They combine how wrong the prediction is with how confused the models are to give a final "Noise Score."

The Results: Why Less is More

The most surprising and useful finding of this paper is what happened when they actually used the robots to learn.

  • The Old Way: Train the robot on all the data, even the messy parts.
  • The New Way: Use the "Quality Inspector" to find the top 50% of the cleanest maps, throw away the bottom 50% (the noisy ones), and only train the robot on the good stuff.

The Result? The robot trained on less data actually performed better than the robot trained on all the data.

It's like studying for a test. If you read a textbook that has 50% typos and try to memorize everything, you will fail. But if you find a friend to help you identify the correct pages and you only study those, you will get an A, even though you studied fewer pages.
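The "new way" above reduces to a very small recipe: rank every training tile by its estimated noise score and keep only the cleanest half. The tile names and scores below are made up for illustration.

```python
def keep_cleanest(samples, noise_scores, keep_fraction=0.5):
    """Rank samples by estimated noise (lower = cleaner) and keep
    only the cleanest fraction for training."""
    order = sorted(range(len(samples)), key=lambda i: noise_scores[i])
    n_keep = int(len(samples) * keep_fraction)
    return [samples[i] for i in order[:n_keep]]

# Hypothetical per-tile noise scores from a "Quality Inspector"
tiles  = ["tile_a", "tile_b", "tile_c", "tile_d"]
scores = [0.10, 0.90, 0.05, 0.70]
train_set = keep_cleanest(tiles, scores)  # -> ["tile_c", "tile_a"]
```

The model is then trained only on `train_set`; the paper's finding is that this smaller, cleaner set beats training on everything.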

Why This Matters for the Real World

  1. Saving Money: Annotating (drawing) maps is expensive. If we can automatically find the bad drawings, we don't need to pay humans to re-draw them. We can just throw them away.
  2. Better AI: By filtering out the "garbage" data, the AI learns faster and makes fewer mistakes when looking at real satellite images.
  3. A New Standard: This paper provides a public "playground" (a dataset and a benchmark) so other scientists can test their own "Quality Inspectors" to see if they can do even better.

In a Nutshell

This paper says: "Don't just try to build a smarter robot; build a better filter for the data you feed it." By ranking images based on how "noisy" they are, we can throw away the bad data and build smarter, more reliable AI systems for mapping our world.