Managing Cognitive Bias in Human Labeling Operations for Rare-Event AI: Evidence from a Field Experiment

Through a field experiment on a medical crowdsourcing platform, this paper demonstrates that balancing feedback prevalence, eliciting probabilistic judgments, and applying a linear-in-log-odds recalibration effectively mitigate cognitive biases in human labeling of rare events, significantly improving the reliability of downstream AI models.

Gunnar P. Epping, Andrew Caplin, Erik Duhaime, William R. Holmes, Daniel Martin, Jennifer S. Trueblood

Published Fri, 13 Ma

Imagine you are the manager of a massive security checkpoint at an airport. Your job is to spot a very specific, dangerous item hidden in luggage. The problem? This dangerous item appears in only 1 out of every 5 bags (20%). The other 4 bags are perfectly safe.

This is the challenge of "Rare-Event AI." Whether it's spotting a rare disease in a blood sample, finding fraud in a bank transaction, or detecting a defect in a factory, the "bad" things are rare, and the "good" things are common.

This paper explores a hidden trap that happens when humans (or the AI models they train) try to find these rare items. It turns out, our brains have a sneaky bias called the "Prevalence Effect."

Here is the story of how the researchers fixed it, explained simply.

The Trap: The "Bored Guard" Effect

When a security guard sees 80% safe bags and only 20% dangerous ones, their brain starts to get lazy. They think, "Most bags are safe, so I'll just assume this one is safe too."

  • The Result: They miss the dangerous bags (False Negatives) way too often.
  • The AI Problem: If you hire 100 guards to label these bags, and they all get bored and say "Safe," you don't get 100 correct answers. You get 100 wrong answers that all agree with each other. When you feed these wrong labels into a computer program (AI), the AI learns to be lazy too. It thinks, "Oh, everything is safe," and misses the real dangers.
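A short simulation makes the compounding effect concrete. The hit rate, false-alarm rate, and guard counts below are invented for illustration, not figures from the paper:

```python
import random

random.seed(0)

def lazy_guard(is_dangerous, hit_rate=0.35, false_alarm_rate=0.02):
    """One prevalence-biased labeler: flags a dangerous bag only 35% of the
    time, because 'most bags are safe anyway'. Rates are illustrative."""
    p_flag = hit_rate if is_dangerous else false_alarm_rate
    return random.random() < p_flag

def majority_vote(is_dangerous, n_guards=100):
    """Aggregate 100 guards by simple majority vote."""
    flags = sum(lazy_guard(is_dangerous) for _ in range(n_guards))
    return flags > n_guards // 2

# Every trial is a genuinely dangerous bag; count how often the crowd misses it.
trials = 200
misses = sum(not majority_vote(is_dangerous=True) for _ in range(trials))
print(f"majority of 100 guards missed {misses}/{trials} dangerous bags")
```

Because every guard shares the same bias, averaging them does not cancel the error: the crowd misses nearly every dangerous bag, no matter how many guards vote.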

The Experiment: A Medical "Game Show"

The researchers ran a real-world experiment using a platform called DiagnosUs, where people play a game to identify cancerous cells (called "blasts") in blood images.

They wanted to see if they could "hack" the human brain to make it better at spotting the rare cells. They tested three different tricks:

Trick 1: The "Balanced Training" (Changing the Feedback)

Imagine the game show has two types of questions:

  1. Real Questions: The actual blood images the players need to label (20% are cancerous).
  2. Practice Questions: Images the players see only to get feedback on whether they are right or wrong.
  • The Old Way: The practice questions were also 20% cancerous. The players got bored and started guessing "No cancer" all the time.
  • The New Way: The researchers made the practice questions 50% cancerous.
  • The Analogy: It's like a coach telling a basketball player, "In the real game, you only shoot free throws 20% of the time. But in practice, we are going to make you shoot free throws 50% of the time so you don't get lazy."
  • The Result: The players stayed alert. They stopped guessing "No" so often and started catching more of the rare cancer cells.
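One way to implement this is to keep two item pools and rebalance only the feedback pool. The pool names, the 50/50 draw, and the share of feedback trials below are assumptions for illustration; the paper's platform interleaves real and feedback images in its own way:

```python
import random

random.seed(42)

# Real items follow the true, rare prevalence (about 20% positive);
# feedback items are deliberately rebalanced to 50% positive.
real_items = ["blast"] * 20 + ["healthy"] * 80
feedback_pos = ["fb_blast_%d" % i for i in range(50)]
feedback_neg = ["fb_healthy_%d" % i for i in range(50)]

def build_trial_stream(n_trials=300, feedback_share=0.3):
    """Interleave unscored real trials with scored feedback trials drawn 50/50."""
    stream = []
    for _ in range(n_trials):
        if random.random() < feedback_share:
            pool = feedback_pos if random.random() < 0.5 else feedback_neg
            stream.append(("feedback", random.choice(pool)))
        else:
            stream.append(("real", random.choice(real_items)))
    return stream

stream = build_trial_stream()
fb = [item for kind, item in stream if kind == "feedback"]
frac_pos = sum(item.startswith("fb_blast") for item in fb) / len(fb)
print(f"feedback trials: {len(fb)}, positive fraction: {frac_pos:.2f}")
```

Players still label items at the true 20% prevalence, but the trials that carry right/wrong feedback arrive roughly half positive, which is what keeps the "shoot more free throws in practice" pressure on.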

Trick 2: The "Maybe" Button (Asking for Probabilities)

Instead of asking players to just click "Yes" or "No," the researchers asked them to slide a bar and say, "I'm 70% sure this is cancer" or "I'm 10% sure."

  • The Analogy: Asking for a "Yes/No" is like asking a weather forecaster, "Will it rain?" (Yes/No). Asking for probability is asking, "What is the chance of rain?" (30%, 80%, 99%).
  • The Result: Even without changing the game rules, asking for a "confidence score" helped. It gave the system more information. When the crowd's "Maybe" votes were averaged, they were much better at spotting the rare cells than simple "Yes/No" votes.
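A toy example shows why soft votes carry more information than thresholded ones. The confidence numbers are invented for illustration:

```python
import statistics

# Five labelers score one true blast and one healthy cell. All of them hedge
# below 50% on the blast, so a hard yes/no vote wipes out the signal.
blast_scores   = [0.45, 0.40, 0.48, 0.35, 0.47]  # a real (rare) blast
healthy_scores = [0.08, 0.12, 0.05, 0.10, 0.09]  # an ordinary cell

def hard_votes(scores, threshold=0.5):
    """Collapse each confidence to yes/no before counting."""
    return sum(s >= threshold for s in scores)

# Hard voting: both images get zero "yes" votes -- indistinguishable.
print(hard_votes(blast_scores), hard_votes(healthy_scores))

# Soft voting: the averaged confidences differ several-fold -- clearly separable.
print(statistics.mean(blast_scores), statistics.mean(healthy_scores))
```

The binary votes are identical for both images, while the averaged confidences (0.43 vs. roughly 0.09) still separate the rare positive from the common negative.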

Trick 3: The "Math Fix" (Recalibration)

Even with the best players, humans still make mistakes. Sometimes they are too confident, sometimes too shy. The researchers added a final step: Recalibration.

  • The Analogy: Imagine a thermometer that always reads 5 degrees too cold. You don't throw the thermometer away; you just add a sticker that says, "Add 5 degrees to whatever it says."
  • The Result: They used a mathematical formula to look at the players' answers and the known "practice" answers, then adjusted the final scores. This "sticker" fixed the systematic bias. It turned a "Maybe" that was actually a "Yes" into a clear "Yes."
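The abstract names the specific "sticker": a linear-in-log-odds (LLO) transform, which stretches and shifts reported probabilities in log-odds space. Below is a minimal sketch that fits the two LLO parameters by grid search on practice items with known answers; the practice scores and the grid bounds are invented for illustration, and the paper's fitting procedure may differ:

```python
import math

def logit(p):
    return math.log(p / (1 - p))

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

def llo(p, gamma, delta):
    """Linear-in-log-odds recalibration: scale (gamma) and shift (delta)
    the reported probability in log-odds space."""
    return sigmoid(gamma * logit(p) + delta)

# Practice items with known answers: reports on true blasts hedge low,
# reports on healthy cells sit near zero. Invented numbers.
reports = [0.40, 0.45, 0.35, 0.50, 0.42, 0.10, 0.15, 0.08, 0.20, 0.12]
truth   = [1,    1,    1,    1,    1,    0,    0,    0,    0,    0]

def fit_llo(reports, truth):
    """Pick gamma, delta minimizing log loss on the practice set."""
    best = (float("inf"), 1.0, 0.0)
    for gi in range(21):           # gamma in [0.5, 2.5]
        for di in range(41):       # delta in [-2.0, 2.0]
            gamma, delta = 0.5 + 0.1 * gi, -2.0 + 0.1 * di
            loss = 0.0
            for p, y in zip(reports, truth):
                q = min(max(llo(p, gamma, delta), 1e-9), 1 - 1e-9)
                loss -= math.log(q) if y else math.log(1 - q)
            if loss < best[0]:
                best = (loss, gamma, delta)
    return best[1], best[2]

gamma, delta = fit_llo(reports, truth)
# A hedged "43% sure" on a likely blast gets pushed up toward a confident "yes".
print(gamma, delta, llo(0.43, gamma, delta))
```

With all practice reports hedged toward "no", the fit lands on a positive delta (shift everything up) and a gamma above 1 (stretch the two groups apart), which is exactly the "add 5 degrees" sticker from the analogy.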

The Grand Finale: Does the AI Learn?

The researchers took all these different sets of labels (the "Yes/No" ones, the "Maybe" ones, and the "Math-Fixed" ones) and taught a computer (an AI) to recognize the cells.

  • The Bad News: The AI trained on the "lazy" labels missed almost as many cancers as the humans did.
  • The Good News: The AI trained on the "Math-Fixed" labels became a superhero. It missed far fewer cancers and was much more reliable.

The Takeaway for the Real World

This paper teaches us that when we are looking for rare, dangerous things (like fraud or disease), we cannot just hire more people and hope for the best. If the environment makes them lazy, more people just means more lazy people.

To fix this, organizations need to:

  1. Change the training: Don't let the "practice" data look exactly like the boring real world. Mix it up to keep people alert.
  2. Ask for nuance: Don't just ask "Yes/No." Ask "How sure are you?"
  3. Do a math check: Use a simple formula to correct the group's bias before training the AI.
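Taken together, steps 2 and 3 reduce to a short aggregation pipeline: recalibrate each labeler's probability, then average the results into a soft label for AI training. The gamma/delta values and labeler scores below are placeholders; in practice the parameters come from a fit on feedback items with known answers:

```python
import math

def logit(p):
    return math.log(p / (1 - p))

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

def aggregate_label(reports, gamma=2.0, delta=1.5):
    """Recalibrate each reported probability in log-odds space, then average.
    gamma/delta are placeholder values, not fitted ones."""
    recalibrated = [sigmoid(gamma * logit(p) + delta) for p in reports]
    return sum(recalibrated) / len(recalibrated)

# Three hedged reports on a rare positive: the raw mean sits below 0.5,
# while the recalibrated mean is a usable soft training label above it.
reports = [0.45, 0.40, 0.48]
raw_mean = sum(reports) / len(reports)
soft_label = aggregate_label(reports)
print(round(raw_mean, 2), round(soft_label, 2))
```

The soft label (rather than a thresholded yes/no) is what gets fed to the downstream model, so the crowd's corrected uncertainty survives into training.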

In short: You can't just build a better AI algorithm; you have to build a better human labeling process first. If you fix the human game, the AI wins.