Large-Scale Dataset and Benchmark for Skin Tone Classification in the Wild

This paper addresses the lack of granular data for skin tone fairness by introducing STW, a large-scale, open-access dataset labeled with the 10-tone Monk Skin Tone (MST) scale. The authors benchmark deep learning against classic methods and propose the SkinToneNet model, which achieves state-of-the-art generalization for reliable fairness auditing.

Vitor Pereira Matias, Márcus Vinícius Lobo Costa, João Batista Neto, Tiago Novello de Brito

Published 2026-03-04

Imagine you are trying to teach a robot to understand human skin tones. You want the robot to be fair, accurate, and able to recognize everyone from the palest to the darkest skin, no matter where they are or what the lighting is like.

This paper is about the team's journey to build that robot, and they discovered that the old ways of teaching it were completely broken. Here is the story of their work, broken down into simple concepts.

1. The Problem: The "One-Size-Fits-None" Map

For a long time, scientists tried to describe skin tones using a medical map called the Fitzpatrick Scale, which was originally designed to predict how skin reacts to sunlight, not to describe its color. Think of this like a map that only has six colors: "White," "Light," "Medium," "Dark," "Darker," and "Black."

The problem? Human skin is like a rainbow, not a box of six crayons. This old map is too blurry. It's like trying to describe a sunset using only the words "orange" and "red." It misses all the beautiful pinks, purples, and deep browns in between. Because the map was bad, the robots (AI models) trained on it were also bad at telling people apart fairly.

2. The New Map: The "Monk" Scale

The authors decided to use a better map called the Monk Skin Tone (MST) Scale. Instead of 6 colors, this map has 10 distinct shades, ranging from very light to very dark. It's like swapping a box of 6 crayons for one with 10 carefully chosen shades. This allows for a much more precise description of real human skin.

3. The Missing Puzzle Piece: The "STW" Dataset

To teach the robot, you need a huge library of pictures to study. The problem was that no one had a big, public library of photos labeled with this new 10-color map. Most existing photo libraries were either:

  • Too small: Like trying to learn a language with only 10 words.
  • Private: Locked away in a vault so no one could check the work.
  • Biased: Mostly showing people with light skin, ignoring the rest of the world.

The Solution: The team built the Skin Tone in The Wild (STW) dataset.

  • The Size: They collected 42,000 photos of 3,500 different people.
  • The "In The Wild" part: These aren't photos taken in a perfect studio with professional lights. They are photos taken in the real world—on the street, in parks, in bad lighting. This is crucial because that's where the robot will actually be used.
  • The Quality: They didn't just guess the labels. They had experts look at the photos and agree on the skin tone, ensuring the "answer key" was correct.
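The paper doesn't spell out its exact annotation protocol in this summary, but the idea of checking that experts agree on the "answer key" can be sketched as a pairwise agreement rate. This is a minimal stdlib-only illustration; the function name and the toy labels are hypothetical, not taken from the paper:

```python
from itertools import combinations

def pairwise_agreement(labels_by_annotator):
    """Fraction of photos on which each pair of annotators chose the
    same MST tone (1-10), averaged over all annotator pairs.

    labels_by_annotator: one inner list per annotator, each holding
    one label per photo (all lists cover the same photos, in order).
    """
    rates = []
    for a, b in combinations(labels_by_annotator, 2):
        same = sum(1 for x, y in zip(a, b) if x == y)
        rates.append(same / len(a))
    return sum(rates) / len(rates)

# Three hypothetical annotators labeling the same five photos.
ann1 = [3, 7, 7, 2, 9]
ann2 = [3, 7, 6, 2, 9]
ann3 = [3, 8, 7, 2, 9]
print(pairwise_agreement([ann1, ann2, ann3]))
```

A high agreement rate is what lets the authors treat the labels as a trustworthy ceiling for model accuracy; disagreements on neighboring tones can also be resolved by majority vote.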

4. The Experiment: Old Tools vs. New Brains

The team wanted to see if old-school computer tricks could solve this, or if they needed modern "Deep Learning" (AI that learns like a brain).

  • The Old Way (Classic Computer Vision): Imagine trying to sort marbles by color using a simple ruler and a flashlight. You measure the average color of a patch of skin.
    • The Result: It failed miserably. In the real world, with shadows and weird lighting, this method was basically guessing. It was like trying to identify a song by humming one note; it just didn't work.
  • The New Way (Deep Learning / SkinToneNet): Imagine giving the robot a super-brain (a Vision Transformer) that can look at the whole picture, understand the context, and learn patterns.
    • The Result: This worked incredibly well. The robot learned to see skin tones with nearly the same accuracy as the human experts who labeled the photos.
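The "ruler and flashlight" baseline amounts to averaging the color of a skin patch and snapping it to the nearest reference swatch. Here is a minimal sketch of that idea; the swatch ramp and pixel values are made-up placeholders, not the official MST colors or the paper's actual pipeline:

```python
def mean_rgb(patch):
    """Average color of a skin patch given as a list of (R, G, B) pixels."""
    n = len(patch)
    return tuple(sum(pixel[c] for pixel in patch) / n for c in range(3))

def nearest_tone(rgb, swatches):
    """1-based index of the swatch closest in plain squared-RGB distance."""
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(swatches)), key=lambda i: dist2(rgb, swatches[i])) + 1

# Hypothetical 10-step ramp from light to dark (NOT the real MST hex values).
SWATCHES = [(246 - 22 * i, 237 - 22 * i, 228 - 22 * i) for i in range(10)]

patch = [(120, 95, 80), (118, 97, 82), (122, 93, 78)]  # a few face pixels
print(nearest_tone(mean_rgb(patch), SWATCHES))  # → 7
```

The fragility is easy to see: a shadow or a warm indoor lamp shifts every pixel's RGB values wholesale, so the same face can land on a completely different swatch, which is exactly why this approach collapses "in the wild."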

5. The "Cheating" Trap

One of the most important discoveries in this paper was about cheating.
In many AI tests, researchers accidentally let the robot "cheat" by showing it the same person in both the training (learning) and testing (exam) phases.

  • The Analogy: It's like quizzing a student with the very same word problems they practiced on, names and all. They score 100%, but they didn't learn math; they just remembered the answers.
  • The Fix: The team made sure that if a person's photo was in the "learning" pile, none of their photos were in the "testing" pile. When they did this, the old "ruler and flashlight" methods collapsed to random guessing, while the new AI brain stayed strong.
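The fix is known as a subject-disjoint (or group-wise) split: partition by person ID first, then assign photos. Libraries like scikit-learn offer this as `GroupShuffleSplit`; here is a stdlib-only sketch with hypothetical toy data:

```python
import random

def subject_disjoint_split(photos, test_fraction=0.2, seed=0):
    """Split (person_id, photo_path) pairs so no person appears in both sets."""
    people = sorted({pid for pid, _ in photos})
    rng = random.Random(seed)
    rng.shuffle(people)
    n_test = max(1, int(len(people) * test_fraction))
    test_people = set(people[:n_test])
    train = [p for p in photos if p[0] not in test_people]
    test = [p for p in photos if p[0] in test_people]
    return train, test

# Hypothetical toy data: 4 people, 3 photos each.
photos = [(pid, f"img_{pid}_{k}.jpg") for pid in "abcd" for k in range(3)]
train, test = subject_disjoint_split(photos)
assert {p for p, _ in train}.isdisjoint({p for p, _ in test})
```

Splitting by person instead of by photo is what exposed the weakness of the classic methods: once memorizing individuals was off the table, only the deep model still generalized.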

6. The Reality Check: Auditing the World

Finally, the team used their new, super-accurate robot to check other famous photo libraries (like CelebA or VGGFace2) that the world uses to train AI.

  • The Shock: They found that these famous libraries are heavily biased. They are full of light-skinned people and almost completely empty of people with the darkest skin tones (the top of the 10-tone scale).
  • The Consequence: If you build a facial recognition system using those old libraries, it will work great for some people and fail miserably for others. The authors' tool can now act as a "fairness auditor" to catch these mistakes before they cause harm.
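Once a trustworthy classifier exists, auditing a dataset reduces to counting its predictions per tone and flagging the bins that are nearly empty. A hedged sketch, with hypothetical prediction counts skewed toward light tones (the thresholds and numbers are illustrative, not the paper's results):

```python
from collections import Counter

def audit(tone_predictions, n_tones=10, min_share=0.02):
    """Map each MST tone to (share of dataset, underrepresented flag)."""
    counts = Counter(tone_predictions)
    total = len(tone_predictions)
    return {
        tone: (counts.get(tone, 0) / total, counts.get(tone, 0) / total < min_share)
        for tone in range(1, n_tones + 1)
    }

# Hypothetical predictions over 100 photos, heavily skewed toward light tones.
preds = [1] * 40 + [2] * 30 + [3] * 15 + [4] * 8 + [5] * 4 + [6] * 2 + [7] * 1
flagged = [tone for tone, (_, low) in audit(preds).items() if low]
print(flagged)  # → [7, 8, 9, 10]
```

Running such an audit before training would reveal, at a glance, that the darkest end of the scale is missing, which is the blind spot the authors found in the popular face datasets.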

The Big Takeaway

This paper tells us three main things:

  1. Stop using the old, blurry maps. We need a detailed, 10-step scale to describe skin tone accurately.
  2. Old computer tricks don't work in the real world. You need modern AI (Deep Learning) to handle the messiness of real life.
  3. We have a blind spot. The data we use to train AI is missing huge chunks of humanity. We need to fix our datasets to make technology fair for everyone.

The authors have made their data and code public, giving the world a new, fairer way to teach computers to see us all.