Apparent Age Estimation: Challenges and Outcomes

This paper reviews and evaluates distribution learning techniques for apparent age estimation, demonstrating that while methods like AMRL achieve high accuracy, significant demographic biases persist due to inconsistent feature focus, necessitating diverse datasets and rigorous fairness protocols beyond mere technical improvements.

Justin Rainier Go, Lorenz Bernard Marqueses, Mikaella Kaye Martinez, John Kevin Patrick Sarmiento, Abien Fred Agarap

Published 2026-04-07

Imagine you walk into a room and instantly guess someone's age based on how they look. Maybe they have wrinkles, maybe their skin is glowing, or maybe they just have a "young at heart" vibe. That guess is what computer scientists call Apparent Age Estimation.

This paper is like a report card for a group of computer programs (AI models) trying to make that same guess. The researchers from De La Salle University wanted to see if these programs are good at guessing, but more importantly, if they are fair to everyone, regardless of their race or gender.

Here is the story of their findings, broken down into simple concepts:

1. The Problem: The AI is Biased

Think of the AI models as students taking a test. To study for the test, they were given a massive stack of flashcards (datasets) featuring mostly famous people from movies and Wikipedia.

  • The Issue: The flashcards were heavily skewed. They had way more pictures of white men than anyone else.
  • The Result: The AI became a "whiz kid" at guessing the age of white men but struggled terribly when looking at Asian or African American women. It was like a student who studied only for math but was suddenly asked to solve poetry problems—they just didn't have the right tools.

2. The Experiment: Trying New Study Methods

The researchers tried to fix this by teaching the AI three different ways to learn:

  • Method A (The Old Way): Just memorizing the answers (Cross-Entropy Loss).
  • Method B (The Statistical Way): Looking at the "average" and how much the answers usually vary (Mean-Variance Loss).
  • Method C (The Two-Step Way): First guessing a rough age, then making a small correction to get it right (Adaptive Mean-Residue Loss or AMRL).

The Winner: Method C (AMRL) was the smartest student. It got the most accurate guesses overall.
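For the curious, the three "study methods" can be sketched as loss functions over a predicted age distribution. This is a simplified illustration, not the authors' exact formulation: the ages, the residue definition (here, probability mass outside the top-K most likely ages), and the weighting constants are all assumptions for the sake of the example.

```python
import math

AGES = list(range(101))  # treat age estimation as classification over ages 0..100

def softmax(logits):
    # convert raw model scores into a probability for each age
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy_loss(probs, true_age):
    # Method A: penalize only the probability assigned to the exact true age
    return -math.log(probs[true_age])

def mean_variance_loss(probs, true_age, lam=0.05):
    # Method B: push the distribution's mean toward the true age,
    # and keep the distribution narrow (low variance)
    mean = sum(k * p for k, p in zip(AGES, probs))
    var = sum(p * (k - mean) ** 2 for k, p in zip(AGES, probs))
    return (mean - true_age) ** 2 / 2 + lam * var

def mean_residue_loss(probs, true_age, top_k=10, lam=1.0):
    # Method C (simplified sketch of AMRL): a mean term plus a "residue"
    # term penalizing probability mass far from the top-K likely ages
    mean = sum(k * p for k, p in zip(AGES, probs))
    top = sorted(range(len(probs)), key=lambda k: probs[k], reverse=True)[:top_k]
    residue = sum(p for k, p in enumerate(probs) if k not in top)
    return (mean - true_age) ** 2 / 2 - lam * math.log(max(1.0 - residue, 1e-12))
```

All three reward a distribution peaked at the right age; Methods B and C additionally punish guesses that are spread out or centered on the wrong age, which is why they tend to produce sharper, more accurate estimates.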

3. The Twist: Accuracy vs. Fairness

Here is where it gets tricky. Even though Method C was the most accurate on average, it still had blind spots.

  • The "Eye" Test: The researchers used a special tool called a "saliency map" (think of it as a heat map that shows where the AI is looking).
    • When looking at a white male, the AI correctly focused on the eyes and mouth.
    • When looking at an Asian or African American woman, the AI got confused. It started looking at the neck, the forehead, or even the background! It was like a detective looking at the wrong clues.
  • The Trade-off: They found that if they taught the AI using a more diverse set of photos (the FairFace dataset), the AI became much fairer. It stopped making huge mistakes for specific groups, even if its overall average score dropped slightly. It was like a teacher deciding to grade everyone fairly, even if it meant the top student's score went down a little.
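A saliency map simply asks: if we nudge one pixel, how much does the predicted age change? Pixels that move the prediction a lot are where the model is "looking." Real tools estimate this with gradients through a neural network; the sketch below uses finite differences on a hypothetical toy scorer (`toy_age_model` is an invented stand-in, not anything from the paper).

```python
def toy_age_model(image, weights):
    # hypothetical linear "age scorer": a weighted sum of pixel intensities,
    # standing in for a real neural network
    return sum(w * x for wrow, irow in zip(weights, image)
               for w, x in zip(wrow, irow))

def saliency_map(score_fn, image, eps=1e-3):
    # estimate |d score / d pixel| for every pixel by nudging it slightly
    base = score_fn(image)
    sal = []
    for i, row in enumerate(image):
        sal_row = []
        for j in range(len(row)):
            bumped = [r[:] for r in image]   # copy the image
            bumped[i][j] += eps              # nudge one pixel
            sal_row.append(abs(score_fn(bumped) - base) / eps)
        sal.append(sal_row)
    return sal
```

In this toy setup, a pixel with a large weight (say, an "eye region" pixel) lights up in the map, while ignored pixels stay dark; the paper's observation is that for some demographic groups the bright spots land on the wrong regions entirely.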

4. Why Should You Care? (The Real World)

You might think, "So what if a computer guesses age wrong?" But this technology is already being used in the real world:

  • Cosmetics: Brands use it to recommend skincare. If the AI thinks a 30-year-old Asian woman is 50 because it's biased, it might sell her the wrong anti-aging cream.
  • Security: Banks use it to stop fraud. If the AI thinks a teenager looks like an adult (or vice versa) because of their race, it could deny service to a legitimate customer or let a fraudster through.
  • The Philippines Context: The authors point out that most of these AI models are trained on Western data. If you use them in the Philippines, they are like a tourist trying to navigate Manila using a map of New York—they will get lost and make mistakes.

5. The Bottom Line

The paper concludes with a simple message: You can't just fix the math; you need to fix the data.

To make AI that is both smart and fair, we need to:

  1. Stop using only Western faces to train these models.
  2. Create local datasets (like photos of Filipino celebrities) so the AI learns what our faces look like as they age.
  3. Be careful with privacy, because facial data is sensitive.

In a nutshell: The AI is getting better at guessing age, but it still carries racial and gender bias because it was taught by a biased teacher. To fix it, we need to give it a more diverse classroom and teach it to look at the whole picture, not just the parts it's used to seeing.
