Imagine you are hiring a new employee for your company. You have three ways to find the right person:
- The Old School Detective: You manually sift through a massive pile of resumes, reading every single one yourself.
- The Robot Assistant: You ask a smart computer program to scan the pile and hand you the top 100 names it thinks are perfect.
- The Dream Team: You let the Robot Assistant give you a shortlist first, and then you use your own judgment to review that list and find a few more people on your own.
This paper is a real-world experiment to see which of these three methods is the fairest when it comes to gender. Specifically, does one method accidentally favor men over women (or vice versa) more than the others?
Here is the breakdown of what the researchers found, using some simple analogies.
The Setup: A Giant Resume Library
The study took place at Jobindex, Denmark's biggest job site. They looked at over 58,000 job postings and nearly 1.3 million candidates over two years.
- The Problem: The data didn't include candidates' self-reported gender (and the researchers had no photos to go on), so they had to infer gender from each candidate's first name. It's like trying to guess whether a mystery box contains a red or blue ball just by reading the label on the box. Their name-based inference was about 99% accurate.
- The Goal: To see if the final list of people the recruiters actually contacted had a balanced mix of men and women, or if it was skewed.
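The name-to-gender step can be pictured as a simple lookup. This is only a hedged sketch: the tiny `NAME_TABLE` below is made up for illustration, and the study's actual classifier and name database are not described in this summary.

```python
# Illustrative only: a toy name-based gender lookup, NOT the paper's classifier.
NAME_TABLE = {
    "anne": "female",
    "mette": "female",
    "lars": "male",
    "peter": "male",
}

def infer_gender(full_name: str) -> str:
    """Return 'female', 'male', or 'unknown' based on the first name."""
    first = full_name.strip().split()[0].lower()
    return NAME_TABLE.get(first, "unknown")

print(infer_gender("Lars Jensen"))  # -> male
print(infer_gender("Alex Smith"))   # -> unknown (ambiguous names can't be classified)
```

A real system would use a much larger name database and report a confidence score, which is how the study could claim roughly 99% accuracy on the names it did classify.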
The Three Scenarios & The Results
1. The Human Detective (Manual Search)
The Analogy: Imagine a librarian who has to find books in a giant library without a computer. They walk the aisles, pick up books, and decide which ones to recommend.
The Result: When recruiters searched manually, they did a decent job, but they still had a slight bias. They tended to "click" on and contact more men than women. However, the more time they spent thinking and reviewing, the fairer their list became. It was like a chef tasting the soup more times; the more they checked, the better the balance got.
2. The Robot Assistant (AI Only)
The Analogy: Imagine a robot that has read every resume in the library but learned from the past. If the past was biased (e.g., in the past, people mostly hired men for certain jobs), the robot might think, "Oh, men are the best fit!" and keep suggesting men.
The Result: The AI was actually less fair than the humans. It consistently suggested fewer women than men. This is likely because the AI was trained on old data where human recruiters had already been biased. The robot was just copying the mistakes of the past.
3. The Dream Team (Human + AI)
The Analogy: This is the secret sauce. Imagine the Robot hands you a shortlist of 100 candidates. You review it, and then you go back to the library to find a few more people yourself.
The Result: This was the winner. The combination of AI and Human produced the fairest lists of all.
- Why? The combination wasn't merely as good as its better member; it outperformed both the AI alone and the human alone.
- The "Inspiration" Effect: When recruiters looked at the AI's list first, it seemed to "wake them up." Even though the AI's list was biased, seeing it made the recruiters more aware. When they went back to search manually after seeing the AI list, they ended up finding a much more balanced mix of men and women than if they had searched manually from the start.
The Big Takeaway: "More Than the Sum of Its Parts"
The most surprising finding is that Human + AI is better than either one alone.
Think of it like a GPS and a local driver.
- If you only use the GPS (AI), you might get stuck in a traffic jam because the map is outdated.
- If you only use the local driver (Human), they might take a shortcut they know, but they might miss a better route they haven't seen in a while.
- If you use both, the GPS gives you the big picture, and the driver adjusts based on the current reality. The result is a smoother, fairer ride.
In this study, the AI gave the recruiters a "nudge." Even though the AI's suggestions weren't perfect, looking at them made the humans more deliberate and careful in their final choices, leading to a more diverse group of candidates.
Other Interesting Nuggets
- Job Types Matter: In fields usually dominated by women (like childcare), recruiters actually contacted proportionally more men than the candidate pool would predict. In fields dominated by men (like plumbing), they contacted proportionally more women. It seems recruiters may be subconsciously trying to "fix" the gender imbalance in specific fields.
- Fairness Doesn't Hurt Quality: The researchers checked whether fairer shortlists meant "worse" candidates. They didn't: candidates on more balanced lists responded positively to recruiters' outreach at the same rate, regardless of gender.
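The "contacted more than the pool would predict" comparison above boils down to comparing two proportions. Here is a hedged sketch with invented numbers (the study's real figures are not reproduced here):

```python
# Illustrative only: compare the share of women among contacted candidates
# with their share of the underlying candidate pool. All numbers are made up.

def female_share(genders):
    """Fraction inferred female, ignoring candidates with unknown gender."""
    known = [g for g in genders if g in ("female", "male")]
    return sum(g == "female" for g in known) / len(known)

pool      = ["female"] * 300 + ["male"] * 700  # e.g. a male-dominated field
contacted = ["female"] * 45  + ["male"] * 55   # recruiters' outreach list

gap = female_share(contacted) - female_share(pool)
print(f"pool: {female_share(pool):.0%}, "
      f"contacted: {female_share(contacted):.0%}, gap: {gap:+.0%}")
# A positive gap means women were contacted more often than the pool predicts,
# which is the pattern the study reports for male-dominated fields.
```

With these toy numbers the pool is 30% women but the contacted list is 45% women, a +15-point gap in the direction the study observed for fields like plumbing.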
The Bottom Line
If you want to hire fairly, don't just rely on a robot, and don't just rely on a human. Use the robot to get a head start, but keep the human in the loop to make the final call. The combination creates a safety net that catches the biases of both the machine and the human, resulting in a much fairer hiring process.