WorkRB: A Community-Driven Evaluation Framework for AI in the Work Domain

WorkRB is the first open-source, community-driven benchmark designed to unify fragmented research in work-domain AI by organizing 13 diverse tasks into a modular framework that supports cross-study comparison, multilingual evaluation, and the integration of sensitive employment data.

Matthias De Lange, Warre Veys, Federico Retyk, Daniel Deniz, Warren Jouanneau, Mike Zhang, Aleksander Bielinski, Emma Jouffroy, Nicole Clobes, Nina Baranowska, David Graus, Marc Palyart, Rabih Zbib, D
Published 2026-04-16

Imagine the world of hiring and job matching as a massive, chaotic library. Right now, if you want to find the perfect book (a job candidate) for a specific reader (an employer), you have to ask different librarians who all speak different languages, use different cataloging systems, and measure "success" in completely different ways. One librarian might say, "I found a match!" while another says, "That's not even in the same genre."

This is the problem WorkRB solves.

Here is a simple breakdown of the paper using everyday analogies:

1. The Problem: A Tower of Babel in Hiring

Currently, companies and researchers trying to build AI to help hire people are all working in silos.

  • Different Dictionaries: Some use the European dictionary (ESCO), others use the American one (O*NET), and others use their own made-up lists.
  • Different Rules: One team tests their AI on "finding similar jobs," while another tests on "extracting skills from a resume." You can't compare their scores because they are playing different games.
  • The Privacy Wall: Real hiring data (like salaries and career histories) is super sensitive. It's like a diary; companies can't just hand it over to researchers to test their tools. This makes it hard to improve AI safely.

The Result: Progress is slow because no one can agree on how to measure if an AI is actually good at its job.

2. The Solution: WorkRB (The Universal Translator & Scoreboard)

The authors created WorkRB (Work Research Benchmark). Think of it as a universal "Gym" for AI models.

Instead of building a new gym for every type of exercise, WorkRB is one giant facility with 13 different workout stations (tasks) where any AI can come to get tested.

  • The 13 Stations: These include tasks like:
    • Matching a Job to a Skill: "If I'm a 'Chef,' what skills do I need?"
    • Matching a Skill to a Job: "If I know 'Python,' what jobs can I do?"
    • Cleaning Up Titles: Turning "Guru of Code" into the official title "Software Engineer."
    • Finding Candidates: "Who is the best person for this specific project?"
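Several of these stations boil down to ranking problems: given a job, rank candidate skills; given a skill, rank candidate jobs. A minimal sketch of how such a station might score a model, using a standard precision@k metric — the function and data here are illustrative, not WorkRB's actual API:

```python
# Hypothetical sketch: scoring a "job -> skill" matching task as a
# ranking problem, the way a benchmark station might grade a model.
# All names and data below are illustrative, not WorkRB's real API.

def precision_at_k(ranked_skills, relevant_skills, k):
    """Fraction of the top-k predicted skills that are truly relevant."""
    top_k = ranked_skills[:k]
    hits = sum(1 for skill in top_k if skill in relevant_skills)
    return hits / k

# A toy model ranks candidate skills for the job title "Chef".
predicted = ["food preparation", "menu planning", "python", "kitchen safety"]
gold = {"food preparation", "menu planning", "kitchen safety"}

score = precision_at_k(predicted, gold, k=3)
print(f"precision@3 = {score:.2f}")  # 2 of the top 3 predictions are correct
```

Because every model is graded with the same metric on the same data, scores from different teams finally become comparable.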

3. How It Works: The "Lego" System

WorkRB is built like a set of Lego bricks (modular design).

  • Plug-and-Play: You can snap in your own AI model (the "player") and snap in your own dataset (the "challenge").
  • The Privacy Shield: If a company has sensitive data it can't share publicly, it can still use WorkRB to test its AI internally. The framework runs the test locally and reports the score, without requiring the data to be uploaded anywhere public. It's like taking a driving test in your own car on a closed track, but being graded by the same rules as everyone else.
  • The Multilingual Superpower: Most AI tools are great at English but terrible at other languages. WorkRB is like a polyglot translator. It can test an AI in 28 different languages at once, ensuring the AI works just as well for a job seeker in Sweden as it does for one in Spain.
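The "Lego" idea above can be sketched in a few lines: a model and a dataset are interchangeable components, and because evaluation runs wherever the data lives, only the aggregate score ever needs to leave the building. The class and function names below are illustrative, not WorkRB's actual interfaces:

```python
# Hypothetical sketch of the plug-and-play design: snap in any model
# (a function from input text to a prediction) and any dataset, and
# evaluate locally so private data never leaves the owner's machine.
# Names are illustrative, not WorkRB's real interfaces.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Example:
    text: str   # e.g. a raw job title
    label: str  # e.g. the official normalized title

def evaluate(model: Callable[[str], str], dataset: List[Example]) -> float:
    """Run the model on every example and return accuracy.

    Only this aggregate score needs to be shared publicly; the
    examples themselves stay wherever they are stored.
    """
    correct = sum(1 for ex in dataset if model(ex.text) == ex.label)
    return correct / len(dataset)

# Snap in a toy "title normalization" model...
def toy_model(raw_title: str) -> str:
    return "Software Engineer" if "code" in raw_title.lower() else raw_title

# ...and a dataset that could live entirely on a company's own servers.
private_data = [
    Example("Guru of Code", "Software Engineer"),
    Example("Data Scientist", "Data Scientist"),
]

print(f"accuracy = {evaluate(toy_model, private_data):.2f}")
```

Swapping in a different model or dataset changes nothing about the scoring code, which is what makes results comparable across public and private settings.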

4. The Team: A Community Potluck

This isn't just one company trying to fix the problem. It's a community potluck involving three groups:

  • Industry (The Chefs): Companies like TechWolf and Malt bring real-world problems and data.
  • Academia (The Food Critics): Universities bring the science, math, and new ways to measure success.
  • Government (The Rule Makers): Organizations like the European Union and the US Department of Labor provide the official "menus" (standardized job and skill lists) so everyone is speaking the same language.

5. Why This Matters

  • Fairness: It stops companies from "cooking the books" by only testing their AI on easy data. Everyone uses the same scoreboard.
  • Safety: It helps ensure that AI used for hiring follows strict privacy laws (like GDPR) because the framework is designed to handle sensitive data responsibly.
  • Inclusion: By testing in 28 languages, it ensures that AI doesn't just work for English speakers, but for the whole world.

The Bottom Line

WorkRB is the first time the hiring world has agreed on a standardized, open-source rulebook. It allows anyone—from a startup in Paris to a university in Copenhagen—to test their hiring AI, see how it stacks up against the best, and improve it, all while keeping private data safe and speaking every language. It turns a chaotic free-for-all into a fair, organized, and collaborative sport.
