WorkRB: A Community-Driven Evaluation Framework for AI in the Work Domain

WorkRB is the first open-source, community-driven benchmark designed to unify fragmented research in work-domain AI by organizing 13 diverse tasks into a modular framework that supports cross-study comparison, multilingual evaluation, and the integration of sensitive employment data.

Matthias De Lange, Warre Veys, Federico Retyk, Daniel Deniz, Warren Jouanneau, Mike Zhang, Aleksander Bielinski, Emma Jouffroy, Nicole Clobes, Nina Baranowska, David Graus, Marc Palyart, Rabih Zbib, D
Published 2026-04-16

Imagine the world of hiring and job matching as a massive, chaotic library. Right now, if you want to find the perfect book (a job candidate) for a specific reader (an employer), you have to ask different librarians who all speak different languages, use different cataloging systems, and measure "success" in completely different ways. One librarian might say, "I found a match!" while another says, "That's not even in the same genre."

This is the problem WorkRB solves.

Here is a simple breakdown of the paper using everyday analogies:

1. The Problem: A Tower of Babel in Hiring

Currently, companies and researchers trying to build AI to help hire people are all working in silos.

  • Different Dictionaries: Some use the European dictionary (ESCO), others use the American one (O*NET), and others use their own made-up lists.
  • Different Rules: One team tests their AI on "finding similar jobs," while another tests on "extracting skills from a resume." You can't compare their scores because they are playing different games.
  • The Privacy Wall: Real hiring data (like salaries and career histories) is super sensitive. It's like a diary; companies can't just hand it over to researchers to test their tools. This makes it hard to improve AI safely.

The Result: Progress is slow because no one can agree on how to measure if an AI is actually good at its job.

2. The Solution: WorkRB (The Universal Translator & Scoreboard)

The authors created WorkRB (Work Research Benchmark). Think of it as a universal "Gym" for AI models.

Instead of building a new gym for every type of exercise, WorkRB is one giant facility with 13 different workout stations (tasks) where any AI can come to get tested.

  • The 13 Stations: These include tasks like:
    • Matching a Job to a Skill: "If I'm a 'Chef,' what skills do I need?"
    • Matching a Skill to a Job: "If I know 'Python,' what jobs can I do?"
    • Cleaning Up Titles: Turning "Guru of Code" into the official title "Software Engineer."
    • Finding Candidates: "Who is the best person for this specific project?"
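Several of these stations boil down to ranking problems: given a job, rank candidate skills; given a skill, rank candidate jobs. A minimal sketch of how such a station might score a model, using a standard precision@k metric — the function and data here are illustrative, not WorkRB's actual API:

```python
# Hypothetical sketch: scoring a "job -> skill" matching task as a
# ranking problem, the way a benchmark station might grade a model.
# All names and data below are illustrative, not WorkRB's real API.

def precision_at_k(ranked_skills, relevant_skills, k):
    """Fraction of the top-k predicted skills that are truly relevant."""
    top_k = ranked_skills[:k]
    hits = sum(1 for skill in top_k if skill in relevant_skills)
    return hits / k

# A toy model ranks candidate skills for the job title "Chef".
predicted = ["food preparation", "menu planning", "python", "kitchen safety"]
gold = {"food preparation", "menu planning", "kitchen safety"}

score = precision_at_k(predicted, gold, k=3)
print(f"precision@3 = {score:.2f}")  # 2 of the top 3 predictions are correct
```

Because every model is graded with the same metric on the same data, scores from different teams finally become comparable.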

3. How It Works: The "Lego" System

WorkRB is built like a set of Lego bricks (modular design).

  • Plug-and-Play: You can snap in your own AI model (the "player") and snap in your own dataset (the "challenge").
  • The Privacy Shield: If a company has sensitive data it can't share publicly, it can still use WorkRB to test its AI internally. The framework runs the test locally and reports the score, without requiring the data to be uploaded anywhere public. It's like taking a driving test in your own car on a closed track, but being graded by the same rules as everyone else.
  • The Multilingual Superpower: Most AI tools are great at English but terrible at other languages. WorkRB is like a polyglot translator. It can test an AI in 28 different languages at once, ensuring the AI works just as well for a job seeker in Sweden as it does for one in Spain.
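The "Lego" idea above can be sketched in a few lines: a model and a dataset are interchangeable components, and because evaluation runs wherever the data lives, only the aggregate score ever needs to leave the building. The class and function names below are illustrative, not WorkRB's actual interfaces:

```python
# Hypothetical sketch of the plug-and-play design: snap in any model
# (a function from input text to a prediction) and any dataset, and
# evaluate locally so private data never leaves the owner's machine.
# Names are illustrative, not WorkRB's real interfaces.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Example:
    text: str   # e.g. a raw job title
    label: str  # e.g. the official normalized title

def evaluate(model: Callable[[str], str], dataset: List[Example]) -> float:
    """Run the model on every example and return accuracy.

    Only this aggregate score needs to be shared publicly; the
    examples themselves stay wherever they are stored.
    """
    correct = sum(1 for ex in dataset if model(ex.text) == ex.label)
    return correct / len(dataset)

# Snap in a toy "title normalization" model...
def toy_model(raw_title: str) -> str:
    return "Software Engineer" if "code" in raw_title.lower() else raw_title

# ...and a dataset that could live entirely on a company's own servers.
private_data = [
    Example("Guru of Code", "Software Engineer"),
    Example("Data Scientist", "Data Scientist"),
]

print(f"accuracy = {evaluate(toy_model, private_data):.2f}")
```

Swapping in a different model or dataset changes nothing about the scoring code, which is what makes results comparable across public and private settings.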

4. The Team: A Community Potluck

This isn't just one company trying to fix the problem. It's a community potluck involving three groups:

  • Industry (The Chefs): Companies like TechWolf and Malt bring real-world problems and data.
  • Academia (The Food Critics): Universities bring the science, math, and new ways to measure success.
  • Government (The Rule Makers): Organizations like the European Union and the US Department of Labor provide the official "menus" (standardized job and skill lists) so everyone is speaking the same language.

5. Why This Matters

  • Fairness: It stops companies from "cooking the books" by only testing their AI on easy data. Everyone uses the same scoreboard.
  • Safety: It helps ensure that AI used for hiring follows strict privacy laws (like GDPR) because the framework is designed to handle sensitive data responsibly.
  • Inclusion: By testing in 28 languages, it ensures that AI doesn't just work for English speakers, but for the whole world.

The Bottom Line

WorkRB is the first time the hiring world has agreed on a standardized, open-source rulebook. It allows anyone—from a startup in Paris to a university in Copenhagen—to test their hiring AI, see how it stacks up against the best, and improve it, all while keeping private data safe and speaking every language. It turns a chaotic free-for-all into a fair, organized, and collaborative sport.
