VietJobs: A Vietnamese Job Advertisement Dataset

This paper introduces VietJobs, the first large-scale public corpus of 48,092 Vietnamese job advertisements, and benchmarks generative large language models on tasks like job classification and salary estimation to advance Vietnamese NLP and labor market analytics.

Hieu Pham Dinh, Hung Nguyen Huy, Mo El-Haj

Published 2026-03-06

Imagine Vietnam's job market as a massive, bustling bazaar. For years, if you wanted to study how people buy and sell jobs there, you had to wander around with a notebook, trying to remember snippets of conversations from a few stalls. It was messy, incomplete, and hard to compare.

This paper introduces VietJobs, which is like handing researchers a giant, well-organized digital library containing 48,092 job advertisements from across Vietnam. It's the first large-scale public collection of its kind, covering everything from factory mechanics in the north to software engineers in the south.

Here is a breakdown of what they did, using some everyday analogies:

1. The Big Collection (The Dataset)

Think of the researchers as digital librarians. They didn't just pick a few random books; they scanned the entire "Job Bazaar" (specifically the TopCV website) for a week.

  • The Scale: They collected over 15 million words. That's enough text to fill well over a hundred typical novels, all of it job ads.
  • The Variety: They organized these ads into 16 different categories (like sorting books into "Fiction," "Science," and "History"). They found that "Business & Sales" and "Factory Work" are the biggest sections, while niche fields like "Agriculture" are smaller but still important.
  • The Money Talk: They also looked at the salary tags. They found that most jobs pay between 10 and 15 million VND a month (roughly the monthly rent on a decent apartment in a big city), but some high-level manager jobs go way higher. Interestingly, about 30% of the ads didn't list a price at all and said "negotiable" instead, like a shopkeeper who only tells you the price if you ask nicely.
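
To make the "Money Talk" step concrete, here is a minimal sketch of how salary tags could be turned into numbers. The tag strings (such as "10 - 15 triệu" or "Thoả thuận") and the function names are assumptions for illustration; this summary does not show the real tag format or the authors' own cleaning code.

```python
import re
from typing import List, Optional, Tuple


def parse_salary_tag(tag: str) -> Optional[Tuple[float, float]]:
    """Return (low, high) in million VND, or None if the tag lists no figure.

    Assumption: a tag with no digits (e.g. "negotiable") has no listed salary.
    """
    numbers = [float(n) for n in re.findall(r"\d+(?:\.\d+)?", tag)]
    if not numbers:
        return None
    if len(numbers) == 1:
        # A single figure ("up to 20") is treated as both ends of the range.
        return (numbers[0], numbers[0])
    return (min(numbers), max(numbers))


def share_unlisted(tags: List[str]) -> float:
    """Fraction of tags that give no salary figure at all."""
    if not tags:
        return 0.0
    return sum(parse_salary_tag(t) is None for t in tags) / len(tags)
```

Run over the whole collection, something like `share_unlisted` is the kind of count that would produce the "about 30% negotiable" figure, although the authors may have computed it differently.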

2. The Test Drive (The AI Experiments)

Now that they had this giant library, they wanted to see if Artificial Intelligence (AI) could act like a smart career counselor. They gave the AI two main tasks:

  1. The Sorting Hat: Can the AI read a job description and instantly guess which of the 16 categories it belongs to? (e.g., "Is this a 'Cooking' job or an 'IT' job?")
  2. The Crystal Ball: Can the AI look at the job details (title, location, experience) and predict the salary?
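
For readers who want to picture what "giving the AI a task" looks like, the two tasks can be sketched as text prompts. Everything here is an illustrative assumption: the prompt wording, the function names, and the short category list (the dataset has 16 categories, which this summary does not list in full). The authors' actual prompts may look quite different.

```python
from typing import Dict, List, Optional

# Placeholder labels only; the real dataset uses 16 categories.
CATEGORIES: List[str] = ["IT", "Business & Sales", "Factory Work", "Agriculture"]


def classification_prompt(description: str, categories: List[str]) -> str:
    """Build a 'Sorting Hat' prompt: pick one category for a job ad."""
    options = "\n".join(f"- {c}" for c in categories)
    return (
        "Read the Vietnamese job advertisement below and answer with "
        "exactly one category from the list.\n"
        f"Categories:\n{options}\n\n"
        f"Advertisement:\n{description}\n\n"
        "Category:"
    )


def salary_prompt(ad: Dict[str, str]) -> str:
    """Build a 'Crystal Ball' prompt from title, location and experience."""
    return (
        "Estimate the monthly salary in million VND for this job.\n"
        f"Title: {ad['title']}\n"
        f"Location: {ad['location']}\n"
        f"Experience: {ad['experience']}\n"
        "Salary (million VND):"
    )


def parse_category(reply: str, categories: List[str]) -> Optional[str]:
    """Map a model reply to a known category by exact match, else None."""
    cleaned = reply.strip().strip(".").lower()
    for c in categories:
        if cleaned == c.lower():
            return c
    return None
```

The exact-match check in `parse_category` is deliberately strict: a reply that rambles instead of naming one category counts as a miss rather than being guessed at.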

They tested different types of AI "brains":

  • The Global Travelers: Big models trained on many languages (like Qwen and Llama).
  • The Local Guides: Models specifically trained on Southeast Asian languages.
  • The Vietnamese Natives: Models built specifically for the Vietnamese language.

3. The Results (Who Won?)

The results were a bit like a race with different terrains:

  • For Sorting Jobs (Classification):
    The Global Travelers (specifically Qwen2.5) were the champions. Even without being taught the specific rules of the Vietnamese job market, their massive training on many languages helped them understand the context better than the local-only models. It's like a polyglot tourist who can figure out a menu in a foreign country just by looking at the pictures, even if they don't speak the language fluently.

  • For Predicting Salaries (Estimation):
    This was harder. The AI struggled to guess the exact price tag just by reading the ad. However, the Local Guide model (Llama-SEA-LION) performed the best. It was like a local real estate agent who knows the neighborhood prices better than a foreign expert.

    • The Secret Sauce: The AI got much better when they "fed" it more data. When they trained the AI on both the new VietJobs library and an older, smaller dataset, its salary guesses improved noticeably. It's like studying from two different textbooks instead of just one; the more examples you see, the better you get at guessing the pattern.
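
To see how a "winner" might be decided in each race, here is a minimal scoring sketch. This summary does not say which metrics the authors used, so accuracy (for sorting) and mean absolute error (for salaries) are assumptions chosen because they are common and easy to read.

```python
from typing import Sequence


def accuracy(gold: Sequence[str], predicted: Sequence[str]) -> float:
    """Share of job ads whose predicted category matches the true one."""
    if len(gold) != len(predicted) or not gold:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)


def mean_absolute_error(gold: Sequence[float], predicted: Sequence[float]) -> float:
    """Average gap between true and predicted salary, in the same unit."""
    if len(gold) != len(predicted) or not gold:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(g - p) for g, p in zip(gold, predicted)) / len(gold)
```

With salaries in million VND, a mean absolute error of 3.5 would mean the model's guesses are off by 3.5 million VND on average, which is a handy way to compare one model against another on the same ads.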

4. Why Does This Matter?

Before this paper, studying the Vietnamese job market with computers was like trying to solve a puzzle with half the pieces missing.

  • For Researchers: They now have a complete puzzle to study how language, gender, and location affect hiring.
  • For Society: It helps us understand if job ads are fair or if they have hidden biases (like asking for specific ages or looks).
  • For the Future: This dataset is the foundation for building better AI tools that can help job seekers find the right roles and help companies understand what they are offering.

The Catch (Limitations)

The authors are honest about the flaws. The data only came from one website (TopCV), so it might miss the "underground" job market or informal gigs. Also, some salary numbers were rounded or missing, like a menu with prices that say "Ask for price" instead of listing a number.

In a nutshell: This paper built the first massive, organized map of Vietnam's online job world and tested how well AI can navigate it. The result? AI is getting pretty good at sorting jobs, and with a little more training, it's learning to predict salaries, paving the way for smarter, fairer hiring tools in the future.