VietJobs: A Vietnamese Job Advertisement Dataset

Imagine Vietnam's job market as a massive, bustling bazaar. For years, if you wanted to study how people buy and sell jobs there, you had to wander around with a notebook, trying to remember snippets of conversations from a few stalls. It was messy, incomplete, and hard to compare.

This paper introduces VietJobs, which is like handing researchers a giant, perfectly organized digital library containing 48,000 job advertisements from every corner of Vietnam. It's the first time anyone has gathered this much information in one place, covering everything from factory mechanics in the north to software engineers in the south.

Here is a breakdown of what they did, using some everyday analogies:

1. The Big Collection (The Dataset)

Think of the researchers as digital librarians. They didn't just pick a few random books; they scanned the entire "Job Bazaar" (specifically the TopCV website) for a week.

The Scale: They collected over 15 million words from roughly 48,000 ads. That's far more text than any researcher could read and compare by hand.
The Variety: They organized these ads into 16 different categories (like sorting books into "Fiction," "Science," and "History"). They found that "Business & Sales" and "Factory Work" are the biggest sections, while niche fields like "Agriculture" are smaller but still important.
The Money Talk: They also looked at the salary tags. They found that most jobs pay between 10 and 15 million VND a month (roughly what a decent apartment costs to rent in a big city), but some high-level manager jobs go way higher. Interestingly, just under 30% of the ads didn't list a price at all, saying "negotiable" instead, like a shopkeeper who only tells you the price if you ask nicely.

2. The Test Drive (The AI Experiments)

Now that they had this giant library, they wanted to see if Artificial Intelligence (AI) could act like a smart career counselor. They gave the AI two main tasks:

The Sorting Hat: Can the AI read a job description and instantly guess which of the 16 categories it belongs to? (e.g., "Is this a 'Cooking' job or an 'IT' job?")
The Crystal Ball: Can the AI look at the job details (title, location, experience) and predict the salary?

They tested different types of AI "brains":

The Global Travelers: Big models trained on many languages (like Qwen and Llama).
The Local Guides: Models specifically trained on Southeast Asian languages.
The Vietnamese Natives: Models built specifically for the Vietnamese language.

3. The Results (Who Won?)

The results were a bit like a race with different terrains:

For Sorting Jobs (Classification):
The Global Travelers (specifically Qwen2.5) were the champions. Given just a few example ads, and without any training on the Vietnamese job market itself, their broad multilingual training helped them grasp the context better than the local-only models. It's like a polyglot tourist who can work out a menu in a foreign country from a few familiar words, even without speaking the language fluently.
For Predicting Salaries (Estimation):
This was harder. The AI struggled to guess the exact price tag just by reading the ad. However, the Local Guide model (Llama-SEA-LION) performed the best. It was like a local real estate agent who knows the neighborhood prices better than a foreign expert.
The Secret Sauce: The AI got better when they "fed" it more data. When they trained the AI on both the new VietJobs library and an older, smaller dataset, its salary guesses improved. It's like studying from two different textbooks instead of just one; the more examples you see, the better you get at guessing the pattern.

4. Why Does This Matter?

Before this paper, studying the Vietnamese job market with computers was like trying to solve a puzzle with half the pieces missing.

For Researchers: They now have far more of the puzzle in hand to study how language, gender, and location affect hiring.
For Society: It helps us understand if job ads are fair or if they have hidden biases (like asking for specific ages or looks).
For the Future: This dataset is the foundation for building better AI tools that can help job seekers find the right roles and help companies understand what they are offering.

The Catch (Limitations)

The authors are honest about the flaws. The data only came from one website (TopCV), so it might miss the "underground" job market or informal gigs. Also, some salary numbers were rounded or missing, like a menu with prices that say "Ask for price" instead of listing a number.

In a nutshell: This paper built the first large, organized map of Vietnam's online job world and tested how well AI can navigate it. The result? AI can already sort jobs with moderate accuracy, and with more training data it gets better at predicting salaries, paving the way for smarter, fairer hiring tools in the future.

Here is a detailed technical summary of the paper "VietJobs: A Vietnamese Job Advertisement Dataset":

1. Problem Statement

The paper addresses the scarcity of large-scale, publicly available, and well-annotated datasets for Vietnamese Natural Language Processing (NLP), specifically within the domain of labor market analysis.

Data Gap: While recruitment language research is advanced in high-resource languages (e.g., English), Vietnamese remains under-resourced. Existing resources are either too small, lack full textual descriptions (focusing only on titles), or are not publicly accessible.
Linguistic Challenges: Vietnamese presents unique NLP challenges, including tonal structures, compounding morphology, and frequent code-switching with English, which complicate tokenization and semantic interpretation.
Socio-economic Blind Spots: There is a lack of computational tools to analyze how recruitment language reflects social norms (e.g., gender, age, appearance biases) and labor market dynamics in Vietnam.

2. Methodology

A. Dataset Construction (VietJobs)

Data Collection: The authors compiled 48,092 job advertisements from the platform TopCV.vn across all 34 provinces and municipalities of Vietnam in July 2025.
Extraction Pipeline:
- Crawling: Used the open-source Crawl4AI framework for initial URL acquisition and page crawling.
- Parsing: Employed Large Language Models (GPT-4o and Gemini 2.5) via API to parse diverse HTML templates, extracting structured fields while preserving linguistic integrity.
- Ethics: Collection adhered to institutional ethical guidelines and national regulations; no personally identifiable information (PII) was collected.
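
When an LLM is used to parse HTML into structured fields, its output usually needs checking before it enters the corpus. The sketch below shows one way such a check might look. The field names (`title`, `location`, `category`, `description`, `salary_text`) and the JSON output format are assumptions made for illustration; the paper's actual schema and parser prompts are not given in this summary.

```python
import json
from dataclasses import dataclass
from typing import Optional

# Hypothetical required fields; the real VietJobs schema may differ.
REQUIRED_FIELDS = ("title", "location", "category", "description")


@dataclass
class JobAd:
    title: str
    location: str
    category: str
    description: str
    salary_text: Optional[str] = None  # raw salary string, kept verbatim


def parse_llm_output(raw: str) -> Optional[JobAd]:
    """Return a JobAd if the parser's JSON is well-formed, else None."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    # Every required field must be a non-empty string.
    for name in REQUIRED_FIELDS:
        value = data.get(name)
        if not isinstance(value, str) or not value.strip():
            return None
    salary = data.get("salary_text")
    has_salary = isinstance(salary, str) and salary.strip()
    return JobAd(
        title=data["title"].strip(),
        location=data["location"].strip(),
        category=data["category"].strip(),
        description=data["description"].strip(),
        salary_text=salary.strip() if has_salary else None,
    )
```

Rejecting malformed records outright, rather than repairing them, keeps the original wording of each ad untouched, which matches the stated goal of preserving linguistic integrity.
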
Data Normalization:
- Taxonomy: Mapped 24 raw source categories into 16 consolidated occupational domains (based on ISCO-08, O*NET, and ESCO standards) to ensure consistency.
- Salary Standardization: Extracted and normalized salary data into min, max, and average fields (in millions of VND). Approximately 71.5% of postings contained explicit salary ranges.
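
The two normalization steps above can be sketched as follows. This is a minimal illustration, not the authors' code: the entries in `CATEGORY_MAP` are invented examples (the real 24-to-16 mapping is not reproduced in this summary), and the assumed raw salary format ("10 - 15 triệu", "Thỏa thuận") is a guess at what the source site shows.

```python
import re
from typing import Optional, Tuple

# Hypothetical raw-to-consolidated mapping, for illustration only.
CATEGORY_MAP = {
    "Kinh doanh / Bán hàng": "Business/Sales",
    "Bán lẻ": "Business/Sales",
    "Sản xuất": "Manufacturing",
}

# Matches integers and decimals written with either "." or ",".
NUMBER = re.compile(r"\d+(?:[.,]\d+)?")


def normalize_category(raw: str) -> Optional[str]:
    """Map a raw source category to a consolidated domain, or None if unmapped."""
    return CATEGORY_MAP.get(raw.strip())


def normalize_salary(text: str) -> Optional[Tuple[float, float, float]]:
    """Return (min, max, avg) in millions of VND, or None if no figure is given."""
    if "triệu" not in text.lower():
        return None  # e.g. "Thỏa thuận" (negotiable)
    values = [float(m.replace(",", ".")) for m in NUMBER.findall(text)]
    if not values:
        return None
    # A single figure is treated as both the lower and the upper bound.
    low, high = min(values), max(values)
    return low, high, (low + high) / 2
```

For example, `normalize_salary("10 - 15 triệu")` yields a minimum of 10, a maximum of 15, and an average of 12.5, while a "negotiable" posting yields `None` and would fall among the postings without an explicit range.
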
Corpus Statistics: The dataset contains over 15.4 million tokens with a vocabulary of ~78,000 unique tokens. It covers 16 job categories, with the largest segments being Business/Sales and Manufacturing.
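
Token and vocabulary counts like these depend heavily on the tokenizer. The sketch below counts them under plain lowercased whitespace splitting, which is an assumption made here for simplicity; the summary does not say which tokenizer the authors used, and whitespace splitting breaks Vietnamese multi-syllable words into separate syllables.

```python
from collections import Counter
from typing import Iterable, Tuple


def corpus_stats(texts: Iterable[str]) -> Tuple[int, int]:
    """Return (total tokens, vocabulary size) under whitespace tokenization."""
    counts: Counter = Counter()
    for text in texts:
        counts.update(text.lower().split())
    return sum(counts.values()), len(counts)
```

A word-segmenting tokenizer would report fewer tokens and a different vocabulary size for the same text, so figures from different papers are only comparable when the tokenization matches.
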

B. Experimental Setup

The authors benchmarked 10 Generative Large Language Models (LLMs) across three categories:

Multilingual: Qwen2.5-7B, Llama-3.1-8B, Granite-3.3-8B, Ministral-8B.
ASEAN-focused: Llama-SEA-LION-v3-8B-IT, Sailor2-8B, SeaLLMs-v3-7B.
Vietnamese-specific: PhoGPT-4B, BloomVN-8B, Vistral-7B.

Tasks Evaluated:

Job Category Classification: Predicting one of 16 standardized categories from job descriptions.
Salary Estimation: Predicting salary values (in "X triệu VND") based on structured attributes (title, location, contract type, experience).
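
To make the salary task concrete, the sketch below builds a prompt from the four structured attributes and extracts a number from a reply in the "X triệu VND" form. The prompt wording and both function names are hypothetical; the paper's actual template is not shown in this summary.

```python
import re
from typing import Optional


def build_salary_prompt(title: str, location: str, contract: str, experience: str) -> str:
    """Assemble an illustrative zero-shot prompt from the structured attributes."""
    return (
        "Estimate the monthly salary for this job.\n"
        f"Title: {title}\n"
        f"Location: {location}\n"
        f"Contract type: {contract}\n"
        f"Experience: {experience}\n"
        "Answer in the form 'X triệu VND'."
    )


# A number (with "." or "," as decimal mark) followed by "triệu".
ANSWER = re.compile(r"(\d+(?:[.,]\d+)?)\s*triệu", re.IGNORECASE)


def parse_salary_answer(output: str) -> Optional[float]:
    """Return the predicted salary in millions of VND, or None if unparseable."""
    match = ANSWER.search(output)
    if match is None:
        return None
    return float(match.group(1).replace(",", "."))
```

Returning `None` for replies that do not follow the format makes invalid outputs explicit, so they can be counted separately instead of silently distorting the error metrics.
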

Training Configurations:

Zero-shot: Direct prompting without examples.
Few-shot: Prompting with a few annotated examples.
Fine-tuning: Parameter-efficient fine-tuning using LoRA (rank $r = 8$, $\alpha = 16$) on an NVIDIA A40 GPU. Models were fine-tuned on VietJobs, a separate "Vietnam Jobs Dataset" (Kaggle), and a combination of both.
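
For readers unfamiliar with LoRA, the standard formulation freezes each adapted weight matrix and learns a low-rank correction to it. The symbols $W_0$, $d$, $k$, $A$, and $B$ below are notation introduced here, not taken from the paper, and the summary does not say which weight matrices were adapted.

```latex
\[
h = W_0 x + \frac{\alpha}{r}\, B A x,
\qquad W_0 \in \mathbb{R}^{d \times k},\quad
B \in \mathbb{R}^{d \times r},\quad
A \in \mathbb{R}^{r \times k}
\]
\[
\frac{\alpha}{r} = \frac{16}{8} = 2,
\qquad
\text{trainable parameters per adapted matrix} = r\,(d + k) = 8\,(d + k)
\]
```

With the reported settings the low-rank update is scaled by 2, and each adapted $d \times k$ matrix adds only $8(d + k)$ trainable parameters instead of $dk$, which is what makes fine-tuning 7B to 8B models feasible on a single GPU.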

3. Key Contributions

VietJobs Dataset: The first large-scale, open-access corpus of Vietnamese job ads, offering rich linguistic, demographic, and structural data for 48k+ postings.
Benchmarking Framework: Established a rigorous evaluation protocol for Vietnamese NLP, comparing global, regional, and local models under zero-shot, few-shot, and fine-tuned settings.
Empirical Insights: Provided the first systematic analysis of how different model families (global multilingual, regionally adapted, and Vietnamese-specific) perform on Vietnamese labor market tasks.
Resource Availability: All code, data, and resources are publicly released at https://github.com/VinNLP/VietJobs.

4. Results

Job Category Classification

Best Performer: Qwen2.5-7B-Instruct achieved the highest accuracy (0.47) and Macro F1 (0.42) in the few-shot setting.
Observations:
- Instruction-tuned multilingual models generally outperformed Vietnamese-specific models in zero-shot and few-shot scenarios.
- Fine-tuning did not consistently outperform few-shot prompting; some models (e.g., Ministral-8B) saw performance drops, suggesting overfitting or limited gains from the specific fine-tuning setup.
- Zero-shot performance was generally low (Acc ~0.31 for Qwen), highlighting the need for context or adaptation.
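
The two classification metrics reported above can be computed as in the following sketch. One assumption is made explicit: macro F1 is averaged over the labels that occur in the reference set, which may differ from the averaging convention the authors used.

```python
from typing import Sequence, Tuple


def accuracy_and_macro_f1(y_true: Sequence[str], y_pred: Sequence[str]) -> Tuple[float, float]:
    """Return (accuracy, macro F1), averaging F1 over labels seen in y_true."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("need two non-empty sequences of equal length")
    pairs = list(zip(y_true, y_pred))
    correct = sum(t == p for t, p in pairs)
    f1_scores = []
    for label in sorted(set(y_true)):
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the denominator is 0.
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return correct / len(pairs), sum(f1_scores) / len(f1_scores)
```

Macro F1 gives every category equal weight, so a model that does well on large categories such as Business/Sales but poorly on small ones will score lower on macro F1 than on accuracy, as in the 0.47 versus 0.42 result above.
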

Salary Estimation

Best Performer: Llama-SEA-LION-v3-8B-IT demonstrated the most robust performance across all settings.
- Zero-shot: RMSE 11.72, $R^2$ 0.07.
- Fine-tuned (Combined Datasets): RMSE 12.40, $R^2$ 0.16.
Key Findings:
- Fine-tuning Impact: Fine-tuning helped most when models were trained on the combined dataset (VietJobs + Vietnam Jobs Dataset); for Llama-SEA-LION, $R^2$ rose from 0.07 (zero-shot) to 0.16, although the reported RMSE did not fall (11.72 vs. 12.40). This suggests that data diversity and broader industry coverage matter for salary prediction.
- Regional Advantage: The regionally adapted model (Llama-SEA-LION) outperformed global models (like Llama-3.1) and Vietnamese-specific models, suggesting that regional adaptation to Southeast Asian languages helps in this domain.
- Baseline Issues: Several models (e.g., Vistral-7B) produced invalid outputs or extremely high RMSE in zero-shot settings, failing to parse the salary format correctly.
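
The salary metrics above follow the usual definitions of RMSE and the coefficient of determination. The sketch below computes both from paired true and predicted salaries in millions of VND; how the authors handled invalid model outputs before scoring is not stated in this summary, so the function simply assumes every prediction is a valid number.

```python
import math
from typing import Sequence, Tuple


def rmse_and_r2(y_true: Sequence[float], y_pred: Sequence[float]) -> Tuple[float, float]:
    """Return (RMSE, R^2) for paired true and predicted values."""
    if len(y_true) != len(y_pred) or len(y_true) < 2:
        raise ValueError("need two sequences of equal length with at least 2 items")
    n = len(y_true)
    # Residual sum of squares: squared errors of the predictions.
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    mean = sum(y_true) / n
    # Total sum of squares: squared errors of always predicting the mean.
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when all true values are equal")
    return math.sqrt(ss_res / n), 1 - ss_res / ss_tot
```

Since $R^2$ compares the model against always predicting the mean salary, a value near 0 means the predictions are barely better than that constant guess. On a fixed test set the two metrics are linked by $R^2 = 1 - n \cdot \mathrm{RMSE}^2 / SS_{tot}$, so a lower RMSE always means a higher $R^2$ there.
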

5. Significance and Future Work

Foundational Resource: VietJobs bridges the gap between computational NLP and socio-economic labor market analysis in Vietnam, enabling research into bias, code-switching, and regional economic disparities.
Model Insights: For salary estimation, the regionally adapted multilingual model (trained on ASEAN data) outperformed both generic global models and models trained exclusively on Vietnamese, possibly because it handles code-switching and cultural context better. For classification, however, a general multilingual model (Qwen2.5-7B-Instruct) performed best.
Future Directions:
- Expanding data collection to multiple platforms and time periods to capture temporal trends.
- Incorporating multilingual and demographic data for deeper bias analysis.
- Exploring advanced techniques like Retrieval-Augmented Generation (RAG) and domain-adaptive pretraining.
- Comparing LLMs against traditional ML baselines (e.g., TF-IDF + Logistic Regression) to quantify the specific value of generative models.

In conclusion, VietJobs establishes a new benchmark for Vietnamese NLP. The results suggest that large multilingual models are strong at job classification, while a regionally adapted model fine-tuned on diverse data sources was the most effective for salary estimation, although absolute performance on both tasks remains modest.