AutoViVQA: A Large-Scale Automatically Constructed Dataset for Vietnamese Visual Question Answering

This paper introduces AutoViVQA, a large-scale automatically constructed dataset for Vietnamese Visual Question Answering, and evaluates transformer-based multimodal models alongside various automatic metrics to assess their performance and alignment with human judgment in the Vietnamese context.

Nguyen Anh Tuong, Phan Ba Duc, Nguyen Trung Quoc, Tran Dac Thinh, Dang Duy Lan, Nguyen Quoc Thinh, Tung Le

Published Wed, 11 Ma

Imagine you are trying to teach a robot to understand the world through its eyes and ears. You show it a picture and ask, "What is happening here?" The robot needs to look at the image, understand the question, and give a smart answer. This is called Visual Question Answering (VQA).

For a long time, we've had great teachers and textbooks for English-speaking robots. But for Vietnamese-speaking robots? The library was almost empty. The few books that existed were either too short, too simple, or written by machines that made up facts (hallucinations).

Enter AutoViVQA. Think of this paper as the blueprint for building a massive, high-quality "Vietnamese VQA Library" from scratch, but with a twist: instead of hiring thousands of humans to write every single question (which is slow and expensive), the authors built a super-smart, automated factory to do it.

Here is how they did it, explained with some everyday analogies:

1. The Problem: The "Empty Bookshelf"

Imagine trying to teach a child to read using only a few torn pages from a magazine. That's what researchers were doing with Vietnamese VQA. The existing datasets were small, repetitive, and often didn't test if the robot was actually thinking or just guessing.

2. The Solution: The "AI Assembly Line"

The authors didn't just collect data; they built a factory (a pipeline) to manufacture it.

  • The Raw Materials: They took real-world photos (from a famous collection called MS COCO) and paired them with high-quality Vietnamese descriptions.

  • The Foreman (The LLM): They used a powerful Large Language Model (like a super-intelligent foreman) to act as the question writer. But they didn't just say, "Write some questions." They gave the foreman a strict rulebook.

  • The Rulebook (Reasoning Levels): This is the secret sauce. They told the foreman:

    • Level 1: Just name the object (e.g., "What color is the car?").
    • Level 2: Describe where things are (e.g., "Is the dog under the table?").
    • Level 3: Connect two things (e.g., "Why is the person holding an umbrella?").
    • Level 4 & 5: Get deep into cause-and-effect or reading text inside the image.

    They forced the factory to produce a balanced mix of easy, medium, and hard questions, ensuring the robot learns to reason, not just memorize.
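The "rulebook" idea can be sketched in a few lines of code. This is an illustrative sketch, not the authors' actual pipeline: the level descriptions, prompt wording, and function names below are assumptions made for illustration.

```python
# Illustrative sketch (not the authors' actual prompts): ask an LLM for a
# question pinned to one reasoning level, and iterate over levels evenly
# so the final dataset stays balanced across easy, medium, and hard.
# Level descriptions and wording are assumptions for illustration.

REASONING_LEVELS = {
    1: "naming an object or its attribute",
    2: "describing where objects are relative to each other",
    3: "connecting two objects or actions",
    4: "cause-and-effect reasoning",
    5: "reading text that appears inside the image",
}

def build_prompt(caption: str, level: int) -> str:
    """Build a generation prompt that pins the question to one reasoning level."""
    return (
        f"Image description: {caption}\n"
        f"Write one Vietnamese question that requires {REASONING_LEVELS[level]}, "
        f"plus its answer. The answer must be verifiable from the description."
    )

def balanced_requests(captions, per_level=1):
    """Yield (caption, level) pairs so every level gets equal coverage."""
    for caption in captions:
        for level in REASONING_LEVELS:
            for _ in range(per_level):
                yield caption, level
```

The key design point is that the level is chosen *before* generation, so difficulty balance is enforced by construction rather than hoped for after the fact.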

3. The Quality Control: The "Taste-Test Panel"

In a normal factory, you might just check if the product looks okay. In this paper, they built a Quality Control Panel made of other AI models.

  • The Jury: Every single question and answer generated by the factory was sent to a "jury" of 5 to 10 different AI models.
  • The Vote: These models acted like judges. They asked: "Is this question confusing? Is the answer actually in the picture? Is the Vietnamese natural?"
  • The Filter: If the jury didn't agree that the sample was good, it was thrown in the trash. Only the samples that passed a strict "majority vote" made it into the final dataset.

This is like a chef cooking a meal, then sending it to a panel of 10 food critics. If 8 out of 10 say, "This tastes weird," the dish is scrapped. This ensured the final dataset was clean, accurate, and free of the "robot hallucinations" that usually plague AI-generated content.
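The jury step is, at heart, a majority-vote filter. Here is a minimal sketch of that idea under stated assumptions: the judge interface (a function returning True/False per sample), the 0.8 threshold, and the stub judges are all illustrative, not the paper's actual models or criteria.

```python
# Minimal sketch of the "jury" filter: each candidate QA pair is scored by
# several judge models, and only samples passing a strict majority vote
# survive. Judges here are stub functions; in the paper they are 5-10
# different AI models checking clarity, groundedness, and natural Vietnamese.

def majority_vote_filter(samples, judges, threshold=0.8):
    """Keep a sample only if at least `threshold` of the judges approve it."""
    kept = []
    for sample in samples:
        votes = [judge(sample) for judge in judges]  # each vote is True/False
        if sum(votes) / len(votes) >= threshold:
            kept.append(sample)
    return kept

# Toy usage: 10 stub judges that all apply the same trivial check
# (a real jury would apply different criteria per judge).
judges = [lambda s, k=k: len(s["answer"]) > 0 for k in range(10)]
clean = majority_vote_filter(
    [{"question": "Con chó ở đâu?", "answer": "dưới bàn"}], judges
)
```

With a 0.8 threshold and 10 judges, a sample needs at least 8 approvals to survive, matching the "8 out of 10 critics" intuition above.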

4. The Result: A New Standard

The result is AutoViVQA, a dataset with nearly 20,000 images and over 180,000 answers.

  • Why it matters: When the researchers trained various AI models on the new dataset, the models' answers got significantly better. It wasn't because they changed the robot's brain; it was because they finally gave it a better textbook.
  • The Analogy: Imagine a student taking a test. If the practice questions are vague and full of errors, the student will fail. If the practice questions are clear, varied, and challenging, the student will ace the real exam. AutoViVQA is that perfect set of practice questions for Vietnamese AI.

The Bottom Line

This paper is about solving a "low-resource" problem (not enough data for Vietnamese) by building a self-correcting, automated system. They proved that you don't need a million human annotators to build a great dataset; you just need a smart system, a strict rulebook, and a rigorous quality-checking jury.

They didn't just build a dataset; they built a factory for intelligence that can be used to train better, more culturally aware, and more logical AI for the Vietnamese language.