Leveraging Model Soups to Classify Intangible Cultural Heritage Images from the Mekong Delta

This paper proposes a robust framework combining the hybrid CoAtNet architecture with model soups ensembling to effectively classify Intangible Cultural Heritage images from the Mekong Delta, achieving state-of-the-art performance on the ICH-17 dataset by reducing variance and enhancing generalization in data-scarce, high-similarity settings.

Quoc-Khang Tran, Minh-Thien Nguyen, Nguyen-Khang Pham

Published 2026-03-10

Here is an explanation of the paper, translated into everyday language with some creative analogies.

🌊 The Big Picture: Saving Memories from the Mekong Delta

Imagine the Mekong Delta in Vietnam as a giant, vibrant library of stories, songs, festivals, and crafts. These aren't just books; they are "Intangible Cultural Heritage" (ICH)—things like traditional music, floating markets, and weaving techniques that live in people's minds and actions.

The researchers wanted to build a digital librarian (an AI) that could look at a photo and instantly say, "Ah, this is the Ok Om Bok festival!" or "This is bamboo weaving!"

The Problem:
Building this librarian is hard for three reasons:

  1. Not enough photos: There aren't many high-quality pictures of these specific cultural events.
  2. They look alike: A photo of a temple ceremony (Class 8) might look almost identical to a photo of a sea worship festival (Class 4). It's like trying to tell the difference between two twins wearing the same outfit.
  3. The AI gets confused: When you train a standard AI on so few photos, it tends to "memorize" the training data instead of learning the actual rules. It's like a student who memorizes the answers to a practice test but fails the real exam because the questions were slightly different.

🍲 The Secret Sauce: "Model Soups"

To fix this, the researchers didn't just build one super-smart AI. Instead, they used a technique called Model Soups.

The Analogy: The Chef's Kitchen
Imagine you are a chef trying to make the perfect bowl of soup.

  • The Old Way: You train one chef to make soup. If they have a bad day or burn a batch, the whole thing is ruined.
  • The "Model Soups" Way: You train one chef, but you ask them to make the soup 20 times over a few days. On Day 1, the salt was perfect. On Day 5, the vegetables were crisp. On Day 10, the broth was rich.
  • The Magic: Instead of picking just one of those batches to serve, you take a spoonful from the Day 1 batch, a spoonful from Day 5, and a spoonful from Day 10, and mix them all together into one giant bowl.

In the world of AI, this "mixing" happens inside the computer's brain (the weights). The researchers took the "brain states" of the AI at different moments during its training and averaged them out. The result is a single, super-stable AI that combines the best parts of all those training moments.
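The "mixing" step is literally just an arithmetic average of parameter values. Here is a minimal sketch in plain Python, assuming each saved checkpoint is stored as a dict of parameter lists (real frameworks use tensors, but the arithmetic is identical):

```python
def make_soup(checkpoints):
    """Average several checkpoints into one 'soup' model.

    checkpoints: list of dicts mapping parameter names to lists of floats.
    Returns one dict holding the element-wise mean of every parameter.
    """
    n = len(checkpoints)
    soup = {}
    for name in checkpoints[0]:
        params = [ckpt[name] for ckpt in checkpoints]
        soup[name] = [sum(vals) / n for vals in zip(*params)]
    return soup

# Three hypothetical "brain states" saved at different training moments
day1 = {"layer.weight": [0.2, 0.4]}
day5 = {"layer.weight": [0.4, 0.6]}
day10 = {"layer.weight": [0.6, 0.8]}

soup = make_soup([day1, day5, day10])
print(soup)  # one checkpoint holding the element-wise mean of the three
```

The key point: the output is one ordinary model of the same size, not three models bolted together.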

Why is this cool?
Usually, if you want a better AI, you have to run 10 different AIs at the same time and let them vote on the answer. That's slow and expensive (like hiring 10 chefs to cook at once).
Model Soups is different. You only need one final AI model to run. It's like having one chef who has tasted every version of the soup and knows exactly how to balance the flavors perfectly. It's fast, cheap, and smarter.

🏗️ The Engine: CoAtNet

To make the soup, they needed a really good pot. They used an AI architecture called CoAtNet.

  • The Metaphor: Think of looking at a painting.
    • Convolution (The "Local" Eye): This part of the AI looks at small details, like the texture of a weave or the pattern on a drum. It's great at seeing the "trees."
    • Attention (The "Global" Eye): This part looks at the whole picture to understand the context, like seeing that the drum is being played in a festival crowd. It's great at seeing the "forest."
  • CoAtNet is a hybrid that combines both eyes in one network: convolution in its early layers for fine detail, attention in its later layers for context. That makes it particularly good at complex, messy images where details and context both matter.
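To make the two "eyes" concrete, here is a toy NumPy sketch (not the real CoAtNet, which stacks MBConv and relative-attention stages): a 3×3 convolution where each output pixel sees only its local neighbourhood, and a self-attention step where every pixel looks at every other pixel.

```python
import numpy as np

def local_eye(x, kernel):
    """Convolution: each output pixel sees only a 3x3 neighbourhood."""
    h, w = x.shape
    out = np.zeros_like(x)
    padded = np.pad(x, 1)  # zero-pad the border
    for i in range(h):
        for j in range(w):
            out[i, j] = np.sum(padded[i:i + 3, j:j + 3] * kernel)
    return out

def global_eye(x):
    """Self-attention: every pixel attends to every other pixel."""
    tokens = x.reshape(-1, 1)                  # flatten pixels into tokens
    scores = tokens @ tokens.T                 # pairwise similarity
    weights = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)
    return (weights @ tokens).reshape(x.shape) # context-weighted mixture

rng = np.random.default_rng(0)
image = rng.standard_normal((4, 4))            # a tiny fake image
kernel = np.full((3, 3), 1 / 9)                # a simple blur as the local filter

# Hybrid: combine what both eyes saw
features = local_eye(image, kernel) + global_eye(image)
print(features.shape)  # (4, 4)
```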

📊 The Results: A Winning Recipe

The researchers tested this "Soup + CoAtNet" recipe on a dataset of 7,406 images representing 17 different cultural categories.

  • The Competition: They compared their method against famous AI models like ResNet, DenseNet, and ViT (Vision Transformer).
  • The Outcome: The "Model Soup" approach won. It achieved 72.36% accuracy, beating all the other models.
  • The "Why": By mixing the different versions of the AI, they reduced the "noise" (variance). It's like taking the average of 10 weather forecasts; you get a more reliable prediction than trusting just one meteorologist.
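The forecast analogy can be checked numerically. This small sketch (assumed numbers, not the paper's data) simulates many rounds of ten noisy "forecasts" of a true temperature and compares the typical error of a single forecast against the error of their average:

```python
import random

random.seed(0)
TRUE_TEMP, NOISE, TRIALS = 20.0, 2.0, 1000

single_errs, avg_errs = [], []
for _ in range(TRIALS):
    # Ten noisy forecasts: the truth plus independent random noise
    forecasts = [TRUE_TEMP + random.gauss(0, NOISE) for _ in range(10)]
    single_errs.append(abs(forecasts[0] - TRUE_TEMP))
    avg_errs.append(abs(sum(forecasts) / 10 - TRUE_TEMP))

print(sum(single_errs) / TRIALS)  # typical error trusting one meteorologist
print(sum(avg_errs) / TRIALS)     # typical error of the averaged forecast
```

The averaged forecast comes out consistently closer to the truth, which is exactly the variance reduction the soup exploits.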

🔍 The Science Behind the Magic: Why It Works

The paper also did some detective work to show why this works better than simply averaging the answers of several separate AIs (a technique called "Soft Voting").

  • The Map (MDS): Using Multidimensional Scaling (MDS), they created a map of how the different AI models "think."
    • Soft Voting is like gathering a group of friends who all think exactly the same way. If they are all wrong, they are all wrong together.
    • Model Soups gathers friends who have different perspectives. One might focus on the color, another on the shape. When you mix their opinions, you get a much more balanced view.
  • The Result: The "Soup" models were spread out on the map, meaning they were diverse. This diversity is what makes the final prediction so robust.
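One simple way to build such a map is to measure, for every pair of models, how often they disagree, and then let MDS turn those distances into 2-D coordinates. A toy sketch with made-up predictions (the model names and labels below are hypothetical, not the paper's):

```python
from itertools import combinations

# Hypothetical class predictions from four models on the same ten images
predictions = {
    "soup_ckpt_1": [0, 1, 2, 3, 4, 0, 1, 2, 3, 4],
    "soup_ckpt_2": [0, 1, 2, 3, 4, 1, 2, 2, 3, 4],
    "voter_a":     [0, 1, 2, 3, 4, 0, 1, 2, 3, 0],
    "voter_b":     [0, 1, 2, 3, 4, 0, 1, 2, 3, 0],
}

def disagreement(p, q):
    """Fraction of images on which two models predict different classes."""
    return sum(a != b for a, b in zip(p, q)) / len(p)

# These pairwise distances are what MDS would project onto a 2-D "map"
for (m1, p1), (m2, p2) in combinations(predictions.items(), 2):
    print(f"{m1} vs {m2}: {disagreement(p1, p2):.1f}")
```

In this toy map the two "voters" sit at distance zero (same perspective, same mistakes), while the soup checkpoints sit apart, illustrating the diversity the paper observed.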

🚀 The Takeaway

This paper shows that you don't always need more data or more powerful computers to get better AI results. Sometimes, you just need to be smarter about how you combine the knowledge you already have.

By taking a single training process, saving the "best moments" along the way, and blending them into a Model Soup, the researchers created a digital guardian for the Mekong Delta's culture that is more accurate, more stable, and more efficient than anything built before.

In short: They didn't just build a smarter AI; they built a wiser one by teaching it to listen to its own past selves.