Leveraging Model Soups to Classify Intangible Cultural Heritage Images from the Mekong Delta

This paper proposes a robust framework combining the hybrid CoAtNet architecture with model soups ensembling to effectively classify Intangible Cultural Heritage images from the Mekong Delta, achieving state-of-the-art performance on the ICH-17 dataset by reducing variance and enhancing generalization in data-scarce, high-similarity settings.

Quoc-Khang Tran, Minh-Thien Nguyen, Nguyen-Khang Pham

Published 2026-03-10

Here is an explanation of the paper, translated into everyday language with some creative analogies.

🌊 The Big Picture: Saving Memories from the Mekong Delta

Imagine the Mekong Delta in Vietnam as a giant, vibrant library of stories, songs, festivals, and crafts. These aren't just books; they are "Intangible Cultural Heritage" (ICH)—things like traditional music, floating markets, and weaving techniques that live in people's minds and actions.

The researchers wanted to build a digital librarian (an AI) that could look at a photo and instantly say, "Ah, this is the Ok Om Bok festival!" or "This is bamboo weaving!"

The Problem:
Building this librarian is hard for three reasons:

  1. Not enough photos: There aren't many high-quality pictures of these specific cultural events.
  2. They look alike: A photo of a temple ceremony (Class 8) might look almost identical to a photo of a sea worship festival (Class 4). It's like trying to tell the difference between two twins wearing the same outfit.
  3. The AI gets confused: When you train a standard AI on so few photos, it tends to "memorize" the training data instead of learning the actual rules. It's like a student who memorizes the answers to a practice test but fails the real exam because the questions were slightly different.

🍲 The Secret Sauce: "Model Soups"

To fix this, the researchers didn't just build one super-smart AI. Instead, they used a technique called Model Soups.

The Analogy: The Chef's Kitchen
Imagine you are a chef trying to make the perfect bowl of soup.

  • The Old Way: You train one chef to make soup. If they have a bad day or burn a batch, the whole thing is ruined.
  • The "Model Soups" Way: You train one chef, but you ask them to make the soup 20 times over a few days. On Day 1, the salt was perfect. On Day 5, the vegetables were crisp. On Day 10, the broth was rich.
  • The Magic: Instead of picking just one of those batches to serve, you take a spoonful from the Day 1 batch, a spoonful from Day 5, and a spoonful from Day 10, and mix them all together into one giant bowl.

In the world of AI, this "mixing" happens inside the computer's brain (the weights). The researchers took the "brain states" of the AI at different moments during its training and averaged them out. The result is a single, super-stable AI that combines the best parts of all those training moments.
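The "mixing" step is literally just an arithmetic average of parameter values. Here is a minimal sketch in plain Python, assuming each saved checkpoint is stored as a dict of parameter lists (real frameworks use tensors, but the arithmetic is identical):

```python
def make_soup(checkpoints):
    """Average several checkpoints into one 'soup' model.

    checkpoints: list of dicts mapping parameter names to lists of floats.
    Returns one dict holding the element-wise mean of every parameter.
    """
    n = len(checkpoints)
    soup = {}
    for name in checkpoints[0]:
        params = [ckpt[name] for ckpt in checkpoints]
        soup[name] = [sum(vals) / n for vals in zip(*params)]
    return soup

# Three hypothetical "brain states" saved at different training moments
day1 = {"layer.weight": [0.2, 0.4]}
day5 = {"layer.weight": [0.4, 0.6]}
day10 = {"layer.weight": [0.6, 0.8]}

soup = make_soup([day1, day5, day10])
print(soup)  # one checkpoint holding the element-wise mean of the three
```

The key point: the output is one ordinary model of the same size, not three models bolted together.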

Why is this cool?
Usually, if you want a better AI, you have to run 10 different AIs at the same time and let them vote on the answer. That's slow and expensive (like hiring 10 chefs to cook at once).
Model Soups is different. You only need one final AI model to run. It's like having one chef who has tasted every version of the soup and knows exactly how to balance the flavors perfectly. It's fast, cheap, and smarter.

🏗️ The Engine: CoAtNet

To make the soup, they needed a really good pot. They used an AI architecture called CoAtNet.

  • The Metaphor: Think of looking at a painting.
    • Convolution (The "Local" Eye): This part of the AI looks at small details, like the texture of a weave or the pattern on a drum. It's great at seeing the "trees."
    • Attention (The "Global" Eye): This part looks at the whole picture to understand the context, like seeing that the drum is being played in a festival crowd. It's great at seeing the "forest."
  • CoAtNet is a hybrid that combines both eyes in one network: convolution in its early layers for fine detail, attention in its later layers for context. That makes it particularly good at complex, messy images where details and context both matter.
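To make the two "eyes" concrete, here is a toy NumPy sketch (not the real CoAtNet, which stacks MBConv and relative-attention stages): a 3×3 convolution where each output pixel sees only its local neighbourhood, and a self-attention step where every pixel looks at every other pixel.

```python
import numpy as np

def local_eye(x, kernel):
    """Convolution: each output pixel sees only a 3x3 neighbourhood."""
    h, w = x.shape
    out = np.zeros_like(x)
    padded = np.pad(x, 1)  # zero-pad the border
    for i in range(h):
        for j in range(w):
            out[i, j] = np.sum(padded[i:i + 3, j:j + 3] * kernel)
    return out

def global_eye(x):
    """Self-attention: every pixel attends to every other pixel."""
    tokens = x.reshape(-1, 1)                  # flatten pixels into tokens
    scores = tokens @ tokens.T                 # pairwise similarity
    weights = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)
    return (weights @ tokens).reshape(x.shape) # context-weighted mixture

rng = np.random.default_rng(0)
image = rng.standard_normal((4, 4))            # a tiny fake image
kernel = np.full((3, 3), 1 / 9)                # a simple blur as the local filter

# Hybrid: combine what both eyes saw
features = local_eye(image, kernel) + global_eye(image)
print(features.shape)  # (4, 4)
```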

📊 The Results: A Winning Recipe

The researchers tested this "Soup + CoAtNet" recipe on a dataset of 7,406 images representing 17 different cultural categories.

  • The Competition: They compared their method against famous AI models like ResNet, DenseNet, and ViT (Vision Transformer).
  • The Outcome: The "Model Soup" approach won. It achieved 72.36% accuracy, beating all the other models.
  • The "Why": By mixing the different versions of the AI, they reduced the "noise" (variance). It's like taking the average of 10 weather forecasts; you get a more reliable prediction than trusting just one meteorologist.
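The forecast analogy can be checked numerically. This small sketch (assumed numbers, not the paper's data) simulates many rounds of ten noisy "forecasts" of a true temperature and compares the typical error of a single forecast against the error of their average:

```python
import random

random.seed(0)
TRUE_TEMP, NOISE, TRIALS = 20.0, 2.0, 1000

single_errs, avg_errs = [], []
for _ in range(TRIALS):
    # Ten noisy forecasts: the truth plus independent random noise
    forecasts = [TRUE_TEMP + random.gauss(0, NOISE) for _ in range(10)]
    single_errs.append(abs(forecasts[0] - TRUE_TEMP))
    avg_errs.append(abs(sum(forecasts) / 10 - TRUE_TEMP))

print(sum(single_errs) / TRIALS)  # typical error trusting one meteorologist
print(sum(avg_errs) / TRIALS)     # typical error of the averaged forecast
```

The averaged forecast comes out consistently closer to the truth, which is exactly the variance reduction the soup exploits.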

🔍 The Science Behind the Magic: Why It Works

The paper also did some detective work to show why this works better than simply averaging the answers of several separate AIs (a technique called "Soft Voting").

  • The Map (MDS): Using Multidimensional Scaling (MDS), they created a map of how the different AI models "think."
    • Soft Voting is like gathering a group of friends who all think exactly the same way. If they are all wrong, they are all wrong together.
    • Model Soups gathers friends who have different perspectives. One might focus on the color, another on the shape. When you mix their opinions, you get a much more balanced view.
  • The Result: The "Soup" models were spread out on the map, meaning they were diverse. This diversity is what makes the final prediction so robust.
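One simple way to build such a map is to measure, for every pair of models, how often they disagree, and then let MDS turn those distances into 2-D coordinates. A toy sketch with made-up predictions (the model names and labels below are hypothetical, not the paper's):

```python
from itertools import combinations

# Hypothetical class predictions from four models on the same ten images
predictions = {
    "soup_ckpt_1": [0, 1, 2, 3, 4, 0, 1, 2, 3, 4],
    "soup_ckpt_2": [0, 1, 2, 3, 4, 1, 2, 2, 3, 4],
    "voter_a":     [0, 1, 2, 3, 4, 0, 1, 2, 3, 0],
    "voter_b":     [0, 1, 2, 3, 4, 0, 1, 2, 3, 0],
}

def disagreement(p, q):
    """Fraction of images on which two models predict different classes."""
    return sum(a != b for a, b in zip(p, q)) / len(p)

# These pairwise distances are what MDS would project onto a 2-D "map"
for (m1, p1), (m2, p2) in combinations(predictions.items(), 2):
    print(f"{m1} vs {m2}: {disagreement(p1, p2):.1f}")
```

In this toy map the two "voters" sit at distance zero (same perspective, same mistakes), while the soup checkpoints sit apart, illustrating the diversity the paper observed.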

🚀 The Takeaway

This paper shows that you don't always need more data or more powerful computers to get better AI results. Sometimes, you just need to be smarter about how you combine the knowledge you already have.

By taking a single training process, saving the "best moments" along the way, and blending them into a Model Soup, the researchers created a digital guardian for the Mekong Delta's culture that is more accurate, more stable, and more efficient than anything built before.

In short: They didn't just build a smarter AI; they built a wiser one by teaching it to listen to its own past selves.