Imagine you are walking through a bustling market in South Asia. It's a sensory explosion: a temple bell rings, a rickshaw horn blares, a street vendor shouts, a tiger roars in a nearby sanctuary, and a train rumbles in the distance. All these sounds happen at the exact same time, blending into one chaotic noise.
The Problem:
For a computer to understand this, it's like trying to identify every single ingredient in a giant, mixed-up fruit smoothie just by taking a sip. Traditional approaches, built on features called MFCCs (Mel-Frequency Cepstral Coefficients), are like tasting the smoothie and guessing the ingredients from a short list of flavors. They get confused when the flavors (sounds) are too blended together, and they struggle when too many things happen at once.
The Solution:
The researchers in this paper decided to stop "tasting" the sound and start looking at it. They turned the audio into a picture called a spectrogram.
- The Analogy: Think of a spectrogram as a sonic fingerprint or a heat map of sound. Instead of just hearing the noise, the computer sees a colorful image where the horizontal axis is time, the vertical axis is frequency (how high or low the sound is), and the colors show how much energy (loudness) each frequency carries at each moment.
- The Magic: In this picture, the train engine might look like a thick, low red bar, while a flute looks like a thin, high blue line. Even if they overlap, the computer can see the distinct shapes of each sound, just like you can see a red car and a blue bike parked next to each other, even if they are touching.
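The idea behind a spectrogram can be sketched in a few lines of plain Python. This is not the paper's code, just a minimal illustration with made-up frame sizes: the signal is cut into short frames, and a brute-force DFT turns each frame into a column of "how loud is each frequency right now" values.

```python
import math

def spectrogram(signal, frame_size=128, hop=64):
    """Magnitude spectrogram: each column is the DFT magnitude of one
    short frame, so time runs along one axis and frequency along the other."""
    frames = []
    for start in range(0, len(signal) - frame_size + 1, hop):
        frame = signal[start:start + frame_size]
        mags = []
        for k in range(frame_size // 2):  # non-negative frequency bins only
            re = sum(frame[n] * math.cos(2 * math.pi * k * n / frame_size)
                     for n in range(frame_size))
            im = -sum(frame[n] * math.sin(2 * math.pi * k * n / frame_size)
                      for n in range(frame_size))
            mags.append(math.hypot(re, im))
        frames.append(mags)
    return frames  # frames[t][k] = energy of frequency bin k at time t

# A pure 1 kHz tone sampled at 8 kHz shows up as one bright horizontal band.
sr, frame_size = 8000, 128
tone = [math.sin(2 * math.pi * 1000 * n / sr) for n in range(1024)]
spec = spectrogram(tone, frame_size)
peak_bin = max(range(len(spec[0])), key=lambda k: spec[0][k])
peak_hz = peak_bin * sr / frame_size
print(peak_hz)  # 1000.0: the tone's frequency bin dominates every frame
```

Real pipelines use an FFT (e.g., `numpy.fft.rfft`) plus a window function and often a mel scale, but the picture they produce is the same kind of time-versus-frequency heat map described above.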
How They Did It:
- The Dataset (The Training School): They built a massive library of sounds called SAS-KIIT. It contains 21 specific South Asian sounds, from religious prayers (Azan, Aroti) and traditional instruments (Tanpura, Tabla) to nature sounds (Tigers, storms) and city noise (Rickshaw horns). They also mixed these sounds together randomly to simulate real life.
- The Brain (The CNN): They fed these "sound pictures" (spectrograms) into a special type of AI brain called a Convolutional Neural Network (CNN). You can think of this CNN as a super-observant detective that looks at the spectrogram images and learns to recognize the "shapes" of different sounds.
- The Goal (Multilabel Classification): The goal wasn't just to say "This is a train." It was to say, "This is a train AND a temple bell AND a rickshaw." This is called multilabel classification.
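The multilabel part can be sketched concretely. This is a hedged illustration, not the paper's implementation: the label names and scores below are hypothetical. The key design choice is that each class gets its own independent sigmoid "yes/no" decision (rather than a softmax, which would force the network to pick exactly one winner), so several sounds can be detected in the same clip.

```python
import math

# Hypothetical label set, loosely inspired by the sounds described above.
LABELS = ["train", "temple bell", "rickshaw", "tabla", "tiger"]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def predict_labels(logits, threshold=0.5):
    """Multilabel decision: squash each class's raw score (logit) to a
    probability independently, then keep every class above the threshold."""
    probs = [sigmoid(s) for s in logits]
    return [label for label, p in zip(LABELS, probs) if p >= threshold]

# Made-up logits for one spectrogram: the network is confident about a
# train and a temple bell, and doubtful about everything else.
logits = [2.3, 1.1, -0.4, -2.0, -3.1]
print(predict_labels(logits))  # ['train', 'temple bell']
```

Because each class is thresholded on its own, the output can be one label, several labels, or none at all, which is exactly what a mixed street recording requires.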
The Results:
The researchers tested their new "Sound Detective" against the old "Taste Tester" (MFCC-based methods) and several larger, more complex AI models.
- The Outcome: The Spectrogram detective won hands down. It was much better at untangling the messy mix of sounds.
- On the South Asian dataset, it got 96% accuracy.
- On a global city noise dataset, it got 85% accuracy.
- Why it matters: The old methods got confused by the chaos. The new method, by looking at the visual patterns of sound, could pick out the individual voices in the crowd with high precision.
Why This Is a Big Deal:
- Cultural Preservation: It helps us document and understand the unique, chaotic, and beautiful soundscapes of South Asia that are often ignored by standard technology.
- Real-World Use: Imagine a city sensor that can listen to a street and automatically report: "There is a siren, a construction jackhammer, and a dog barking." This helps with urban planning, safety, and monitoring.
- Efficiency: Their model is not only more accurate but also simpler and faster than some of the massive, complex AI models currently in use.
In a Nutshell:
This paper teaches computers to stop trying to "hear" a messy room and start "seeing" the sound patterns instead. By turning noise into pictures, they built a smarter system that can identify multiple things happening at once, making it a powerful tool for understanding the noisy, vibrant world around us.