Imagine you are trying to listen to a conversation in a crowded, noisy room where everyone is whispering, shouting, and clapping. Now, imagine that conversation is happening underwater, where the "people" are killer whales, and the "clapping" is them using sonar to hunt and talk to each other.
This is the challenge Christopher Hauer tackled in his Master's thesis. He wanted to build a computer program that could automatically listen to hours of underwater recordings, find the specific sounds killer whales make (called "clicks"), and figure out which sounds the whales actually produced and which are just reflections off a rock or the water surface ("echoes").
Here is the story of how he did it, broken down into simple concepts.
The Problem: The "Human Bottleneck"
Killer whales are social and use clicks constantly. Sometimes they click to hunt (like a sonar ping), and sometimes they click just to chat.
- The Issue: To understand what the whales are saying, scientists need to listen to thousands of hours of audio and manually mark every single click.
- The Reality: It takes a human expert about 12 hours to label just one minute of audio. If a whale clicks 150 times in a second, the human has to find, mark, and decide if it's a click or an echo for every single one. It's like trying to count grains of sand on a beach by picking them up one by one. It's impossible to do at scale.
The Solution: Teaching a Computer to "See" Sound
Christopher realized that computers are great at looking at pictures, but bad at listening to raw sound waves. So, he decided to turn the sound into pictures.
1. The Three Lenses (Waveform, Spectrogram, Scalogram)
Think of a sound recording as a piece of music.
- The Waveform: This is the raw recording itself: a wiggly line showing the loudness (the sound pressure) going up and down over time.
- The Spectrogram: This is like a heat map. It shows which notes (frequencies) are playing at which time. But there's a catch: you can't perfectly pin down both the exact moment a sound starts and its exact pitch at the same time (like photographing a fast-moving car: freeze the motion and you lose the sense of speed, capture the speed and the car looks blurry).
- The Scalogram (The Magic Lens): Christopher used a special mathematical trick called a "Wavelet Transform." Imagine this as a smart camera lens that automatically zooms in for fast, high-pitched sounds (like a click) and zooms out for slow, low-pitched sounds. This gave the computer a much clearer picture of the tiny, split-second clicks.
He combined these three views into a single colorful image (Red = Waveform, Green = Scalogram, Blue = Spectrogram). To the computer, a whale click didn't look like a sound; it looked like a bright, sharp cone sticking out of a noisy background.
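The fusion of the three views can be sketched in a few lines of numpy. This is a toy illustration, not the thesis code: the window lengths, wavelet scales, and output size are made-up, and only the Red = waveform, Green = scalogram, Blue = spectrogram channel mapping comes from the text above.

```python
import numpy as np

def stft_magnitude(x, win=64, hop=16):
    """Short-time Fourier magnitudes: rows = time frames, cols = frequencies."""
    frames = [x[i:i + win] * np.hanning(win)
              for i in range(0, len(x) - win + 1, hop)]
    return np.abs(np.fft.rfft(frames, axis=1))

def morlet_scalogram(x, scales):
    """Toy wavelet scalogram: convolve with real Morlet kernels at several scales,
    so short scales 'zoom in' on fast transients like clicks."""
    rows = []
    for s in scales:
        t = np.arange(-4 * s, 4 * s + 1)
        kernel = np.exp(-t**2 / (2 * s**2)) * np.cos(5 * t / s)
        rows.append(np.abs(np.convolve(x, kernel, mode="same")))
    return np.array(rows)

def normalize(img):
    img = img - img.min()
    return img / (img.max() + 1e-12)

def resize(a, shape):
    """Nearest-neighbour resize, enough for a sketch."""
    r = np.linspace(0, a.shape[0] - 1, shape[0]).astype(int)
    c = np.linspace(0, a.shape[1] - 1, shape[1]).astype(int)
    return a[np.ix_(r, c)]

# Synthetic "recording": background noise plus one short click-like transient.
rng = np.random.default_rng(0)
x = rng.normal(0, 0.05, 2048)
x[1000:1010] += np.hanning(10) * 2.0          # the "click"

spec = stft_magnitude(x)                       # blue channel source
scal = morlet_scalogram(x, scales=[2, 4, 8, 16])  # green channel source

H, T = 32, 128  # common height and time axis for the fused image
wave_img = np.tile(np.interp(np.linspace(0, len(x) - 1, T),
                             np.arange(len(x)), np.abs(x)), (H, 1))
rgb = np.stack([normalize(wave_img),           # R = waveform
                normalize(resize(scal, (H, T))),   # G = scalogram
                normalize(resize(spec.T, (H, T)))], axis=-1)  # B = spectrogram
print(rgb.shape)  # (32, 128, 3)
```

The result is one ordinary RGB image, which is exactly the kind of input an off-the-shelf image detector expects.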
The Detective Work: YOLO and the "Box"
Christopher used a famous AI system called YOLO (You Only Look Once).
- The Analogy: Imagine YOLO is a security guard looking at a crowd. Its job is to draw a box around anyone who looks suspicious.
- The Challenge: In the beginning, the guard was too lazy. He would draw one giant box around a whole group of people (a "burst" of clicks) instead of boxing each person individually.
- The Fix: Christopher added a "post-processing" step. He used a mathematical tool (First Order Gradient) to find the exact "peaks" inside that giant box. It was like taking that giant box and slicing it up so that every single person got their own little box.
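That slicing step can be sketched with a sign-change test on the first-order gradient: inside the giant box, every point where the gradient flips from rising to falling is a peak, i.e. one individual click. A toy numpy version (the envelope shape and threshold are illustrative, not the thesis' exact post-processing):

```python
import numpy as np

def split_box_into_peaks(envelope, threshold=0.3):
    """Find local maxima inside one detection window using the first-order
    gradient: a peak is where the gradient flips from positive to negative."""
    g = np.diff(envelope)                    # first-order gradient
    sign_flip = (g[:-1] > 0) & (g[1:] <= 0)  # rising, then falling
    peaks = np.where(sign_flip)[0] + 1
    return [p for p in peaks if envelope[p] >= threshold]  # ignore tiny bumps

# Toy "giant box": three click-like bumps caught inside one detection window.
t = np.linspace(0, 1, 300)
envelope = (np.exp(-((t - 0.2) / 0.02) ** 2)
            + 0.8 * np.exp(-((t - 0.5) / 0.02) ** 2)
            + 0.9 * np.exp(-((t - 0.8) / 0.02) ** 2))

peaks = split_box_into_peaks(envelope)
print(len(peaks))  # 3 — one small box per click instead of one giant box
```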
The Brain: The "Context" Detective
Here was the hardest part: How does the computer know if a sound is a Click or an Echo?
- The Problem: An echo sounds almost exactly like the original click. It's like hearing your own voice in a canyon. If you only hear one sound, it's impossible to tell if it's you or the canyon.
- The Human Trick: Biologists know that clicks usually come in a rhythmic pattern (like a drumbeat), while echoes are just random reflections.
- The AI Trick: Christopher taught a second AI, called a Random Forest, to be a "Context Detective." Instead of looking at one sound in isolation, this detective looked at the neighborhood.
- Question: "Did a click happen 2 milliseconds ago?"
- Question: "Is this sound weaker than the one before it?"
- Question: "Does the pattern look like a rhythm?"
- Verdict: If the pattern fits, it's a Click. If it's just a random bounce, it's an Echo.
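Those neighbourhood questions translate naturally into per-detection features. Here is a hedged numpy sketch of that feature extraction; the feature names, the 2 ms interval, and the tolerance are illustrative stand-ins, not the thesis' actual feature set, and in practice each feature vector would be handed to a classifier such as a Random Forest for the final click-vs-echo verdict:

```python
import numpy as np

def context_features(times, amps, i, ici_guess=0.002, tol=0.0005):
    """Neighbourhood features for detection i (times in seconds, amps = amplitudes).
    Each feature answers one of the 'Context Detective' questions above."""
    prev_gap = times[i] - times[i - 1] if i > 0 else np.inf
    next_gap = times[i + 1] - times[i] if i + 1 < len(times) else np.inf
    return {
        # "Did a click happen ~2 ms ago?" — rhythmic predecessor at the expected interval
        "had_click_2ms_ago": abs(prev_gap - ici_guess) < tol,
        # "Is this sound weaker than the one before it?" — echoes lose energy
        "weaker_than_previous": i > 0 and amps[i] < amps[i - 1],
        # "Does the pattern look like a rhythm?" — regular spacing gives a ratio near 1
        "gap_ratio": (prev_gap / next_gap
                      if np.isfinite(next_gap) and next_gap > 0 else np.nan),
    }

# Toy click train every 2 ms, with one weak echo 0.4 ms after the second click.
times = np.array([0.000, 0.002, 0.0024, 0.004])
amps  = np.array([1.0,   0.95,  0.30,   0.97])

feats = [context_features(times, amps, i) for i in range(len(times))]
# The echo at index 2 breaks the rhythm and is weaker than its neighbour:
print(feats[2]["had_click_2ms_ago"], feats[2]["weaker_than_previous"])
```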
The Results: From "Maybe" to "Mostly Right"
Christopher tested his system, which he named CLICK-SPOT, against the old methods:
- Old Methods (PAMGuard): Like a metal detector that beeps at every soda can as well as every coin. It correctly found only 39% of the real clicks and raised many false alarms along the way.
- The New System (CLICK-SPOT): It found 96% of the clicks correctly and could tell the difference between a click and an echo with about 82% accuracy.
Why This Matters
Before this, scientists had to spend years manually labeling data. Now, they can process that data much faster.
- The Catch: The computer is currently slow. It takes 25 minutes of computer time to analyze just 1 minute of audio. It's not fast enough to listen to whales live while they are swimming by.
- The Future: The goal is to make it faster so it can be used on a boat in real-time. Also, because the system learns to recognize "click shapes," it could potentially be retrained to listen to dolphins, sperm whales, or even other animals that use sonar.
In a nutshell: Christopher built a digital pair of eyes and a smart brain that can look at a messy underwater recording, find the tiny "pings" of killer whales, and figure out which ones are the whales talking and which ones are just echoes, saving scientists thousands of hours of work.