ECHO: Frequency-aware Hierarchical Encoding for Variable-length Signals

The paper introduces ECHO, a foundation model that combines a band-split architecture with frequency positional embeddings to achieve state-of-the-art anomaly detection and fault classification on machine signals of variable length and arbitrary sampling rate, without padding or cropping.

Yucong Zhang, Juan Liu, Ming Li

Published Tue, 10 Ma

Imagine you are a mechanic trying to listen to a machine to see if it's healthy or broken. Usually, you'd use a stethoscope. But here's the problem: some machines are huge and rumble slowly (low frequency), while others are tiny and whine quickly (high frequency). Furthermore, some machines run at 16,000 "beats" per second, while others run at 50,000.

Traditional AI models are like mechanics who only have one specific stethoscope. If the machine runs at a speed they haven't seen before, they have to either:

  1. Slow it down or speed it up (resampling), which distorts the sound and loses important clues.
  2. Cut off the beginning or end of the recording to make it fit their stethoscope, which throws away context.

Enter ECHO, a new "super-mechanic" AI that solves these problems. Here is how it works, using simple analogies:

1. The "Frequency-Splitting" Strategy (The Orchestra Analogy)

Most AI models look at a sound recording as one giant, messy block of noise. ECHO is smarter. It acts like a conductor separating an orchestra into sections.

  • The Problem: A violin and a tuba play different notes. If you try to analyze them with the same simple rule, you miss the details.
  • ECHO's Solution: It splits the sound into frequency sub-bands (like separating the violins, the drums, and the brass).
  • The Magic: It doesn't just split them; it gives each section a GPS tag (Frequency Positional Embedding). This tells the AI exactly where in the spectrum of sound this section belongs. Because of this GPS, ECHO understands that a "low rumble" on a slow machine is the same relative sound as a "low rumble" on a fast machine. It doesn't get confused by the speed of the machine; it understands the relationship of the sounds.
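The band-splitting idea can be sketched in a few lines. The snippet below is a hypothetical illustration, not the paper's exact implementation: it chops one spectrogram frame into fixed-size frequency sub-bands and tags each band with its absolute center frequency in Hz (the "GPS tag"). Because the tag is in physical Hz rather than bin index, the same rumble produces the same tag at any sampling rate.

```python
def split_bands_with_freq_pos(spectrogram_frame, sample_rate, band_size):
    """Split one spectrogram frame (a list of magnitudes, one per
    frequency bin) into sub-bands, each tagged with its center
    frequency in Hz. Illustrative sketch; ECHO's actual band layout
    and embedding function may differ."""
    nyquist = sample_rate / 2
    n_bins = len(spectrogram_frame)
    bands = []
    for start in range(0, n_bins, band_size):
        band = spectrogram_frame[start:start + band_size]
        # Absolute center frequency of this band: the "GPS tag".
        center_bin = start + len(band) / 2
        center_hz = center_bin / n_bins * nyquist
        bands.append((center_hz, band))
    return bands
```

With 8 bins at a 16 kHz sampling rate and `band_size=4`, the two bands are tagged 2000 Hz and 6000 Hz; recorded at 32 kHz, the same spectral content would land in bands tagged by the same physical frequencies, which is the invariance the analogy describes.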

2. The "Sliding Patch" (The Moving Window Analogy)

Imagine you are trying to read a book, but your eyes can only focus on a tiny square of text at a time.

  • Old AI: It forces you to cut the book into fixed-size pages. If the book is too long, you chop off the last chapter. If it's too short, you add blank pages. This ruins the story.
  • ECHO's Solution: It uses a sliding window. Imagine a flashlight beam moving smoothly across the text, overlapping slightly with the previous spot.
    • It can read a 1-second sentence or a 1-hour novel without chopping anything up.
    • It captures the flow of the story (the time) perfectly, regardless of how long the signal is. This makes it perfect for streaming, where data comes in continuously, like a live radio broadcast.
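The moving-flashlight idea corresponds to overlapping windows over a variable-length signal. This is a minimal sketch (the paper's patch size, hop, and edge handling are assumptions here): instead of padding short signals or cropping long ones, it slides a fixed window with a hop, and covers any leftover tail with one end-aligned window.

```python
def sliding_patches(signal, patch_len, hop):
    """Yield overlapping fixed-size patches from a variable-length
    signal, with no padding and no cropping. Illustrative sketch,
    not ECHO's exact patching scheme."""
    n = len(signal)
    patches = []
    start = 0
    while start + patch_len <= n:
        patches.append(signal[start:start + patch_len])
        start += hop
    # Cover any leftover tail with one end-aligned window
    # instead of padding with zeros.
    if patches and start < n and (start - hop) + patch_len < n:
        patches.append(signal[n - patch_len:])
    return patches
```

A 10-sample signal with `patch_len=4` and `hop=2` yields four overlapping patches; an 11-sample signal yields five, the last one aligned to the end. The same loop works whether the input is one second or one hour, which is what makes the scheme streaming-friendly.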

3. The "Hierarchical Brain" (The Team of Experts)

ECHO doesn't just look at the sound; it organizes its thinking.

  • It sends each frequency section (the violin, the drums, etc.) to a specialized "mini-expert" (a Vision Transformer).
  • These mini-experts summarize their section and pass a "report card" (a CLS token) to a Head Coach.
  • The Head Coach combines all the report cards into one final summary. This allows ECHO to see the big picture (is the machine broken?) while still remembering the specific details of the high-pitched squeaks or low-pitched thumps.
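The "report card" step can be made concrete with a toy aggregator. In the real model each band's Vision Transformer emits a CLS vector and a learned module combines them; here, as a stand-in assumption, mean pooling plays the Head Coach:

```python
def aggregate_band_summaries(band_cls_tokens):
    """Combine per-band 'report card' (CLS) vectors into one
    machine-level summary by mean pooling. The actual model uses a
    learned aggregator over the CLS tokens; mean pooling is the
    simplest possible stand-in."""
    n_bands = len(band_cls_tokens)
    dim = len(band_cls_tokens[0])
    return [sum(tok[d] for tok in band_cls_tokens) / n_bands
            for d in range(dim)]
```

Given two band summaries `[1.0, 2.0]` and `[3.0, 4.0]`, the combined summary is `[2.0, 3.0]`: one vector describing the whole machine, built from per-band verdicts, which is the hierarchy the analogy describes.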

Why Does This Matter? (The Results)

The authors tested ECHO on a broad collection of machine-sound benchmarks, from factory vibration recordings to the DCASE datasets (from a global challenge series for sound-based machine monitoring).

  • The Result: ECHO beat every other AI model, including the previous champions.
  • The Takeaway: It works better because it doesn't force the data to fit the model. Instead, the model is flexible enough to fit the data, whether the machine is running slow or fast, loud or quiet, for a second or an hour.

In a nutshell:
ECHO is like a universal translator for machine sounds. It doesn't care if the machine speaks "fast" or "slow," or if the recording is "short" or "long." It breaks the sound into meaningful chunks, tags them with location data, and slides its attention across the timeline to find the exact moment a machine starts to sound sick. This makes it a powerful tool for predicting factory failures before they happen.