Analysis-Driven Procedural Generation of an Engine Sound Dataset with Embedded Control Annotations

Imagine you are a chef trying to teach a robot how to cook the perfect steak. The problem is, you only have one tiny, slightly burnt piece of steak to study, and you can't go to the butcher shop to buy more because it's too expensive and the meat is often covered in dirt.

This is exactly the problem car engineers face with engine sounds. They need thousands of hours of clean, perfect engine recordings to teach computers how to understand or recreate engine noises (for things like virtual reality driving games or electric cars that need to sound like they have an engine). But real recordings are expensive, messy (full of wind and road noise), and often lack precise data on exactly how fast the engine was spinning or how hard it was working at every single moment.

Robin Doerfler and Lonce Wyse have built a "digital time machine" and a "sound cloning machine" to solve this. Here is how their paper works, broken down into simple steps:

1. The "Magic Prism" (Analysis)

First, they take a small, real recording of a car engine (about 5 to 10 minutes long). Instead of just listening to the noise, they use a special digital prism (a mathematical tool) to break the sound down into its individual building blocks.

The Analogy: Imagine the engine sound is a complex choir. Most people hear a blur of noise. This tool separates the choir into individual singers: the bass notes, the tenors, the altos, and even the background hum.
The Trick: Engines change pitch as they speed up or slow down, which usually messes up the analysis. The authors use a clever trick called "pitch-adaptive resampling." Think of it like a rubber ruler that stretches and shrinks automatically to keep the engine's "heartbeat" (the RPM) steady while they measure it. This lets them see the exact shape of every note, even as the car accelerates.

2. The "Sound Lego" Kit (Synthesis)

Once they have mapped out the "singers" (the harmonics) and the "background noise" (the roar), they build a digital synthesizer. This isn't just a simple beep-boop machine; it's a highly sophisticated sound Lego kit.

The Harmonics: They create 128 different "sine wave" oscillators (pure tones) that act as the engine's voice.
The Noise: Real engines aren't perfect; they have a rumble, a hiss, and a pop. The system adds "pink noise" (a smooth, static-like sound) and "burst noise" (sharp pops from valves) to make it sound alive.
The Resonance: Just as a guitar body shapes the sound of its strings, a car's exhaust system shapes the sound of the engine. The authors add a bank of "feedback delay networks" to mimic how the exhaust echoes and colors the sound.

3. The "Invisible Ink" (Embedded Annotations)

This is the most distinctive part. Usually, if you want to know the speed of the engine in a recording, you need a separate file with a spreadsheet of numbers. If that file gets lost, the recording loses its labels.

The authors solved this by storing the data inside the audio file itself.

The Analogy: Imagine a multitrack recording where two tracks carry the song and two more carry a running commentary, locked in step with every note.
How it works: They encode the exact RPM (speed) and Torque (force) into two extra audio channels that are part of the file. Play the first two channels and it sounds like a car. Read the other two in software and you know the exact operating conditions of the engine at every single audio sample. That is what "sample-accurate ground truth" means.

4. The Result: A Massive Library

Using this method, they took a few minutes of real recordings and expanded them into a massive library called the Procedural Engine Sounds Dataset.

The Scale: They turned 5–10 minutes of source material per vehicle into 19 hours of new, clean audio (5,935 files).
The Variety: They didn't just copy-paste; they mixed and matched the "singers" and "noise" to create thousands of different driving scenarios, from idling at a stoplight to screaming down a highway.

Why Does This Matter?

Think of this as giving AI a gym with infinite weights.

Before: Researchers had to train AI on a few messy, expensive recordings. The AI would get confused or memorize the specific car it was trained on.
Now: Researchers can train AI on this huge, clean, perfectly labeled dataset. The AI learns the rules of how engines sound, not just the specific sound of one car.

In short: They built a system that can take a tiny, imperfect recording of a car engine, understand its DNA, and then grow a massive, perfect forest of engine sounds, all while hiding the "instruction manual" (the speed and force data) directly inside the audio file. This allows engineers to build better virtual cars, diagnose engine problems automatically, and create realistic sounds for movies and games without needing to drive around recording for years.

Here is a detailed technical summary of the paper "Analysis-Driven Procedural Generation of an Engine Sound Dataset with Embedded Control Annotations."

1. Problem Statement

The automotive audio industry (active sound design, virtual prototyping, NVH control) and emerging data-driven synthesis methods require large-scale, standardized engine audio datasets. However, existing resources face three critical limitations:

Scarcity & Cost: High-quality, clean recordings are expensive to acquire and often contaminated by environmental/mechanical noise.
Lack of Precision: Existing public datasets typically offer only coarse or missing temporal annotations for operating parameters (RPM, torque), making them unsuitable for precise machine learning tasks.
Inflexibility: Real-world recordings cannot be systematically augmented or modified under controlled conditions, limiting the ability to test algorithms across specific scenarios.

2. Methodology

The authors propose an analysis-driven procedural synthesis framework that converts limited real-world recordings into a massive, perfectly annotated synthetic dataset. The pipeline consists of three core components:

A. Spectral Analysis Pipeline (Feature Extraction)

The system extracts acoustic fingerprints from real recordings (sampled at 16 kHz) using a specialized spectral analysis approach:

Pitch-Adaptive Preprocessing: Audio frames are resampled using cubic spline interpolation based on instantaneous RPM. This "warps" the time axis to stabilize the fundamental frequency ($f_0$), preventing harmonic drift during analysis.
Frequency-Aligned FFT: To minimize spectral leakage, the FFT size is dynamically calculated so that frequency bins align perfectly with expected engine orders (harmonics of the crankshaft rotation).
Centroid-Based Harmonic Tracking: Instead of simple peak picking, the system calculates spectral centroids within regions bounded by adjacent harmonics. This extracts:
- Harmonic Deviations ($\delta_h$): Quantifying inharmonicity (frequency shifts) caused by mechanical coupling or combustion irregularities.
- Magnitude Envelopes ($\hat{M}_h$): The amplitude of each harmonic as a function of RPM and torque.
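
The three analysis steps above can be sketched in Python. This is a minimal illustration, not the authors' implementation. It assumes a per-sample RPM track is available, reads "pitch-adaptive resampling" as resampling onto a uniform crank-angle grid, and uses half-order bounds for the centroid regions. The function names, `samples_per_rev`, `n_orders`, and the ratio form of the deviation are all hypothetical choices.

```python
import numpy as np
from scipy.interpolate import CubicSpline

def to_order_domain(x, rpm, fs, samples_per_rev=64):
    """Resample x so every crank revolution spans the same number of samples.

    Assumes rpm > 0 everywhere and that x has no content above
    samples_per_rev / 2 orders (no anti-alias filter is applied here).
    """
    # Cumulative crank angle in revolutions (integral of rpm / 60 over time).
    revs = np.cumsum(rpm / 60.0) / fs
    revs -= revs[0]
    n_revs = int(np.floor(revs[-1]))            # keep whole revolutions only
    grid = np.arange(n_revs * samples_per_rev) / samples_per_rev
    # Cubic-spline interpolation of the waveform onto the uniform angle grid.
    return CubicSpline(revs, x)(grid), n_revs

def harmonic_centroids(y, n_revs, n_orders=8):
    """Return (deviation ratio, magnitude) for integer orders 1..n_orders."""
    window = np.hanning(len(y))
    mag = 2.0 * np.abs(np.fft.rfft(y * window)) / window.sum()
    # The FFT length is a whole number of revolutions, so bin k sits at
    # order k / n_revs and every integer order falls exactly on a bin.
    orders = np.arange(len(mag)) / n_revs
    deviations, magnitudes = [], []
    for h in range(1, n_orders + 1):
        region = (orders >= h - 0.5) & (orders < h + 0.5)   # assumed bounds
        power = mag[region] ** 2
        centroid = np.sum(orders[region] * power) / (np.sum(power) + 1e-12)
        deviations.append(centroid / h)          # 1.0 means perfectly harmonic
        magnitudes.append(mag[region].max())
    return np.array(deviations), np.array(magnitudes)
```

Calling `to_order_domain` on a recorded frame and passing its two return values to `harmonic_centroids` gives one deviation and one magnitude per order. Deviation values for orders that carry almost no energy are not meaningful and would need to be masked in practice.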

B. Parametric Synthesis Model

The extracted parameters drive a Harmonic-Plus-Noise (H+N) synthesizer with resonator modeling:

Additive Synthesis: 128 sine-wave oscillators generate the harmonic structure. Frequencies are modulated by the extracted inharmonicity factors ($\delta_h$) and amplitudes by the magnitude envelopes.
Noise Synthesis:
- Turbulence: Pink noise amplitude-modulated onto the harmonic sum to simulate combustion pressure fluctuations.
- Bursts: Filtered white noise modulated by low-order harmonic envelopes to simulate valve events and intake resonances.
Resonator Modeling: A bank of parallel feedback delay networks simulates exhaust system resonances, adding timbral realism and variability.
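
The harmonic and turbulence parts of this model can be sketched as follows. This is a simplified illustration under several assumptions: magnitudes are constant per harmonic rather than functions of RPM and torque, the pink noise modulates the amplitude of the harmonic sum (one possible reading of the description above), and the burst noise and feedback delay networks are omitted. The names `pink_noise`, `synthesize`, and `noise_depth` are hypothetical.

```python
import numpy as np

def pink_noise(n, rng):
    """Approximate pink (1/f power) noise by shaping white noise in the FFT domain."""
    spectrum = np.fft.rfft(rng.standard_normal(n))
    freqs = np.fft.rfftfreq(n)
    freqs[0] = freqs[1]                       # avoid division by zero at DC
    pink = np.fft.irfft(spectrum / np.sqrt(freqs), n)
    return pink / np.max(np.abs(pink))        # normalize to [-1, 1]

def synthesize(rpm, fs, magnitudes, deviations, noise_depth=0.2, seed=0):
    """Additive harmonic synthesis with pink-noise amplitude modulation.

    rpm        : per-sample engine speed
    magnitudes : amplitude of each harmonic (index 0 is order 1)
    deviations : inharmonicity ratio of each harmonic (1.0 = exact multiple)
    """
    rng = np.random.default_rng(seed)
    f0 = np.asarray(rpm, dtype=np.float64) / 60.0   # crankshaft frequency in Hz
    harmonic = np.zeros(len(f0))
    for h, (m, d) in enumerate(zip(magnitudes, deviations), start=1):
        freq = h * d * f0                            # instantaneous frequency
        phase = 2.0 * np.pi * np.cumsum(freq) / fs   # integrate to get phase
        gain = np.where(freq < fs / 2, m, 0.0)       # mute partials above Nyquist
        harmonic += gain * np.sin(phase)
    turbulence = 1.0 + noise_depth * pink_noise(len(f0), rng)
    return harmonic * turbulence
```

With 128 entries in `magnitudes` and `deviations`, the loop corresponds to the 128-oscillator bank described above. Integrating the frequency into a phase, rather than computing `sin(2πft)` directly, keeps the waveform continuous while the RPM changes.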

C. Synchronized Multi-Channel Encoding

A unique feature of this framework is the embedded control annotation. The system generates a 4-channel 48 kHz audio file:

Channels 1–2: Stereo engine audio.
Channels 3–4: Encoded control parameters (RPM and Torque) normalized to [-1, 1] and stored as 16-bit audio data.
Benefit: This allows for sample-accurate ground truth reconstruction directly from the audio stream without external metadata files.
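
A minimal sketch of this encoding and its inverse is shown below. The normalization bounds `RPM_RANGE` and `TORQUE_RANGE` are assumptions chosen to cover the dataset's reported ranges; the summary does not state the actual bounds, and the function names are hypothetical.

```python
import numpy as np

# Assumed normalization bounds (not taken from the paper).
RPM_RANGE = (0.0, 8000.0)
TORQUE_RANGE = (-200.0, 800.0)

def encode(values, lo, hi):
    """Map physical values in [lo, hi] to int16 samples spanning [-1, 1]."""
    norm = 2.0 * (np.asarray(values, dtype=np.float64) - lo) / (hi - lo) - 1.0
    return np.round(np.clip(norm, -1.0, 1.0) * 32767).astype(np.int16)

def decode(samples, lo, hi):
    """Invert encode(): int16 samples back to physical units."""
    norm = np.asarray(samples, dtype=np.float64) / 32767
    return (norm + 1.0) * 0.5 * (hi - lo) + lo

def pack(left, right, rpm, torque):
    """Build an (n_samples, 4) int16 array: stereo audio plus two control channels."""
    stereo = np.clip(np.stack([left, right], axis=1), -1.0, 1.0)
    audio = np.round(stereo * 32767).astype(np.int16)
    control = np.stack(
        [encode(rpm, *RPM_RANGE), encode(torque, *TORQUE_RANGE)], axis=1
    )
    return np.concatenate([audio, control], axis=1)
```

Reading the labels back is then `decode(frames[:, 2], *RPM_RANGE)` and `decode(frames[:, 3], *TORQUE_RANGE)`. Under the assumed 8,000 RPM span, one 16-bit step corresponds to roughly 0.12 RPM, so the round trip is exact only up to that quantization.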

3. Key Contributions

The Framework: A novel signal-processing pipeline that extracts order-dependent harmonic deviations and magnitude envelopes to drive a controllable, realistic synthesizer.
The Dataset (Procedural Engine Sounds Dataset):
- Scale: 19.0 hours of audio comprising 5,935 files (24.5 GB).
- Coverage: Derived from 4 vehicle configurations, covering RPM from 0 to 7,007 and torque from -107 to 718 Nm across diverse driving scenarios (acceleration, cruising, idle, gear shifts).
- Augmentation: Achieves a 15–30x data expansion from minimal source material (5–10 minutes per vehicle).
- Annotation: All files contain sample-accurate, embedded RPM and torque labels.
Validation: Demonstrated that the synthetic data preserves characteristic engine-order structures and is suitable for training deep learning models.
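
The dataset figures listed above can be cross-checked with a little arithmetic. The sketch below assumes the files are stored as uncompressed 16-bit PCM with 4 channels at 48 kHz, which matches the format described earlier but is not confirmed as the on-disk format.

```python
hours, files = 19.0, 5935
channels, fs, bytes_per_sample = 4, 48_000, 2   # assumed uncompressed 16-bit PCM

seconds = hours * 3600
mean_duration = seconds / files                  # average clip length in seconds
raw_bytes = seconds * channels * fs * bytes_per_sample

print(f"mean clip length: {mean_duration:.2f} s")
print(f"raw size: {raw_bytes / 1e9:.2f} GB ({raw_bytes / 2**30:.2f} GiB)")
```

Under these assumptions the average clip is about 11.5 seconds long, and the raw audio comes to about 26.3 GB in decimal units or about 24.5 GiB in binary units. The latter is close to the quoted 24.5 GB if that figure was reported in binary gigabytes.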

4. Results & Validation

Acoustic Authenticity: Comparisons between real recordings and synthetic outputs (Figure 1) show high coherence in engine-order magnitude distributions (e.g., dominant 4th order for V8 firing). The system successfully reproduces essential acoustic descriptors while allowing parametric variation in higher orders for diversity.
Machine Learning Suitability: A baseline differentiable neural network (1.4M parameters) was trained to reconstruct audio solely from RPM and torque inputs.
- The model achieved stable convergence with minimal train-validation gap.
- Successful reconstruction indicates that the embedded annotations carry enough information to learn the mapping from operating state to sound.
- The dataset supports progressive benchmarking through intentional parametric modifications in resonator and noise characteristics.

5. Significance

This work addresses a critical bottleneck in automotive audio research by providing a scalable, clean, and perfectly annotated dataset that would be prohibitively expensive to obtain via traditional recording methods.

For Research: It enables the development of inverse parameter estimation models (predicting RPM/torque from audio for NVH diagnostics) and data-driven synthesis systems without manual tuning.
For Industry: It offers a cost-effective method to generate vast training corpora for active sound design and virtual prototyping.
Reproducibility: The framework allows researchers to apply the same analysis-synthesis pipeline to their own limited recordings to generate task-specific datasets.

The dataset is publicly available via Zenodo and Hugging Face, supporting future advancements in engine timbre analysis and neural generative networks.