Noise-to-Notes: Diffusion-based Generation and Refinement for Automatic Drum Transcription

This paper introduces Noise-to-Notes (N2N), a diffusion-based framework that reframes automatic drum transcription as a conditional generative task. By combining an Annealed Pseudo-Huber loss for jointly learning onsets and velocities with features from a music foundation model, it achieves state-of-the-art robustness and performance across multiple benchmarks.

Michael Yeung, Keisuke Toyama, Toya Teramoto, Shusuke Takahashi, Tamaki Kojima

Published 2026-03-06

Imagine you are sitting in a recording studio, listening to a complex drum solo. Your job is to write down exactly which drum was hit, when it was hit, and how hard it was hit. This is what Automatic Drum Transcription (ADT) tries to do for computers.

For a long time, computers tried to solve this like a detective. They would look at the sound wave (the "clues") and try to guess, "Is this a snare? Is this a kick?" This is called a discriminative approach. It works okay, but it often gets confused when the drums sound different from the ones it was trained on, or when the recording is messy.

This paper introduces a new way to think about the problem, called Noise-to-Notes (N2N). Instead of being a detective, the computer becomes an artist who paints a picture based on a vague sketch.

Here is how it works, broken down into simple concepts:

1. The "Denoising" Artist (Diffusion Models)

Think of a beautiful drum transcription (a sheet of music) as a clear, high-definition photo.

  • The Old Way: The computer looks at a photo and tries to guess what it is.
  • The New Way (N2N): Imagine taking that clear photo and slowly turning it into static noise (like TV snow) until you can't see anything.
    • The N2N model is trained to do the reverse. You give it a screen full of "TV snow" (random noise) and a recording of the drums.
    • The model acts like an artist looking at the snow and the audio, slowly "cleaning" the noise, step-by-step, until a clear musical score emerges from the chaos.
    • Why is this cool? Because it's a generative process, the model can "fill in the blanks." If you cut out a chunk of the audio (like a missing drum fill), the model can look at the surrounding context and imagine what the missing drums should have been, just like an artist finishing a sketch.
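To make the "artist" idea concrete, here is a toy sketch of a reverse-diffusion loop in NumPy. The denoiser here simply "knows" the clean answer; in the real N2N model, a trained network predicts it from the noisy grid plus the audio recording. Everything in this snippet (the grid shape, the blending schedule) is illustrative, not the paper's actual formulation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "transcription": a binary onset grid (time steps x drum classes).
clean = np.zeros((8, 3))
clean[[0, 2, 4, 6], 0] = 1.0   # kick on every other step
clean[[2, 6], 1] = 1.0          # snare on the backbeats

def toy_denoiser(x_t, t):
    # Stand-in for the trained network: here it "knows" the clean score.
    # A real model would predict this from the noisy grid plus the audio.
    return clean

# Reverse process: start from pure noise ("TV snow") and clean it step by step.
steps = 10
x = rng.standard_normal(clean.shape)
for t in reversed(range(steps)):
    x0_hat = toy_denoiser(x, t)            # model's guess of the clean score
    alpha = t / steps                      # remaining noise level
    x = alpha * x + (1 - alpha) * x0_hat   # blend toward the estimate

transcription = (x > 0.5).astype(float)    # threshold back to onset events
```

Because the process is generative, masking part of the audio would simply give the denoiser less conditioning information, and it would "imagine" the missing region from context rather than failing outright.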

2. The "Goldilocks" Math Problem (Annealed Pseudo-Huber Loss)

The computer has to predict two things at once:

  1. Onset: Did the drum hit? (Yes/No)
  2. Velocity: How hard was it hit? (Soft to Loud)

This is tricky. With a standard loss function, the model tends to obsess over getting the "Yes/No" onsets right while neglecting the velocity. It's like a student who memorizes the dates of battles but forgets why they happened.

The authors invented a special math tool called Annealed Pseudo-Huber Loss.

  • The Analogy: Imagine you are teaching a child to draw.
    • At the beginning of training, you are very strict about the big shapes (the "Yes/No" hits).
    • As the child gets better, you gradually shift your focus to the details (the "How hard" velocity).
    • This "Annealed" (slowly changing) approach helps the model master both the big picture and the fine details without getting confused.
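The "strict about big shapes, then focus on details" idea can be sketched with a pseudo-Huber loss whose width parameter `c` shrinks over training. The geometric schedule and the constants below are illustrative assumptions, not the paper's exact recipe:

```python
import numpy as np

def pseudo_huber(error, c):
    # Smoothly interpolates between L2 behavior (|error| << c)
    # and L1 behavior (|error| >> c).
    return c**2 * (np.sqrt(1.0 + (error / c)**2) - 1.0)

def annealed_c(step, total_steps, c_start=1.0, c_end=0.01):
    # Shrink c geometrically over training: early = broad, forgiving,
    # L2-like strokes; late = sharp, L1-like attention to fine details.
    frac = step / total_steps
    return c_start * (c_end / c_start) ** frac

errors = np.array([0.05, 0.5, 2.0])        # onset/velocity prediction errors
early = pseudo_huber(errors, annealed_c(0, 100))    # start of training
late = pseudo_huber(errors, annealed_c(100, 100))   # end of training
```

Early in training the wide `c` makes the loss nearly quadratic, so big structural mistakes dominate the gradient; as `c` shrinks, small velocity errors stop being flattened away and start to matter.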

3. The "Super-Brain" Helper (Music Foundation Models)

Usually, computers look at sound waves like a flat map (a spectrogram). It's like looking at a drum kit from directly above; you can see the shapes, but you can't feel the texture or the "vibe."

The authors added a Music Foundation Model (MFM) to the mix.

  • The Analogy: Think of the standard sound map as a black-and-white photo. The MFM is like a 3D hologram that understands the "meaning" of the sound.
  • It knows that a snare drum in a jazz club sounds different from a snare drum in a rock stadium, even if the raw sound waves look similar.
  • By giving the computer this "super-brain" helper, it becomes much better at recognizing drums in songs it has never heard before (out-of-domain data).
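One simple way to give the model this "super-brain" helper is to align the foundation-model embeddings in time with the spectrogram and concatenate them as conditioning. All shapes and the nearest-neighbor upsampling below are hypothetical; the paper's actual fusion mechanism may differ:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical shapes: 100 time frames, 128 mel bins, 768-dim MFM embeddings.
T, MELS, EMB = 100, 128, 768

spectrogram = rng.random((T, MELS))      # the flat "black-and-white map"
mfm_features = rng.random((25, EMB))     # foundation-model features (coarser frame rate)

# MFMs often run at a lower frame rate, so upsample to align in time.
idx = np.linspace(0, len(mfm_features) - 1, T).round().astype(int)
mfm_aligned = mfm_features[idx]

# Fuse: concatenate along the feature axis to condition the denoiser.
conditioning = np.concatenate([spectrogram, mfm_aligned], axis=1)
```

The spectrogram carries the precise timing ("where" a hit is), while the MFM embedding carries semantic context ("what kind" of drum and recording it is), which is what helps on out-of-domain songs.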

4. The Results: Speed vs. Accuracy

Because this is a "step-by-step" cleaning process, it takes a little longer than the old "detective" methods.

  • 1 Step: Fast, but a bit rough (like a quick sketch).
  • 10 Steps: Slower, but incredibly accurate (like a finished painting).
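The speed/accuracy trade-off falls straight out of the loop length: each extra denoising pass gives an imperfect model another chance to correct its earlier mistakes. A toy illustration (the 0.5 blending weight and noise level are arbitrary choices, not the paper's sampler):

```python
import numpy as np

rng = np.random.default_rng(0)
target = np.array([1.0, 0.0, 1.0, 0.0])   # toy "clean" onset pattern

def imperfect_denoiser(x):
    # Stand-in for the trained model: every estimate is a bit noisy.
    return target + 0.3 * rng.standard_normal(target.shape)

def sample(steps):
    # Reverse process: each pass blends the current guess with a fresh
    # estimate, so later passes average out earlier errors.
    x = rng.standard_normal(target.shape)
    for _ in range(steps):
        x = 0.5 * x + 0.5 * imperfect_denoiser(x)
    return x

# Average error over many runs: more steps -> closer to the target
# (the "finished painting" vs the "quick sketch").
err_1 = np.mean([np.abs(sample(1) - target).mean() for _ in range(200)])
err_10 = np.mean([np.abs(sample(10) - target).mean() for _ in range(200)])
```

One step leaves a lot of the starting noise in the answer; ten steps wash almost all of it out, at ten times the compute cost.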

The paper shows that even with just a few steps, N2N beats the best existing methods. It is the first time a "generative" model (the artist) has beaten a "discriminative" model (the detective) in this specific field.

Summary

Noise-to-Notes is a new system that turns drum transcription into an art project. Instead of just guessing what a drum hit is, it starts with random noise and slowly sculpts the perfect drum score based on the audio. By using a special math trick to balance accuracy and a "super-brain" helper to understand different drum styles, it creates the most accurate drum transcriptions we've ever seen, and it can even fill in missing parts of a song like a magic trick.