Noise-to-Notes: Diffusion-based Generation and Refinement for Automatic Drum Transcription

This paper introduces Noise-to-Notes (N2N), a diffusion-based framework that reframes automatic drum transcription as a conditional generative task. By combining an Annealed Pseudo-Huber loss for jointly learning onsets and velocities with features from a music foundation model, it achieves state-of-the-art robustness and performance across multiple benchmarks.

Michael Yeung, Keisuke Toyama, Toya Teramoto, Shusuke Takahashi, Tamaki Kojima

Published 2026-03-06

Imagine you are sitting in a recording studio, listening to a complex drum solo. Your job is to write down exactly which drum was hit, when it was hit, and how hard it was hit. This is what Automatic Drum Transcription (ADT) tries to do for computers.

For a long time, computers tried to solve this like a detective. They would look at the sound wave (the "clues") and try to guess, "Is this a snare? Is this a kick?" This is called a discriminative approach. It works okay, but it often gets confused when the drums sound different from the ones it was trained on, or when the recording is messy.

This paper introduces a new way to think about the problem, called Noise-to-Notes (N2N). Instead of being a detective, the computer becomes an artist who paints a picture based on a vague sketch.

Here is how it works, broken down into simple concepts:

1. The "Denoising" Artist (Diffusion Models)

Think of a beautiful drum transcription (a sheet of music) as a clear, high-definition photo.

  • The Old Way: The computer looks at a photo and tries to guess what it is.
  • The New Way (N2N): Imagine taking that clear photo and slowly turning it into static noise (like TV snow) until you can't see anything.
    • The N2N model is trained to do the reverse. You give it a screen full of "TV snow" (random noise) and a recording of the drums.
    • The model acts like an artist looking at the snow and the audio, slowly "cleaning" the noise, step-by-step, until a clear musical score emerges from the chaos.
    • Why is this cool? Because it's a generative process, the model can "fill in the blanks." If you cut out a chunk of the audio (like a missing drum fill), the model can look at the surrounding context and imagine what the missing drums should have been, just like an artist finishing a sketch.
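To make the "artist" idea concrete, here is a toy sketch of a reverse-diffusion loop in NumPy. The denoiser here simply "knows" the clean answer; in the real N2N model, a trained network predicts it from the noisy grid plus the audio recording. Everything in this snippet (the grid shape, the blending schedule) is illustrative, not the paper's actual formulation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "transcription": a binary onset grid (time steps x drum classes).
clean = np.zeros((8, 3))
clean[[0, 2, 4, 6], 0] = 1.0   # kick on every other step
clean[[2, 6], 1] = 1.0          # snare on the backbeats

def toy_denoiser(x_t, t):
    # Stand-in for the trained network: here it "knows" the clean score.
    # A real model would predict this from the noisy grid plus the audio.
    return clean

# Reverse process: start from pure noise ("TV snow") and clean it step by step.
steps = 10
x = rng.standard_normal(clean.shape)
for t in reversed(range(steps)):
    x0_hat = toy_denoiser(x, t)            # model's guess of the clean score
    alpha = t / steps                      # remaining noise level
    x = alpha * x + (1 - alpha) * x0_hat   # blend toward the estimate

transcription = (x > 0.5).astype(float)    # threshold back to onset events
```

Because the process is generative, masking part of the audio would simply give the denoiser less conditioning information, and it would "imagine" the missing region from context rather than failing outright.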

2. The "Goldilocks" Math Problem (Annealed Pseudo-Huber Loss)

The computer has to predict two things at once:

  1. Onset: Did the drum hit? (Yes/No)
  2. Velocity: How hard was it hit? (Soft to Loud)

This is tricky. With a standard loss function, the model tends to obsess over getting the "Yes/No" onsets right while neglecting the velocity. It's like a student who memorizes the dates of battles but forgets why they happened.

The authors invented a special math tool called Annealed Pseudo-Huber Loss.

  • The Analogy: Imagine you are teaching a child to draw.
    • At the beginning of training, you are very strict about the big shapes (the "Yes/No" hits).
    • As the child gets better, you gradually shift your focus to the details (the "How hard" velocity).
    • This "Annealed" (slowly changing) approach helps the model master both the big picture and the fine details without getting confused.
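The "strict about big shapes, then focus on details" idea can be sketched with a pseudo-Huber loss whose width parameter `c` shrinks over training. The geometric schedule and the constants below are illustrative assumptions, not the paper's exact recipe:

```python
import numpy as np

def pseudo_huber(error, c):
    # Smoothly interpolates between L2 behavior (|error| << c)
    # and L1 behavior (|error| >> c).
    return c**2 * (np.sqrt(1.0 + (error / c)**2) - 1.0)

def annealed_c(step, total_steps, c_start=1.0, c_end=0.01):
    # Shrink c geometrically over training: early = broad, forgiving,
    # L2-like strokes; late = sharp, L1-like attention to fine details.
    frac = step / total_steps
    return c_start * (c_end / c_start) ** frac

errors = np.array([0.05, 0.5, 2.0])        # onset/velocity prediction errors
early = pseudo_huber(errors, annealed_c(0, 100))    # start of training
late = pseudo_huber(errors, annealed_c(100, 100))   # end of training
```

Early in training the wide `c` makes the loss nearly quadratic, so big structural mistakes dominate the gradient; as `c` shrinks, small velocity errors stop being flattened away and start to matter.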

3. The "Super-Brain" Helper (Music Foundation Models)

Usually, computers look at sound waves like a flat map (a spectrogram). It's like looking at a drum kit from directly above; you can see the shapes, but you can't feel the texture or the "vibe."

The authors added a Music Foundation Model (MFM) to the mix.

  • The Analogy: Think of the standard sound map as a black-and-white photo. The MFM is like a 3D hologram that understands the "meaning" of the sound.
  • It knows that a snare drum in a jazz club sounds different from a snare drum in a rock stadium, even if the raw sound waves look similar.
  • By giving the computer this "super-brain" helper, it becomes much better at recognizing drums in songs it has never heard before (out-of-domain data).
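One simple way to give the model this "super-brain" helper is to align the foundation-model embeddings in time with the spectrogram and concatenate them as conditioning. All shapes and the nearest-neighbor upsampling below are hypothetical; the paper's actual fusion mechanism may differ:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical shapes: 100 time frames, 128 mel bins, 768-dim MFM embeddings.
T, MELS, EMB = 100, 128, 768

spectrogram = rng.random((T, MELS))      # the flat "black-and-white map"
mfm_features = rng.random((25, EMB))     # foundation-model features (coarser frame rate)

# MFMs often run at a lower frame rate, so upsample to align in time.
idx = np.linspace(0, len(mfm_features) - 1, T).round().astype(int)
mfm_aligned = mfm_features[idx]

# Fuse: concatenate along the feature axis to condition the denoiser.
conditioning = np.concatenate([spectrogram, mfm_aligned], axis=1)
```

The spectrogram carries the precise timing ("where" a hit is), while the MFM embedding carries semantic context ("what kind" of drum and recording it is), which is what helps on out-of-domain songs.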

4. The Results: Speed vs. Accuracy

Because this is a "step-by-step" cleaning process, it takes a little longer than the old "detective" methods.

  • 1 Step: Fast, but a bit rough (like a quick sketch).
  • 10 Steps: Slower, but incredibly accurate (like a finished painting).
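The speed/accuracy trade-off falls straight out of the loop length: each extra denoising pass gives an imperfect model another chance to correct its earlier mistakes. A toy illustration (the 0.5 blending weight and noise level are arbitrary choices, not the paper's sampler):

```python
import numpy as np

rng = np.random.default_rng(0)
target = np.array([1.0, 0.0, 1.0, 0.0])   # toy "clean" onset pattern

def imperfect_denoiser(x):
    # Stand-in for the trained model: every estimate is a bit noisy.
    return target + 0.3 * rng.standard_normal(target.shape)

def sample(steps):
    # Reverse process: each pass blends the current guess with a fresh
    # estimate, so later passes average out earlier errors.
    x = rng.standard_normal(target.shape)
    for _ in range(steps):
        x = 0.5 * x + 0.5 * imperfect_denoiser(x)
    return x

# Average error over many runs: more steps -> closer to the target
# (the "finished painting" vs the "quick sketch").
err_1 = np.mean([np.abs(sample(1) - target).mean() for _ in range(200)])
err_10 = np.mean([np.abs(sample(10) - target).mean() for _ in range(200)])
```

One step leaves a lot of the starting noise in the answer; ten steps wash almost all of it out, at ten times the compute cost.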

The paper shows that even with just a few steps, N2N beats the best existing methods. It is the first time a "generative" model (the artist) has beaten a "discriminative" model (the detective) in this specific field.

Summary

Noise-to-Notes is a new system that turns drum transcription into an art project. Instead of just guessing what a drum hit is, it starts with random noise and slowly sculpts the perfect drum score based on the audio. By using a special math trick to balance accuracy and a "super-brain" helper to understand different drum styles, it creates the most accurate drum transcriptions we've ever seen, and it can even fill in missing parts of a song like a magic trick.