Multi-Stage Music Source Restoration with BandSplit-RoFormer Separation and HiFi++ GAN

This paper presents the CP-JKU team's two-stage system for the ICASSP 2025 Music Source Restoration Challenge, which combines a curriculum-trained BandSplit-RoFormer model for separating eight stems and a specialized HiFi++ GAN for restoring instrument-specific waveforms from mastered audio.

Tobias Morocutti, Emmanouil Karystinaios, Jonathan Greif, Gerhard Widmer

Published 2026-03-05

Imagine you have a delicious, complex stew that someone has already cooked, seasoned, and served in a bowl. Now, imagine you want to take that stew apart and get back the original, raw ingredients (the carrots, the beef, the herbs) exactly as they were before they were cooked.

That is essentially what Music Source Restoration (MSR) tries to do. But instead of a stew, it's a finished song. And instead of just separating ingredients, it has to undo all the "cooking" (mixing, mastering, compression, and digital glitches) that happened to the song.

Here is how the team at Johannes Kepler University (CP-JKU) solved this puzzle for the 2025 challenge, explained simply:

The Big Problem: The "Over-Processed" Song

In the music industry, songs aren't just a simple mix of instruments playing together. They go through a "production pipeline" that adds effects like reverb, dynamic range compression (squashing the volume so loud and quiet parts sit closer together), and distortion. On top of that, they get encoded for streaming in lossy formats like MP3, which introduces its own digital artifacts.

Trying to separate these instruments is like trying to un-bake a cake to get the raw eggs and flour back. Standard methods fail because the ingredients have been chemically changed by the "cooking" process.
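To make the "cooking" concrete, here is a toy sketch of what a production pipeline does to a waveform. This is not the challenge's actual degradation chain (which also includes reverb, EQ, and codec artifacts); the function names and parameter values are illustrative only.

```python
import numpy as np

def distort(x, drive=4.0):
    # Soft-clipping distortion: tanh squashes peaks, like an overdriven channel.
    return np.tanh(drive * x) / np.tanh(drive)

def compress(x, threshold=0.5, ratio=4.0):
    # Toy peak compressor: samples above the threshold are scaled down,
    # shrinking the dynamic range the way mastering compression does.
    mag = np.abs(x)
    over = mag > threshold
    out = x.copy()
    out[over] = np.sign(x[over]) * (threshold + (mag[over] - threshold) / ratio)
    return out

def degrade(clean):
    # A miniature "production pipeline": compress, then distort.
    return distort(compress(clean))

# A clean sine "stem" and its degraded counterpart form one training pair:
# the model sees `noisy` and must recover `clean`.
t = np.linspace(0, 1, 16000, endpoint=False)
clean = 0.9 * np.sin(2 * np.pi * 440 * t)
noisy = degrade(clean)
```

Pairs like `(noisy, clean)` are exactly what both stages of the system train on: degraded audio in, pristine audio out.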

The Solution: A Two-Stage Assembly Line

The team built a robot factory with two distinct stations to handle this messy job.

Stage 1: The "Rough Slicer" (Separation)

First, they need to cut the finished song into eight rough piles: Vocals, Guitar, Drums, Bass, Keyboards, Synthesizers, Percussion, and Orchestra.

  • The Tool: They used a smart AI called BandSplit-RoFormer. Think of this AI as a master chef who looks at the frequency "flavors" of the song and tries to guess which ingredient belongs to which pile.
  • The Training Trick (The Curriculum): You can't just throw the AI into the deep end. They taught it in three steps:
    1. Warm-up: They started with a simple version of the AI that only knew how to separate 4 things (Vocals, Drums, Bass, and "Everything else") using clean, perfect recordings.
    2. Practice: They then taught it to handle "messy" recordings that had been processed and degraded, just like real songs.
    3. The Expansion: Finally, they gave the AI a "head transplant." They kept the brain it had learned and added new "heads" (specialized outputs) to handle the other 4 instruments (Guitar, Keys, etc.). This allowed them to teach the AI to handle 8 instruments without forgetting how to handle the first 4.
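The "head transplant" in step 3 can be sketched in a few lines. This is a deliberately tiny stand-in for the BandSplit-RoFormer (a shared trunk plus one linear "head" per stem), not the real architecture; the stem names and shapes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

class Separator:
    """Toy separator: a shared trunk plus one linear 'head' per output stem."""
    def __init__(self, feat=16, stems=("vocals", "drums", "bass", "other")):
        self.trunk = rng.normal(size=(feat, feat))   # shared features, learned in steps 1-2
        self.heads = {s: rng.normal(size=(feat, feat)) for s in stems}

    def expand(self, new_stems):
        # Curriculum step 3: keep the trained trunk and existing heads
        # untouched, and bolt on freshly initialized heads for new stems.
        for s in new_stems:
            self.heads[s] = rng.normal(size=self.trunk.shape)

model = Separator()
old_trunk = model.trunk.copy()
model.expand(("guitar", "keys", "synth", "percussion"))
```

Because the trunk and the original four heads are untouched, the model keeps everything it learned in steps 1 and 2 while gaining outputs for the new instruments.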

Result: The AI now produces 8 rough, slightly imperfect stems. They are separated, but they still sound a bit "muddy" or damaged because the AI made some mistakes.

Stage 2: The "Polishing Station" (Restoration)

Now that the AI has 8 rough piles, they need to clean them up and make them sound like the original, pristine recordings.

  • The Tool: They used a HiFi++ GAN. Imagine a high-end photo editor that doesn't just fix a blurry photo, but imagines what the sharp details should look like and paints them in.
  • The "Generalist to Expert" Strategy:
    1. The Generalist: First, they trained one super-AI to clean up any instrument. It learned to remove noise, fix distortion, and sharpen the sound.
    2. The Specialists: Then, they took that Generalist and created 8 "Experts." One Expert only knows how to fix Vocals, another only Drums, etc.
    3. The Secret Sauce: To make these Experts effective, they didn't train them on cleanly degraded songs. They trained them on the rough, messy outputs from Stage 1. This is like training a repair specialist on parts that came off one specific assembly line, so they learn to fix exactly the kind of damage that line produces, not damage in general.
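The generalist-to-expert recipe can be shown with a deliberately tiny model. Here the "restorer" is just one learned gain per frequency bin, a crude stand-in for HiFi++, and the data is synthetic; everything below is an illustrative assumption, not the paper's training code.

```python
import copy
import numpy as np

rng = np.random.default_rng(0)

class Restorer:
    """Toy enhancer: one learned gain per frequency bin."""
    def __init__(self, bins=8):
        self.gain = np.ones(bins)

    def fit(self, degraded, clean, lr=0.5, steps=200):
        # Gradient descent on the per-bin least-squares error,
        # nudging gain so that gain * degraded approximates clean.
        for _ in range(steps):
            err = self.gain * degraded - clean
            self.gain -= lr * np.mean(err * degraded, axis=0)

    def __call__(self, x):
        return self.gain * x

# 1) Generalist: trained on pooled (degraded, clean) pairs from all stems.
clean = rng.normal(size=(256, 8))
degraded = 0.5 * clean + 0.05 * rng.normal(size=clean.shape)
generalist = Restorer()
generalist.fit(degraded, clean)

# 2) Specialist: start from the generalist's weights, then fine-tune on
#    one stem's Stage-1 outputs, whose damage is slightly different.
vocals_clean = rng.normal(size=(256, 8))
vocals_degraded = 0.4 * vocals_clean + 0.05 * rng.normal(size=vocals_clean.shape)
vocal_expert = copy.deepcopy(generalist)
vocal_expert.fit(vocals_degraded, vocals_clean)
```

The key move is `copy.deepcopy(generalist)`: each expert inherits everything the generalist learned and only then specializes on its own stem's specific damage.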

The Results

When they tested this system, it worked surprisingly well.

  • The Score: It achieved high marks in separating the instruments and making them sound natural.
  • The Human Test: When humans listened to the results, they rated the quality highly: a Mean Opinion Score (MOS) of 3.55 out of 5, which is very good for AI-restored audio.

The Limitations (Where it still struggles)

Even with this fancy two-stage factory, the system isn't perfect yet.

  • Noisy Recordings: If the original song is very old, recorded in a live concert, or has a lot of background noise, the "Rough Slicer" gets confused. If the first stage fails to separate the ingredients correctly, the "Polishing Station" can't fix it.
  • The "Dry" Target: Sometimes, the original song had effects like reverb (echo) or delay. The AI is asked to remove "production effects," but it's hard to know if an echo was a mistake or part of the artist's vision. The AI sometimes accidentally removes cool artistic effects, making the song sound too "dry" or flat.

Summary

The CP-JKU team solved the problem by breaking it down:

  1. Separate first: Use a smart AI to split the song into 8 messy piles.
  2. Clean second: Use specialized AI experts to polish each pile, trained specifically to fix the mistakes the first AI makes.

It's a bit like hiring a team to first sort a pile of mixed laundry into 8 baskets, and then hiring 8 different ironing experts to fix the wrinkles in each specific type of fabric, knowing exactly how the sorting machine messed them up.