A Survey on Music Generation from Single-Modal, Cross-Modal, and Multi-Modal Perspectives

This paper provides a comprehensive survey of music generation research by categorizing systems across single-modal, cross-modal, and multi-modal perspectives, while examining key aspects such as representation, data alignment, datasets, evaluation methods, current challenges, and future directions.

Shuyu Li, Shulei Ji, Zihao Wang, Songruoyao Wu, Jiaxing Yu, Kejun Zhang

Published 2026-03-09

Imagine you are a chef trying to cook a perfect meal. In the past, chefs only had one recipe book (text) or one set of raw ingredients (audio) to work with. They could make a decent dish, but it was hard to capture the exact mood, the visual beauty, or the specific story they wanted to tell.

This paper is like a comprehensive cookbook review for a new generation of "AI Chefs." These AI chefs are learning to cook music not just from a recipe, but by looking at a picture of the dish, watching a video of the cooking process, reading a story about the flavor, and listening to an audio sample, all at the same time.

Here is a breakdown of the paper's journey, explained simply:

1. The Evolution: From Solo to Symphony

  • The Solo Act (Single-Modal): Imagine a musician playing a piano. If they only listen to a recording of a piano and try to play a new song based on it, that's Single-Modal. It's like trying to paint a picture using only a single color. It works, but it's limited.
  • The Duet (Cross-Modal): Now, imagine the musician is given a poem and asked to write a song that fits the poem's mood. Or they are shown a picture of a storm and asked to compose thunderous music. This is Cross-Modal. The AI is translating one language (words or images) into another (music).
  • The Full Orchestra (Multi-Modal): This is the paper's main focus. Imagine the AI is given a video of a dancing couple, a text description saying "romantic but sad," and a sketch of a sunset. It has to combine all these clues to create a song that fits the dance, matches the words, and captures the sunset's colors. This is Multi-Modal Music Generation.

2. The Ingredients (Representations)

To cook this music, the AI needs to understand different "languages":

  • Audio (The Sound): This is the raw sound waves. It's like the actual sizzling of food. It's rich but messy and hard for computers to digest directly, so they compress it into "tokens" (like digital LEGO bricks).
  • Symbolic Music (The Sheet Music): This is the structured notes (MIDI, piano rolls). It's like a recipe card with exact measurements. It's precise but lacks the "soul" or the raw sound.
  • Text (The Story): This is the description. "A happy jazz song" or "Lyrics about heartbreak." The AI uses models (like the brains behind chatbots) to understand these words.
  • Images & Video (The Visuals): This is the mood board. A picture of a rainy street or a video of a dancer. The AI has to figure out how a visual "sadness" translates into a musical "minor key."
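To make the "sheet music" idea concrete, here is a minimal sketch of how symbolic notes can be laid out as a piano-roll grid. The note format is a toy one invented for illustration (real MIDI also carries velocity, tempo, and channel information), and no actual MIDI library is used:

```python
import numpy as np

# Toy symbolic representation: each note is (pitch, start_step, duration_steps).
# Pitches are MIDI note numbers (60 = middle C).
notes = [(60, 0, 4), (64, 4, 4), (67, 8, 4), (72, 12, 4)]  # C-E-G-C arpeggio

def to_piano_roll(notes, n_pitches=128, n_steps=16):
    """Convert note events into a binary piano-roll matrix (pitch x time)."""
    roll = np.zeros((n_pitches, n_steps), dtype=np.int8)
    for pitch, start, dur in notes:
        roll[pitch, start:start + dur] = 1
    return roll

roll = to_piano_roll(notes)
print(roll.shape)       # (128, 16)
print(roll[60, :4])     # middle C held for the first 4 steps: [1 1 1 1]
```

This "recipe card" precision is exactly what the paper means: every note's pitch and timing is explicit, but the matrix says nothing about timbre or the "soul" of the raw sound.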

3. The Kitchen Tools (Models & Methods)

The paper reviews the tools these AI chefs use:

  • The Bridge Builders: Since "text" and "sound" speak different languages, the AI uses special bridges (like CLAP or MuLan) to translate them into a shared space where they can understand each other.
  • The Generators:
    • Autoregressive Models: Like writing a sentence one word at a time, predicting the next note based on the previous one.
    • Diffusion Models: Imagine starting with a cloud of static noise and slowly sculpting it into a clear melody, refining it step-by-step until it sounds perfect.
    • Transformers: The "super-brains" that look at the whole picture (or video) at once to understand the context.
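The "shared space" idea behind bridges like CLAP and MuLan can be sketched with toy numbers. The tiny hand-written vectors below are purely illustrative stand-ins for what trained text and audio encoders would actually produce; the point is only that matching text and audio land close together, so cosine similarity can pair them up:

```python
import numpy as np

# Hypothetical 3-D embeddings (real systems use hundreds of dimensions,
# learned by separate text and audio encoders trained on matched pairs).
text_embeddings = {
    "a happy jazz song":     np.array([0.9, 0.1, 0.2]),
    "slow sad piano ballad": np.array([0.1, 0.9, 0.1]),
}
audio_embedding = np.array([0.85, 0.15, 0.25])  # pretend encoder output for a clip

def cosine_similarity(a, b):
    """Angle-based similarity in [-1, 1]; higher means closer in the shared space."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# The caption whose embedding is closest to the audio clip "wins".
best = max(text_embeddings,
           key=lambda t: cosine_similarity(text_embeddings[t], audio_embedding))
print(best)  # "a happy jazz song"
```

Once text and audio live in one space like this, a generator can be steered by either modality interchangeably, which is what makes cross-modal conditioning possible.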

4. The Pantry (Datasets)

You can't cook without ingredients. The paper points out a major problem: We don't have enough high-quality, multi-ingredient recipes.

  • Most existing data is just "Song + Lyrics" or "Video + Music."
  • We are missing massive libraries where a video, a text description, a sketch, and the music are all perfectly synced.
  • The authors suggest we need to build bigger, better "pantries" (datasets) and find clever ways to use existing single-ingredient data to train these complex models.

5. The Taste Test (Evaluation)

How do we know if the AI's music is good?

  • The Robot Judge (Objective Metrics): Computers measure things like "Does the rhythm match the video?" or "Is the pitch distribution similar to real music?" It's like a machine measuring the temperature and weight of a cake.
  • The Human Judge (Subjective Metrics): Ultimately, music is art. Humans have to listen and say, "Does this make me feel sad?" or "Does this sound like a real band?" The paper notes that we need better ways to combine these human feelings with robot measurements.
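One common family of objective metrics compares the pitch-class usage of generated music against real music. The sketch below is a simplified, hypothetical version of that idea (a 12-bin histogram plus an overlap score), not the exact metric of any particular paper:

```python
import numpy as np

def pitch_class_histogram(pitches):
    """Normalised 12-bin histogram over pitch classes (C, C#, ..., B)."""
    hist = np.zeros(12)
    for p in pitches:
        hist[p % 12] += 1
    return hist / max(hist.sum(), 1)

def histogram_overlap(h1, h2):
    """Similarity in [0, 1]: 1 means identical pitch-class usage."""
    return float(np.minimum(h1, h2).sum())

reference = [60, 62, 64, 65, 67, 69, 71, 72]   # C-major scale
generated = [60, 64, 67, 72, 64, 60, 67, 64]   # C-major arpeggio notes

sim = histogram_overlap(pitch_class_histogram(reference),
                        pitch_class_histogram(generated))
print(round(sim, 2))  # 0.5
```

A score like this is the "machine measuring the temperature of a cake": it confirms the generated notes stay in a plausible key, but it cannot tell you whether the music actually moves a listener, which is why subjective evaluation remains essential.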

6. The Future Menu (Challenges & Directions)

Even though these AI chefs are getting good, they still have a long way to go:

  • Creativity: Currently, the AI is a bit of a copycat. It tends to remix what it's already heard. We need it to be truly creative and invent new sounds.
  • Efficiency: Cooking a symphony takes a long time and a lot of computer power. We need faster ways to generate music so it can happen in real-time (like for a video game).
  • The "Uncanny Valley": The music often sounds "almost" human but has a slight glitch. We need to fix the quality so it sounds professional.
  • Alignment: Sometimes the AI gets the rhythm right but the emotion wrong. We need it to understand the whole picture, not just the parts.

The Bottom Line

This paper is a map for the future of music. It tells us that while we have amazing tools to turn text, images, and videos into music, we are still in the early stages. The goal is to create an AI that doesn't just play notes, but understands the feeling of a sunset, the energy of a dance, and the story in a poem, and weaves them all into a perfect, original song.
