Art2Mus: Artwork-to-Music Generation via Visual Conditioning and Large-Scale Cross-Modal Alignment

This paper introduces Art2Mus, a novel framework and large-scale dataset (ArtSound) that enable direct artwork-to-music generation by mapping visual embeddings to a latent diffusion model without relying on intermediate text descriptions, thereby addressing the limitations of existing image-to-music systems trained on natural photographs or dependent on language-based shortcuts.

Ivan Rinaldi, Matteo Mendula, Nicola Fanelli, Florence Levé, Matteo Testi, Giovanna Castellano, Gennaro Vessio

Published 2026-02-20

Imagine you are standing in front of a beautiful, abstract painting. You feel a certain emotion: maybe it's chaotic, maybe it's calm, maybe it feels like a rainy Tuesday in a jazz club. Now, imagine if you could press a button and instantly hear a song that perfectly captures that exact feeling, without anyone ever having to write down a description of the painting first.

That is the dream behind Art2Mus, a new technology described in this paper.

Here is the story of how they built it, explained without the heavy jargon.

The Problem: The "Translator" Bottleneck

Before this paper, if you wanted to turn a picture into music, computers had to play a game of "Telephone."

  1. Step 1: The computer looked at the painting.
  2. Step 2: It had to ask an AI, "What does this look like?" and get a text description (e.g., "A sad blue landscape with swirling clouds").
  3. Step 3: It took that text and asked a music AI to write a song based on the words (the short sketch after this list walks through the hand-off).
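
In code, that "Telephone" pipeline looks roughly like this. Both functions are hypothetical stand-ins for real models (an image captioner and a text-to-music generator); the point is the lossy hand-off in the middle:

```python
# A toy version of the old two-step pipeline. Everything the caption
# fails to mention is already gone by the time the music model sees it.

def describe_image(image_path: str) -> str:
    """Step 2: a captioning model reduces the painting to a sentence."""
    return "A sad blue landscape with swirling clouds"  # canned toy output

def text_to_music(caption: str) -> str:
    """Step 3: a text-to-music model composes from the words alone."""
    return f"<audio generated from: {caption!r}>"

song = text_to_music(describe_image("starry_night.jpg"))
print(song)
```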

The Flaw: This is like trying to describe a complex symphony to a friend using only emojis. You lose the nuance. When the computer turns a painting into words, it throws away the "vibe," the texture, and the subtle colors that are hard to describe in sentences. It's a shortcut that simplifies the art too much.

Also, most of these systems were trained on photos of cats and cars, not on famous paintings by Van Gogh or Picasso. They didn't understand the "soul" of art.

The Solution: Art2Mus (The Direct Line)

The researchers, led by Ivan Rinaldi and his team, decided to cut out the middleman. They built Art2Mus, a system that connects the painting directly to the music, skipping the text description entirely.

Think of it like this:

  • Old Way: Painting → Translator (Text) → Music.
  • Art2Mus Way: Painting → Music.

It's like the computer is learning to "feel" the painting and hum a tune that matches that feeling, rather than reading a menu and ordering a dish.

The Ingredients: ArtSound

To teach a computer to do this, you need a massive library of examples. You can't just show it one painting and one song; you need thousands.

The team created a new dataset called ArtSound.

  • They gathered 105,884 pairs of famous artworks and music tracks.
  • They used a clever trick to match them: instead of asking humans to judge which song fits which painting, they used AI models that turn images and audio into lists of numbers (embeddings). If a painting's numbers and a track's numbers pointed in a similar direction, meaning the "vibes" matched, they paired them up (a sketch of this matching follows the list).
  • They also added "captions" (descriptions) to help check the quality, but the final music generator doesn't actually read them. It learns from the pictures directly.
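
Here is a minimal sketch of that embedding-similarity matching. The random vectors stand in for embeddings from a joint image/audio encoder (in the real pipeline these would come from a pretrained model); the dimensions and counts are illustrative:

```python
import numpy as np

def cosine_sim(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Cosine similarity between every row of `a` and every row of `b`."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T

rng = np.random.default_rng(42)
artwork_embs = rng.normal(size=(5, 512))   # 5 paintings (toy embeddings)
music_embs = rng.normal(size=(20, 512))    # 20 candidate tracks

sims = cosine_sim(artwork_embs, music_embs)
best = sims.argmax(axis=1)                 # closest "vibe" per painting
for i, j in enumerate(best):
    print(f"painting {i} -> track {j} (similarity {sims[i, j]:.3f})")
```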

How It Works: The "Magic Bridge"

The system uses a type of AI called a Diffusion Model. Imagine noisy radio static that slowly clears up to reveal a song (a toy version of this idea is sketched just below).
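
This toy loop captures the denoising idea without any real model: a trained network would predict the noise at each step, so here we cheat with a fixed "clean" signal just so the sketch runs on its own:

```python
import numpy as np

rng = np.random.default_rng(0)
clean = np.sin(np.linspace(0, 8 * np.pi, 1024))  # stand-in for a clean audio latent
x = rng.normal(size=1024)                        # begin with radio static

for step in range(50):
    predicted_noise = x - clean     # a trained model would estimate this
    x = x - 0.1 * predicted_noise   # peel away a fraction of the noise

print(f"distance from the clean signal: {np.linalg.norm(x - clean):.4f}")
```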

  • The researchers took a pre-existing music generator (which was already very good at making music from text).
  • They built a special bridge (called the "Visual Conditioning Extractor") that translates the "language" of images into the "language" of music.
  • Instead of feeding the system a sentence like "Make it sad," they feed it the image of a sad painting. The bridge translates the colors and shapes into musical instructions (see the sketch after this list).
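
Here is a minimal sketch of such a bridge, assuming PyTorch. It maps one image embedding into a short sequence of conditioning tokens shaped like the ones a text-to-music diffusion model expects; all dimensions and the two-layer design are illustrative assumptions, not the paper's exact architecture:

```python
import torch
import torch.nn as nn

class VisualConditioningExtractor(nn.Module):
    """Sketch of the 'bridge': image embedding -> pseudo-text tokens."""

    def __init__(self, image_dim: int = 1024, cond_dim: int = 768, n_tokens: int = 8):
        super().__init__()
        self.n_tokens = n_tokens
        self.cond_dim = cond_dim
        self.proj = nn.Sequential(
            nn.Linear(image_dim, cond_dim * n_tokens),
            nn.GELU(),
            nn.Linear(cond_dim * n_tokens, cond_dim * n_tokens),
        )

    def forward(self, image_emb: torch.Tensor) -> torch.Tensor:
        # (batch, image_dim) -> (batch, n_tokens, cond_dim): the music
        # model can attend to these tokens as if they were a sentence.
        tokens = self.proj(image_emb)
        return tokens.view(-1, self.n_tokens, self.cond_dim)

bridge = VisualConditioningExtractor()
painting_emb = torch.randn(1, 1024)  # stand-in for a real image embedding
cond = bridge(painting_emb)
print(cond.shape)                    # torch.Size([1, 8, 768])
# A frozen text-to-music diffusion model would consume `cond` wherever it
# normally consumes its text embeddings.
```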

The Results: Does It Work?

The team tested Art2Mus against the old "Translator" methods.

  • The Good News: The music it creates sounds real and matches the mood of the painting surprisingly well. If you show it a chaotic, red abstract painting, the music tends to be fast and intense. If you show it a calm blue landscape, the music is slow and gentle.
  • The Reality Check: The old "Translator" methods (using text) still score slightly better on how closely the music aligns with the image. Why? Because text captions pack a lot of human judgment into a few words. But the researchers argue that Art2Mus is doing a much harder job: it's trying to understand art without the crutch of language.

Why This Matters

This is a big step forward for Creative AI.

  • For Artists: Imagine a musician who can look at a sketch and instantly hear a melody that fits it.
  • For Museums: Imagine walking through an art gallery where the music changes to match the painting you are looking at, creating a fully immersive experience.
  • For Science: It proves that computers can learn to connect two very different things (sight and sound) without needing a human to explain the connection in words first.

In a nutshell: Art2Mus teaches computers to listen to art, not just read about it. It's a direct line from the eye to the ear, skipping the boring part where the computer has to write a report first.
