WAVE: Learning Unified & Versatile Audio-Visual Embeddings with Multimodal LLM

This paper introduces WAVE, the first LLM-based embedding model that unifies text, audio, and video into a single representation space through hierarchical feature fusion and joint multi-task training, achieving state-of-the-art performance in cross-modal retrieval and prompt-aware multimodal question answering.

Changli Tang, Qinfan Xiao, Ke Mei, Tianyi Wang, Fengyun Rao, Chao Zhang

Published 2026-02-24

Imagine you have a massive library where books, movies, music, and podcasts are all stored in completely different rooms. Currently, if you want to find a movie that sounds like a specific song, or a book that matches the mood of a video, you have to hire a different librarian for each room. They speak different languages and don't talk to each other.

WAVE is like hiring a super-intelligent, multilingual "Universal Librarian" who can walk into any room, understand the content instantly, and translate it all into a single, universal language that everyone understands.

Here is a breakdown of what this paper is about, using simple analogies:

1. The Problem: The "Tower of Babel" of Media

Right now, computers are good at understanding text, and they are getting better at images. But audio (sound) and video (moving pictures) are usually handled by separate systems that don't share a common representation.

  • The Old Way: You have one computer brain for text, another for pictures, and a third for sound. They don't really know how to talk to each other. If you ask, "Find me a video that sounds like jazz," the text brain and the sound brain might get confused.
  • The WAVE Solution: WAVE is a new type of AI (based on a powerful model called Qwen2.5-Omni) that learns to put text, audio, silent video, and full video all into one shared "mental space." It's like having a single dictionary where the word "sunset" has the same meaning whether you read it, see a video of it, or hear the wind blowing during it.

2. How It Works: The "Swiss Army Knife" Architecture

The paper describes a few clever tricks WAVE uses to become this Universal Librarian:

  • The Dual-Ear System: Humans have two ears to hear different things (like a voice vs. background noise). WAVE uses a dual-encoder for audio. One part listens for "speech" (like a person talking), and the other listens for "events" (like a dog barking or rain falling). This lets it understand the full richness of sound.
  • The "Layer Cake" Fusion: Usually, AI models look at the very last layer of their brain to make a decision. WAVE is smarter; it looks at every layer of its brain, from the bottom (simple details) to the top (complex ideas). It then mixes all these layers together like a chef blending ingredients in a smoothie. This ensures it doesn't miss the small details or the big picture.
  • The "Prompt" Magic: Most AI models just give you a generic summary. WAVE is prompt-aware. Think of it like a personal assistant.
    • Generic AI: "Here is a summary of the video."
    • WAVE: "Here is a summary of the video specifically focusing on the colors" or "specifically focusing on the sad music."
    • Because it can follow instructions, it creates a "customized" version of the video's meaning depending on what you ask.
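The "Layer Cake" fusion idea can be sketched in a few lines. This is a minimal illustration, not the paper's exact implementation: it assumes the fusion is a learnable softmax-weighted average of the hidden states from every transformer layer, followed by mean-pooling over tokens and unit-normalization (the function name `fuse_layers` and the pooling choice are illustrative assumptions).

```python
import numpy as np

def fuse_layers(hidden_states, layer_weights):
    """Blend hidden states from every transformer layer into one embedding.

    hidden_states: list of (seq_len, dim) arrays, one per layer
                   (bottom layers = simple details, top layers = complex ideas).
    layer_weights: (num_layers,) raw scores, softmax-normalized below.
    """
    w = np.exp(layer_weights - layer_weights.max())
    w = w / w.sum()                           # softmax over layers
    stacked = np.stack(hidden_states)         # (num_layers, seq_len, dim)
    fused = np.tensordot(w, stacked, axes=1)  # weighted sum -> (seq_len, dim)
    emb = fused.mean(axis=0)                  # mean-pool over tokens
    return emb / np.linalg.norm(emb)          # unit length, ready for retrieval
```

Because the layer weights are learned, the model itself decides how much "small detail" versus "big picture" goes into the final embedding.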

3. The Training: Learning by Doing Everything at Once

To teach WAVE, the researchers didn't just show it one type of data. They threw everything at it at the same time:

  • They showed it videos and asked it to find matching text.
  • They showed it audio and asked it to find matching videos.
  • They asked it questions about what it saw and heard.

The Analogy: Imagine training a student by making them study math, history, and music simultaneously in the same class. Instead of being a specialist who only knows math, this student learns how math concepts relate to music rhythms and historical patterns. The paper found that this "joint training" made WAVE much smarter than models trained on just one subject.
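The "learning by matching" step above is typically implemented with a contrastive loss: embeddings of a matching pair (say, a video and its caption) are pulled together, while everything else in the batch is pushed apart. Below is a minimal numpy sketch of a symmetric InfoNCE-style loss; the function name, temperature value, and batching scheme are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def info_nce(query_embs, target_embs, temperature=0.07):
    """Symmetric contrastive loss over a batch of paired embeddings.

    Row i of query_embs (e.g. a video) should match row i of target_embs
    (e.g. its caption); every other row in the batch acts as a negative.
    """
    q = query_embs / np.linalg.norm(query_embs, axis=1, keepdims=True)
    t = target_embs / np.linalg.norm(target_embs, axis=1, keepdims=True)
    logits = q @ t.T / temperature          # (batch, batch) similarity matrix
    labels = np.arange(len(q))

    def xent(l):
        # cross-entropy of the true (diagonal) pairing
        l = l - l.max(axis=1, keepdims=True)
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()

    # both directions: query -> target and target -> query
    return (xent(logits) + xent(logits.T)) / 2
```

Because the same loss applies to any pair of modalities (video↔text, audio↔video, audio↔text), training on all of them at once is what pulls every modality into the one shared space.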

4. The Results: Why It Matters

The paper tested WAVE on some very difficult challenges:

  • The "Any-to-Any" Game: Can you find a video using a sound clip? Can you find a song using a text description? WAVE is currently the best at this. It's like being able to find a movie just by humming a tune from it.
  • The "Quiz Show": When asked specific questions about a video (e.g., "What color was the car in the background?"), WAVE got much better scores than other models because it could focus its "attention" based on the question.
  • Beating the Giants: It beat existing top-tier models (even some from big tech companies) on video and audio benchmarks.
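Once every modality lives in one space, the "Any-to-Any" game reduces to a single nearest-neighbor search: embed the query (a sound clip, a sentence, a video), then rank all indexed items by cosine similarity. A minimal sketch, assuming unit-normalizable embedding vectors (the function name `retrieve` is an illustrative choice):

```python
import numpy as np

def retrieve(query_emb, index_embs, top_k=3):
    """Rank items in a shared embedding space by cosine similarity.

    Because text, audio, and video all live in one space, the same call
    answers "find a video from a sound clip" or "find a song from text".
    """
    q = query_emb / np.linalg.norm(query_emb)
    idx = index_embs / np.linalg.norm(index_embs, axis=1, keepdims=True)
    scores = idx @ q                      # cosine similarity per item
    order = np.argsort(-scores)[:top_k]   # best matches first
    return order, scores[order]
```

In practice the index would hold millions of precomputed embeddings behind an approximate-nearest-neighbor structure, but the ranking principle is exactly this dot product.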

The Big Takeaway

WAVE is a breakthrough because it stops treating sound, sight, and words as separate things. It creates a unified language for all senses.

In everyday terms:
Before WAVE, if you wanted to find a video of a "happy dog playing in the rain," you had to search for "dog," then "rain," then hope the algorithm guessed you wanted them together.
With WAVE, you can just say, "Show me the happy dog in the rain," and it understands the feeling and the context of all those elements combined, instantly finding exactly what you need. It's the first step toward AI that truly understands the world the way humans do—through a mix of sight, sound, and language all at once.
