Echoes Over Time: Unlocking Length Generalization in Video-to-Audio Generation Models

This paper introduces MMHNet, a multimodal hierarchical network incorporating non-causal Mamba that enables video-to-audio models trained on short clips to effectively generalize and generate high-quality audio sequences exceeding five minutes in duration.

Christian Simon, Masato Ishii, Wei-Yao Wang, Koichi Saito, Akio Hayakawa, Dongseok Shim, Zhi Zhong, Shuyang Cui, Shusuke Takahashi, Takashi Shibuya, Yuki Mitsufuji

Published 2026-02-26

Imagine you are trying to teach a robot to be a movie sound designer. Your goal is to show the robot a silent video clip and have it generate the perfect background noise, dialogue, and sound effects to match what's happening on screen.

This is the challenge of Video-to-Audio (V2A) generation. While we have gotten really good at making short soundtracks (like 8 seconds of a dog barking), making a continuous, high-quality soundtrack for a 5-minute movie scene has been incredibly difficult. The robot usually gets confused, the audio sounds robotic, or the sounds stop matching the video halfway through.

The paper "Echoes Over Time" introduces a new system called MMHNet that solves this problem. Here is how it works, explained through simple analogies.

The Problem: The "Short-Term Memory" Robot

Most current AI models are like students who only study for 8-second pop quizzes.

  • The Issue: If you ask them to write a 5-minute essay (generate 5 minutes of audio), they panic. They forget what happened at the beginning, they start repeating themselves, or they lose track of the plot.
  • The Old Way (Transformers): Traditional models use "positional embeddings." Think of this like giving the robot a numbered list of instructions (1, 2, 3...). If the list goes up to 100, the robot knows exactly where it is. But if you suddenly ask it to write a list up to 1,000, it gets lost because it was never taught numbers that high. It tries to guess, and the result is messy.
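To make the "numbered list" problem concrete, here is a minimal sketch in PyTorch (the sizes and names are made up for illustration; this is not the paper's code). A learned positional-embedding table simply has no entries for positions beyond the lengths it saw in training:

```python
import torch
import torch.nn as nn

# Hypothetical sketch: a learned positional-embedding table sized for
# short training clips. Sizes are illustrative, not from the paper.
MAX_TRAIN_POSITIONS = 200                  # e.g. latent frames covering ~8 seconds
pos_table = nn.Embedding(MAX_TRAIN_POSITIONS, 512)

short_ids = torch.arange(150)              # within the training range: fine
short_pe = pos_table(short_ids)            # shape (150, 512)

long_ids = torch.arange(7500)              # roughly 5 minutes of latent frames
try:
    long_pe = pos_table(long_ids)          # positions 200..7499 were never learned
except IndexError:
    print("No embedding exists for positions beyond the training length.")
```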

The Solution: MMHNet (The "Smart Conductor")

The authors built a new system, MMHNet, which acts like a smart orchestra conductor rather than a rigid robot. It uses three main tricks to handle long videos:

1. The "Non-Causal" Ear (Listening to the Whole Room)

Most audio models are "causal," meaning they can only hear what happened before the current moment, like listening to a radio show live.

  • The Analogy: Imagine trying to understand a conversation in a noisy room. If you can only use what was said before the current moment, you can miss context that only becomes clear a few seconds later.
  • The Fix: MMHNet uses a technology called Non-Causal Mamba. This is like giving the conductor a surveillance camera that shows the entire room at once. It can see the whole video scene simultaneously. It doesn't have to guess what comes next; it knows the whole context, so the audio stays consistent from start to finish, even for 5 minutes.
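To show the idea (not the paper's actual layers), here is a toy non-causal block in PyTorch. It uses a plain bidirectional recurrent pair as a stand-in for non-causal Mamba: a forward pass gives each moment the past, a backward pass gives it the future, and mixing the two means every moment "hears" the whole clip, at any length, with no positional table to outgrow:

```python
import torch
import torch.nn as nn

class ToyNonCausalBlock(nn.Module):
    """Toy stand-in for a non-causal Mamba layer (illustrative only)."""
    def __init__(self, dim):
        super().__init__()
        self.fwd = nn.GRU(dim, dim, batch_first=True)   # stand-in for a causal scan
        self.bwd = nn.GRU(dim, dim, batch_first=True)
        self.mix = nn.Linear(2 * dim, dim)

    def forward(self, x):                  # x: (batch, time, dim)
        h_fwd, _ = self.fwd(x)             # each step sees everything before it
        h_bwd, _ = self.bwd(x.flip(1))     # each step sees everything after it
        h_bwd = h_bwd.flip(1)
        return self.mix(torch.cat([h_fwd, h_bwd], dim=-1))

block = ToyNonCausalBlock(dim=64)
clip = torch.randn(1, 7500, 64)            # ~5 minutes of latent frames
out = block(clip)                          # works at any length
```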

2. The "Hierarchical" Filter (The VIP Pass)

Long videos are full of boring, repetitive moments (like a car driving down a straight road for 30 seconds). Processing every single frame and sound with the same effort is a waste of compute.

  • The Analogy: Imagine a bouncer at a club. Instead of letting every single person in the crowd (every single data point) into the VIP area (the main brain of the AI), the bouncer checks their ID.
  • The Fix: MMHNet uses Token Routing. It acts as a smart filter.
    • Temporal Routing: It asks, "Is there a loud sound happening right now?" If it's just silence or background hum, it skips it.
    • Multimodal Routing: It asks, "Does this sound match the video?" If a car crashes on screen, it prioritizes the crash sound. If the video is just a static landscape, it ignores the audio data that doesn't match.
    • Result: The AI only processes the "VIP" moments that matter, saving energy and keeping the quality high for the long haul.
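Here is a hypothetical sketch of that routing idea in PyTorch (the class and scoring rules are our own simplifications, not the paper's modules): score each timestep by its own activity and by how well it matches the video, then let only the top scorers through to the expensive layers:

```python
import torch
import torch.nn as nn

class ToyTokenRouter(nn.Module):
    """Hypothetical sketch of token routing; not the paper's actual layers."""
    def __init__(self, dim, keep_ratio=0.5):
        super().__init__()
        self.temporal_score = nn.Linear(dim, 1)   # "is something happening now?"
        self.keep_ratio = keep_ratio

    def forward(self, audio_tokens, video_tokens):   # (B, T, D) each, time-aligned
        temporal = self.temporal_score(audio_tokens).squeeze(-1)   # (B, T)
        multimodal = (audio_tokens * video_tokens).sum(-1)         # "does sound match video?"
        scores = temporal + multimodal
        k = max(1, int(self.keep_ratio * audio_tokens.shape[1]))
        keep_idx = scores.topk(k, dim=1).indices                   # the "VIP" timesteps
        batch_idx = torch.arange(audio_tokens.shape[0]).unsqueeze(-1)
        return audio_tokens[batch_idx, keep_idx], keep_idx

router = ToyTokenRouter(dim=64, keep_ratio=0.5)
audio = torch.randn(1, 7500, 64)
video = torch.randn(1, 7500, 64)
kept, idx = router(audio, video)   # only ~half the tokens reach the heavy layers
```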

3. The "Compressed" Brain

Instead of trying to remember every single detail of a long movie, the AI compresses the information into a summary.

  • The Analogy: Think of reading a book. You don't memorize every single letter; you remember the story.
  • The Fix: The model processes the video in a "compressed space." It understands the essence of the scene (e.g., "a busy market") rather than getting bogged down in the pixel-by-pixel details. This allows it to stretch a short training memory (8 seconds) into a long, coherent performance (5 minutes) without breaking.
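A minimal sketch of what "working in a compressed space" can look like, assuming a simple strided-convolution encoder (the paper's actual compression module will differ): temporal downsampling turns minutes of frame-level features into a much shorter summary sequence for the generator to reason over:

```python
import torch
import torch.nn as nn

# Illustrative only: layer sizes are made up. Strided convolutions
# downsample time 16x, so the model works on a short summary sequence
# instead of every individual frame.
compress = nn.Sequential(
    nn.Conv1d(64, 128, kernel_size=4, stride=4),    # 4x temporal downsampling
    nn.GELU(),
    nn.Conv1d(128, 128, kernel_size=4, stride=4),   # another 4x -> 16x total
)

frames = torch.randn(1, 64, 7500)       # ~5 minutes of per-frame video features
latent = compress(frames)               # (1, 128, 468): far fewer steps to model
print(frames.shape[-1], "->", latent.shape[-1])
```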

The Result: "Train Short, Test Long"

The most impressive part of this paper is the magic trick:

  • They trained the AI only on 8-second clips.
  • They tested it on 5-minute videos.
  • The Outcome: The AI didn't just "guess." It generated high-quality, synchronized audio that stayed coherent for minutes, outperforming previous approaches on the same long-form task.

Why This Matters

Before this, if you wanted to make a 5-minute video with AI sound, you had to chop it into tiny 8-second pieces, generate sound for each, and try to glue them together. The result was usually choppy and sounded like a broken record.

MMHNet is like a musician who, after practicing scales for 8 seconds, can suddenly play a full, complex symphony without missing a beat. It opens the door for AI to generate soundtracks for full movies, long video games, and documentaries, making the digital world feel much more real and immersive.
