Model Already Knows the Best Noise: Bayesian Active Noise Selection via Attention in Video Diffusion Model

Imagine you are a chef trying to bake the perfect chocolate cake. You have a perfect recipe (the AI model) and a clear description of what you want (the text prompt, like "a chocolate cake with strawberries").

However, before you start mixing, you need to grab a handful of flour from a giant, chaotic bin. In the world of AI video generation, this "flour" is called noise.

Here is the problem: Even though the recipe is the same, the specific handful of flour you grab first changes everything.

Grab a "lucky" handful, and you get a fluffy, delicious cake.
Grab a "bad" handful, and you get a dense, burnt mess.

In the past, AI video makers just grabbed a handful of noise at random. Sometimes it worked; often, it didn't. Other researchers tried to fix this by using "external rules" (like sifting the flour through a specific sieve), but that was slow and didn't always work for every cake.

This paper introduces a new method called ANSE (Active Noise Selection for Generation). Here is how it works, using simple analogies:

1. The "Gut Feeling" of the AI (The Core Idea)

Instead of just grabbing noise randomly or using external sieves, ANSE asks the AI itself: "Hey, before we start baking, which handful of flour do you feel most confident about?"

The AI has a "gut feeling" (mathematically called uncertainty) about different noise seeds.

High Uncertainty: The AI is confused. It's like a chef looking at a pile of flour and thinking, "I have no idea what this will turn into. It might be a cake, or it might be mud."
Low Uncertainty: The AI is confident. It's like the chef saying, "I know exactly what this flour will do. It will make a great cake."

ANSE's goal is to find the noise that gives the AI the lowest uncertainty (the most confidence).

2. The "Crowd Test" (How they measure confidence)

How does the AI know if it's confident? The researchers use a clever trick called BANSA (Bayesian Active Noise Selection via Attention).

Imagine the AI is a committee of 10 chefs looking at the same handful of flour.

Bad Noise (High BANSA Score): Chef #1 thinks it's a cake. Chef #2 thinks it's a pie. Chef #3 thinks it's soup. They are all arguing. The committee is divided and confused. This is a bad seed.
Good Noise (Low BANSA Score): All 10 chefs look at the flour and say, "Yep, that's definitely going to be a cake." They are in agreement and confident. This is a good seed.

The paper measures this "agreement" by looking at the AI's Attention Maps. Think of attention maps as the AI's "gaze."

If the AI's gaze is shaky and jumps around wildly when looking at the noise, it's confused (Bad Seed).
If the AI's gaze is steady and focused, it knows what it's doing (Good Seed).

3. The "Speed Trick" (Making it fast)

Usually, to check if a committee of 10 chefs agrees, you'd have to ask them 10 separate times. That takes too long and slows down the video generation.

The paper's secret sauce is a Bernoulli Mask.
Instead of asking the committee 10 times, they ask them once, but they put a "blindfold" on a random few chefs during that one question. Because the blindfolds are random, the chefs' answers vary slightly, simulating 10 different opinions instantly.

This allows the system to check the "confidence" of the noise in a split second without slowing down the video creation process.

4. The Result: Better Movies, Faster

By picking the "lucky" noise seeds where the AI is most confident:

The videos look better: Fewer glitches, smoother motion, and the characters look more real.
The story matches better: If you ask for "a cat dancing," the AI is less likely to accidentally make a dog running.
It's efficient: It adds very little time to the process (about 10-15% more), whereas other methods that try to fix the noise later can double or triple the time.

Summary

Think of ANSE as a smart assistant who checks the AI's "mood" before starting a video. Instead of guessing which starting point will work, the assistant asks the AI, "Are you sure about this starting point?" If the AI is confident (low uncertainty), they go for it. If the AI is confused (high uncertainty), they pick a different starting point.

The result? The AI spends less time confused and more time making amazing videos.

1. Problem Statement

Text-to-Video (T2V) diffusion models suffer from high sensitivity to the initial noise seed. Even with the same prompt, different random seeds can yield drastically varying results in terms of video quality, temporal coherence, and prompt alignment.

Limitations of Current Methods: Existing approaches to improve noise initialization rely on external priors (e.g., frequency filtering, inter-frame smoothing, or Gaussian priors). These methods often require:
- Heavy fine-tuning or complex rescheduling strategies.
- Repeated full diffusion passes (inference-time scaling), leading to significant computational overhead (often >100% increase in inference time).

They also ignore internal model signals that inherently indicate which seeds are "good" or "bad" for a specific prompt.

2. Methodology: ANSE and BANSA

The authors propose ANSE (Active Noise Selection for Generation), a model-aware framework that selects high-quality noise seeds by quantifying attention-based uncertainty without retraining the model.

Core Concept: BANSA (Bayesian Active Noise Selection via Attention)

The framework adapts the BALD (Bayesian Active Learning by Disagreement) principle, traditionally used for classification uncertainty, to the generative attention space.

Mechanism: Instead of predicting class logits, BANSA measures the entropy disagreement across multiple stochastic attention samples.
Acquisition Function: For a given noise seed $z$, prompt $c$, and timestep $t$, the model generates $K$ stochastic attention maps $A^{(1)}, \dots, A^{(K)}$ (via perturbations). The BANSA score is calculated as:
$\text{BANSA}(z, c, t) = H\left(\frac{1}{K}\sum_{k=1}^{K} A^{(k)}\right) - \frac{1}{K}\sum_{k=1}^{K} H\left(A^{(k)}\right)$
where $H$ is Shannon entropy.
- Interpretation: A low BANSA score indicates that the attention maps are consistent and confident (low epistemic uncertainty). A high score indicates disagreement and uncertainty.
- Selection Strategy: The framework selects the noise seed with the lowest BANSA score, as empirical evidence shows these seeds lead to more coherent and prompt-aligned videos.
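
The score and selection rule above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function names (`shannon_entropy`, `bansa_score`, `select_seed`), the `(K, Q, N)` array layout, and the choice to average entropies over query rows are all assumptions made for the example.

```python
import numpy as np

def shannon_entropy(p, axis=-1, eps=1e-12):
    """Shannon entropy (in nats) of probability distributions along `axis`."""
    p = np.clip(p, eps, 1.0)  # avoid log(0); the effect on the result is negligible
    return -np.sum(p * np.log(p), axis=axis)

def bansa_score(attn_samples):
    """BANSA = H(mean_k A^(k)) - mean_k H(A^(k)).

    attn_samples: array of shape (K, Q, N). Each row along the last axis is
    one softmax-normalised attention distribution over N keys.
    Assumption of this sketch: entropies are averaged over the Q query rows.
    """
    attn_samples = np.asarray(attn_samples, dtype=np.float64)
    mean_map = attn_samples.mean(axis=0)                     # (Q, N)
    entropy_of_mean = shannon_entropy(mean_map).mean()       # H of the averaged map
    mean_of_entropy = shannon_entropy(attn_samples).mean()   # average of per-sample H
    return float(entropy_of_mean - mean_of_entropy)

def select_seed(scores):
    """Index of the candidate seed with the lowest BANSA score."""
    return int(np.argmin(scores))

# Toy comparison: K = 4 samples that agree versus K = 4 samples that disagree.
agreeing = np.tile(np.array([[0.7, 0.1, 0.1, 0.1]]), (4, 1, 1))  # identical maps
disagreeing = np.eye(4)[:, None, :]                               # each attends elsewhere
scores = [bansa_score(disagreeing), bansa_score(agreeing)]
print("scores:", scores, "selected candidate:", select_seed(scores))
```

When all samples are identical the two entropy terms coincide and the score is zero; when the samples point at different keys, the averaged map is flatter than any single sample and the score grows, which is the disagreement the selection rule tries to avoid.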

Efficient Approximations

To make BANSA feasible for inference without retraining or massive compute costs, the authors introduce two key optimizations:

Bernoulli-Masked Attention: Instead of running $K$ full forward passes (which is expensive), the method injects stochasticity into a single forward pass by applying random Bernoulli masks to the attention scores. This generates $K$ stochastic samples from one computation.
Layer Truncation: Computing BANSA across all attention layers is redundant. The authors use cumulative correlation analysis to identify a cutoff depth $d^*$ (e.g., layer 14 in CogVideoX-2B) where the partial BANSA score correlates highly with the full-layer score. This reduces computation significantly while preserving selection accuracy.
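
Both approximations can be illustrated with a short NumPy sketch. It is a simplified reading of the description above, under several assumptions: a masked attention score is set to zero before the softmax, `keep_prob` and `threshold` are illustrative values, and the cutoff is taken as the first depth whose running mean over layers reaches a Pearson correlation threshold with the all-layer mean. The names `masked_attention_samples` and `cutoff_depth` are hypothetical.

```python
import numpy as np

def masked_attention_samples(scores, num_samples=10, keep_prob=0.8, seed=0):
    """Draw K stochastic attention maps from one set of pre-softmax scores.

    scores: (Q, N) attention scores computed once in a single forward pass.
    Returns an array of shape (K, Q, N) whose rows sum to 1.
    Assumption: a masked entry has its score multiplied by 0 before softmax.
    """
    rng = np.random.default_rng(seed)
    scores = np.asarray(scores, dtype=np.float64)
    # One independent Bernoulli(keep_prob) mask per sample and per entry.
    masks = rng.random((num_samples,) + scores.shape) < keep_prob
    masked = scores[None] * masks
    masked -= masked.max(axis=-1, keepdims=True)   # numerically stable softmax
    weights = np.exp(masked)
    return weights / weights.sum(axis=-1, keepdims=True)

def cutoff_depth(layer_scores, threshold=0.9):
    """Smallest depth d (1-indexed) whose partial score tracks the full score.

    layer_scores: (num_seeds, num_layers) per-layer BANSA scores.
    The partial score at depth d is the mean over layers 1..d; it is compared
    with the all-layer mean using Pearson correlation across seeds.
    """
    layer_scores = np.asarray(layer_scores, dtype=np.float64)
    num_layers = layer_scores.shape[1]
    full = layer_scores.mean(axis=1)
    cumulative = np.cumsum(layer_scores, axis=1) / np.arange(1, num_layers + 1)
    for d in range(num_layers):
        if np.corrcoef(cumulative[:, d], full)[0, 1] >= threshold:
            return d + 1
    return num_layers

samples = masked_attention_samples(np.array([[1.0, 2.0, 3.0], [0.5, 0.5, 0.5]]))
print(samples.shape)  # (10, 2, 3): K maps obtained from one set of scores
```

The point of the first function is that the expensive part, computing the attention scores, happens once, and only the cheap mask-and-softmax step is repeated $K$ times. The second function only needs scores from layers up to the returned depth at selection time.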

3. Key Contributions

First Active Noise Selection Framework for Video Diffusion: ANSE is the first method to treat noise selection as an active learning problem grounded in Bayesian uncertainty within the attention space of generative models.
BANSA Acquisition Function: Introduces a novel metric that measures attention consistency under stochastic perturbations, enabling model-aware selection without external priors or retraining.
Efficient Inference-Time Deployment: Demonstrates that high-quality selection can be achieved with marginal overhead (approx. 10–15% increase in inference time) by using Bernoulli masking and layer truncation.
Generalizability: The method is plug-and-play and works across diverse architectures, including U-Net based (AnimateDiff) and MMDiT based (CogVideoX, HunyuanVideo, Wan2.1) models.

4. Experimental Results

The method was evaluated on multiple state-of-the-art T2V backbones (AnimateDiff, CogVideoX-2B/5B, HunyuanVideo, Wan2.1) using the VBench benchmark and FVMD (Fréchet Video Motion Distance).

Quantitative Improvements:
- AnimateDiff: Improved Total VBench score from 77.98 to 79.33 (+1.35 points) with only a 10.98% inference time increase.
- CogVideoX-5B: Improved Total VBench score from 81.52 to 81.71 (+0.19 points) with a 13.1% inference time increase.
- Comparison to SOTA: ANSE outperforms or matches frequency-prior methods (like FreqPrior) while being significantly faster. FreqPrior often incurs >100% overhead, whereas ANSE stays under 15%.
- Motion Quality: ANSE achieved lower FVMD scores on MSR-VTT, indicating better motion fidelity.

Qualitative Analysis:
- Videos generated with low-BANSA seeds showed improved temporal coherence, reduced flickering, and better anatomical correctness (e.g., "koala playing piano").
- Cross-Prompt Behavior: The paper demonstrates that noise effectiveness is prompt-dependent; a "good" seed for one prompt may be "bad" for another, validating the need for per-prompt selection rather than a universal seed.

Ablation Studies:
- Reversing the Criterion: Selecting seeds with the highest BANSA scores (highest uncertainty) resulted in degraded video quality, confirming the validity of the selection logic.
- Stochasticity: Bernoulli masking (BANSA-B) outperformed Dropout-based stochasticity, suggesting the masking strategy better captures attention-level uncertainty.

5. Significance and Impact

Inference-Time Scaling Paradigm: ANSE introduces a new way to scale diffusion models not by increasing the number of denoising steps or model parameters, but by intelligently selecting the initial condition.
Cost-Effectiveness: It provides a principled, generalizable approach to noise selection that offers significant quality gains with minimal computational cost, making it highly practical for real-world deployment.
Theoretical Insight: The work bridges the gap between active learning (uncertainty estimation) and generative modeling, providing empirical evidence that internal attention signals can predict generation quality.
Future Directions: The authors suggest combining ANSE with post-training refinement methods (like Self-Forcing) to further enhance quality, as ANSE handles the initialization while other methods handle the sampling trajectory.

In summary, ANSE leverages the model's own attention mechanisms to "know" which noise seeds will work best, offering a lightweight, high-impact solution to the stochasticity problem in video generation.

