The Big Picture: Teaching a Robot to Sing (and Sound Real)
Imagine you are trying to teach a robot to sing or generate sound effects. You have a "Student" robot (the AI model) and a "Teacher" robot (a pre-trained expert that already knows how to sound good).
The goal is to make the Student learn faster by copying the Teacher. In the world of AI, this is called Representation Alignment (REPA). The idea is simple: "Hey Student, look at what the Teacher is thinking at step 5 of the process, and try to think the same thing."
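In code, vanilla REPA boils down to adding a similarity penalty between one fixed student layer and the matching teacher layer. Here is a minimal sketch; the layer index, the cosine-distance loss, and the NumPy setup are illustrative assumptions, not the paper's actual implementation:

```python
import numpy as np

def repa_loss(student_hidden, teacher_hidden, layer=5):
    """Vanilla REPA (sketch): pull one fixed student layer toward the
    matching teacher layer with a cosine-similarity penalty.
    `layer=5` is an arbitrary "step 5" choice, not from the paper."""
    s, t = student_hidden[layer], teacher_hidden[layer]
    cos = s @ t / (np.linalg.norm(s) * np.linalg.norm(t))
    return 1.0 - cos  # 0 when the Student "thinks the same thing"
```

The key point is the hard-coded `layer=5`: the old approach picks that layer by convention (usually somewhere in the middle), which is exactly what the rest of this summary calls into question.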
The Problem:
The researchers found that the old way of doing this was like a bad coach. The coach would say, "Okay, Student, look at the Teacher's middle brain thoughts." But it turns out, the middle thoughts might be full of interesting facts (like "this is a dog barking"), but they aren't actually the ones driving the robot's mouth to move.
The paper introduces a new method called AG-REPA (Attribution-Guided REPA) to fix this.
The Core Discovery: "Knowing" vs. "Doing"
The authors discovered a strange phenomenon they call Store-Contribute Dissociation (SCD). Let's break this down with an analogy:
The Library vs. The Construction Crew
Imagine the AI model is a massive construction site building a house (the audio).
- The "Storage" Layers (Deep Layers): These are like the Library. They are full of blueprints, books, and knowledge. They "know" everything about what a house should look like. If you ask them, "What does a house look like?" they have the perfect answer.
- The "Contribution" Layers (Shallow/Early Layers): These are like the Construction Crew at the very front of the site. They might not have the whole library of books, but they are the ones actually swinging the hammers and laying the bricks. They are the ones doing the work that moves the project forward.
The Mistake:
Old methods tried to align the Student with the Teacher's Library (the deep layers). They thought, "If the Student knows as much as the Teacher, it will build a better house."
The Reality: The Student was just memorizing the books but not learning how to swing the hammer. The house wasn't getting built faster or better.
The Insight:
The paper says: "Knowing is not Doing." To make the AI learn faster, you shouldn't force it to copy the Teacher's knowledge; you should force it to copy the Teacher's actions (the layers that actually drive the sound generation).
The Solution: The "Gatekeeper" Test (FoG-A)
How do we know which layers are the "Construction Crew" and which are just the "Library"?
The authors invented a tool called FoG-A (Forward-only Gate Ablation).
The Analogy: The "What If" Game
Imagine you are watching the construction crew. To see who is actually important, you play a game of "What If":
- You pretend to close the gate on the Construction Crew (Layer 1).
- You watch the house. Crash! The whole building stops. The crew was essential.
- Now, you close the gate on the Library (Layer 24).
- You watch the house. The crew keeps hammering. The house keeps going. The library was just watching.
FoG-A does this mathematically. It temporarily "turns off" each layer of the AI and sees how much the final sound changes.
- If turning off a layer ruins the sound, that layer is a Causal Driver. (We must align this one!)
- If turning off a layer changes nothing, that layer is just Storage. (We can ignore it.)
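The "What If" game above can be sketched in a few lines. This is a toy stand-in, not the paper's FoG-A implementation: the residual-stack model, the gating-by-skipping, and the L2 distance are all assumptions made for illustration:

```python
import numpy as np

def fog_a_scores(layers, x):
    """Forward-only gate ablation (sketch): score each layer by how much
    the final output changes when that layer's contribution is gated off.
    `layers` is a list of functions applied residually: h = h + layer(h)."""
    def forward(skip=None):
        h = x
        for i, layer in enumerate(layers):
            if i != skip:  # "close the gate" on the ablated layer
                h = h + layer(h)
        return h

    baseline = forward()
    # Bigger score = ablation changes the output more = Causal Driver
    return [float(np.linalg.norm(baseline - forward(skip=i)))
            for i in range(len(layers))]

# Toy model: an early "Construction Crew" layer and a near-inert "Library" layer
rng = np.random.default_rng(0)
W_early = rng.normal(size=(4, 4))
layers = [lambda h: h @ W_early,   # strongly shapes the output
          lambda h: 1e-3 * h]      # barely touches the output
scores = fog_a_scores(layers, rng.normal(size=(4,)))
```

Running this, `scores[0]` dwarfs `scores[1]`: gating the first layer wrecks the output, gating the second barely registers, which is the Crew-vs-Library distinction in numbers.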
The New Strategy: AG-REPA
Instead of guessing which layer to align (like "Let's align Layer 8 because it's in the middle"), AG-REPA uses the FoG-A test to find the real "Causal Drivers."
- Identify: It finds the specific layers that are actually doing the heavy lifting (usually the early layers).
- Align: It forces the Student to copy the Teacher only on those specific, active layers.
- Result: The Student learns the mechanics of making sound, not just the theory of it.
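The Identify-then-Align recipe can be sketched as a loss that only touches the top-scoring layers. Again a hedged illustration: the top-k selection, the per-layer cosine distance, and the averaging are plausible assumptions, not the paper's exact formulation:

```python
import numpy as np

def ag_repa_loss(student_feats, teacher_feats, ablation_scores, k=2):
    """AG-REPA (sketch): align the Student to the Teacher ONLY on the
    layers with the highest ablation scores (the Causal Drivers),
    ignoring the "Storage" layers entirely."""
    drivers = np.argsort(ablation_scores)[::-1][:k]  # Identify: top-k drivers
    loss = 0.0
    for i in drivers:                                # Align: only those layers
        s, t = student_feats[i], teacher_feats[i]
        loss += 1.0 - s @ t / (np.linalg.norm(s) * np.linalg.norm(t))
    return loss / k
```

The only difference from vanilla REPA is *which* layers feed the loss: the indices come from the ablation test rather than from a "middle of the network" convention.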
The Results: Why It Matters
The researchers tested this on two big tasks:
- Text-to-Speech: Making a computer read text like a human.
- Text-to-Audio: Making sound effects (like a dog barking or rain falling) from text descriptions.
The Outcome:
- Better Quality: The sounds were clearer and more natural (lower "Word Error Rate" and better "MOS" scores).
- Faster Learning: The AI reached high quality much faster because it wasn't wasting time copying the "Library" layers.
- Universal: This worked on different types of AI models, proving that the "Knowing vs. Doing" gap is a universal rule in AI, not just a fluke.
Summary in One Sentence
Don't teach an AI to memorize the encyclopedia; teach it to copy the specific actions that actually build the result.
By using AG-REPA, we stop aligning the AI with layers that just "know" things and start aligning it with the layers that actually "do" the work, leading to smarter, faster, and higher-quality audio generation.