SemanticDialect: Semantic-Aware Mixed-Format Quantization for Video Diffusion Transformers

Imagine you have a massive, high-definition movie studio (a Video Diffusion Transformer) that can create stunning videos from text descriptions. The problem is, this studio is so huge and power-hungry that it can't fit inside a regular smartphone or a small laptop (an "edge device"). It needs a supercomputer to run.

To make this studio portable, engineers try to shrink it down using Quantization. Think of quantization like compressing a high-resolution photo into a JPEG. You lose some tiny details to save space, but the image still looks good.

However, video generation is tricky. Unlike a still photo, a video has moving parts, changing lights, and complex stories. If you compress it too roughly, the video becomes a blurry, glitchy mess where the character's face morphs into a blob or the background flickers.

SemanticDialect is a new, clever way to shrink these video studios so they can run on small devices without ruining the movie quality. Here is how it works, explained through three simple analogies:

1. The "Swiss Army Knife" vs. The "One-Size-Fits-All" Tool

The Problem: Traditional compression methods are like using a single, blunt hammer to fix everything. Sometimes you need a screwdriver, sometimes a wrench. If you use a hammer for a screw, you break it. In video AI, some parts of the data are tiny and delicate, while others are huge and loud. A single compression format breaks the delicate parts.

The Solution (Mixed-Format): Imagine instead of one hammer, you have a Swiss Army Knife with 32 different tools (a "Formatbook").

The Old Way: You pick one tool and try to use it for the whole job.
SemanticDialect: It looks at every tiny section of the video data and instantly picks the perfect tool for that specific job.
The Magic Trick: Usually, checking 32 tools takes too long. SemanticDialect uses a Look-Up Table (LUT), which works like a cheat sheet or a menu with pictures. Instead of calculating from scratch which tool is best, it glances at the menu, points to the right picture, and grabs the tool. This makes the process fast enough for a phone.

2. The "Residual Error" (The "Fix-It" Kit)

The Problem: Even with the best Swiss Army Knife, sometimes you still make a tiny mistake. In video AI, some layers are super sensitive (like the director's voice). If you compress them even a little bit, the whole video gets noisy. Usually, to fix this, you'd have to keep those parts in high definition (which defeats the purpose of shrinking the model).

The Solution (Activation Decomposition): Imagine you are painting a masterpiece. You make a small mistake on a brushstroke. Instead of throwing away the whole canvas, you take a tiny bit of extra paint (the "residual error"), fix the mistake, and add it back on top.

SemanticDialect does this mathematically. It compresses the main data, calculates the tiny "mistake" it made, compresses that mistake separately, and adds it back in.
The Smart Filter: It doesn't fix every mistake (that would be too slow). It uses Attention (the AI's "gaze") to find the most important "stars" of the video (the main characters or key objects) and only fixes the errors on those. It ignores the background noise to save time.

3. The "Family Reunion" (Semantic Awareness)

The Problem: Imagine a video of a dog running through a park. The dog's nose in Frame 1 and the dog's nose in Frame 2 are the same object. But because the AI looks at them as separate tiny blocks, it might compress the nose in Frame 1 using a "blue" tool and the nose in Frame 2 using a "red" tool. When you stitch the video together, the dog's nose flickers and looks weird. This is a loss of semantic consistency.

The Solution (SeDA - Semantic-Aware Dialect Assignment):

SemanticDialect acts like a family reunion organizer. It knows that the dog's nose in Frame 1 and Frame 2 are "family members" (semantically related).
It forces these related parts to use the same tool (the same "sub-formatbook").
Even if the data looks slightly different, the AI ensures that the "family" stays consistent. This keeps the video smooth and prevents the "flickering" effect, ensuring the dog looks like the same dog throughout the movie.

The Result

By combining these three tricks:

Smart Tool Selection (using a cheat sheet to pick the right compression for every tiny block).
The Fix-It Kit (adding back the tiny mistakes only where it matters most).
The Family Organizer (making sure related parts of the video stay consistent).

SemanticDialect manages to shrink a massive video AI model down to 4-bit precision (roughly a quarter of the size of a 16-bit model) while keeping the video quality almost as good as the original, uncompressed version. It's like fitting a 4K movie studio into a backpack without losing the plot or the picture quality.

1. Problem Statement

Video Diffusion Transformers (VDiTs), such as Open-Sora, have achieved state-of-the-art video generation quality but suffer from prohibitive memory and compute costs, hindering deployment on edge devices. While quantization (reducing precision) is a standard solution, applying it to VDiTs presents unique challenges:

High Activation Variation: VDiT activations exhibit large outliers and high variability across time and space. Standard low-precision formats (e.g., INT4, FP4) struggle to represent these distributions without significant error, leading to severe quality degradation.
Spatiotemporal Coherence: Video generation relies on strong temporal and semantic correlations. Existing quantization methods often treat blocks independently, causing "over-specialization" where semantically related tokens (e.g., the same object across frames) are quantized differently, breaking visual consistency.
Scalability of Mixed-Format Selection: Recent mixed-format approaches select an optimal format for each block from a "formatbook." However, extending this to VDiTs is computationally expensive due to the need for large formatbooks and complex online selection logic, which does not scale well for real-time inference.

2. Methodology: SemanticDialect

The authors propose SemanticDialect, a post-training quantization (PTQ) method that combines fine-grained block-wise mixed-format quantization with semantic awareness. The framework consists of four core components:

A. SD4: Scalable Mixed-Format Quantization

32-Dialect Formatbook: Unlike prior work using 16 dialects, SemanticDialect employs a larger 32-dialect formatbook. These dialects are rule-based designs that cover diverse dynamic ranges, densify small magnitudes (where most values cluster), and preserve large magnitudes (which dominate matrix multiplication results).
LUT-Based Online Selection: To avoid the high computational cost of calculating Mean Squared Error (MSE) for all 32 dialects online, the method uses Lookup Tables (LUTs).
- Two-Stage Selection: First, the block's maximum value determines a "sub-formatbook" (a subset of 8 dialects). Second, the method approximates MSE using pre-computed LUTs for quantized values and quantization errors, selecting the best dialect efficiently.
- Group-wise Max Approximation: Instead of sorting all elements to find outliers, the block is partitioned into groups, and the maximum of each group is used to estimate the distribution, balancing accuracy and speed.
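
The two-stage, LUT-based selection described above can be sketched roughly as follows. This is a toy illustration, not the authors' implementation: the four dialects, the two sub-formatbooks, the 0.75 threshold, the power-of-two block scale, and the 64-bin table are all assumptions chosen to keep the example small (the paper uses 32 dialects in sub-formatbooks of 8). Only an error table is built here; a table of quantized values would be built the same way.

```python
import math

LUT_BINS = 64  # resolution of the hypothetical lookup tables

# Hypothetical dialects: 8 magnitude levels in [0, 1]; the sign is a separate bit.
DIALECTS = {
    "uniform":   [i / 7 for i in range(8)],
    "fp4_like":  [0.0, 1/12, 1/6, 1/4, 1/3, 1/2, 2/3, 1.0],
    "dense_low": [0.0, 1/32, 1/16, 1/8, 3/16, 1/4, 1/2, 1.0],
    "short":     [i / 10 for i in range(8)],  # tops out at 0.7
}
# Hypothetical stage-1 mapping from the normalized block maximum to a subset.
SUB_FORMATBOOKS = {
    "high": ["uniform", "fp4_like", "dense_low"],
    "low":  ["short", "fp4_like", "dense_low"],
}

def nearest(levels, m):
    """Closest representable magnitude to m."""
    return min(levels, key=lambda q: abs(q - m))

def build_error_lut(levels):
    """Squared quantization error at each bin center, computed once offline."""
    centers = [(b + 0.5) / LUT_BINS for b in range(LUT_BINS)]
    return [(c - nearest(levels, c)) ** 2 for c in centers]

ERROR_LUT = {name: build_error_lut(lv) for name, lv in DIALECTS.items()}

def select_dialect(block):
    """Return (dialect name, shared scale) for one block of values."""
    max_abs = max(abs(x) for x in block)
    if max_abs == 0:
        return "uniform", 1.0
    scale = 2.0 ** math.ceil(math.log2(max_abs))  # normalized max lies in (0.5, 1]
    # Stage 1: the block maximum picks the sub-formatbook.
    stage1 = "high" if max_abs / scale > 0.75 else "low"
    # Stage 2: approximate each candidate's MSE by summing table entries.
    bins = [min(int(abs(x) / scale * LUT_BINS), LUT_BINS - 1) for x in block]
    best = min(SUB_FORMATBOOKS[stage1],
               key=lambda name: sum(ERROR_LUT[name][b] for b in bins))
    return best, scale

def quantize_block(block):
    """Quantize and dequantize a block with its selected dialect."""
    name, scale = select_dialect(block)
    levels = DIALECTS[name]
    deq = [math.copysign(nearest(levels, abs(x) / scale), x) * scale for x in block]
    return name, deq
```

The point of the design is that the online cost per candidate dialect is one table lookup and one addition per element, instead of a full quantize-and-compare pass.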

B. Activation Decomposition for Sensitive Layers

Residual Re-Quantization: Certain layers (e.g., modulation layers, final MLPs) are highly sensitive to quantization errors. Instead of using mixed precision (which complicates hardware), the authors propose activation decomposition:
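- $Act = Q(Act) + \Delta$, where $\Delta = Act - Q(Act)$ is the residual error left by quantizing the activation.
- The primary activation is quantized, and the residual $\Delta$ is itself re-quantized and added back, giving $Act \approx Q(Act) + Q(\Delta)$.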
- This allows the use of a single low-precision format while recovering high-fidelity details.
Attention-Guided Salient Token Selection: Re-quantizing all tokens is too costly. The method identifies salient tokens (those with the highest attention scores to local spatiotemporal neighbors) and applies decomposition only to these.
- Scoring: Uses ReLU for temporal attention (focusing on positive correlations) and ABS for spatial/3D attention (capturing both similarity and contrast).
- Conditional Branch Awareness: The method allocates the salient token budget across branches, often prioritizing the conditional branch in Classifier-Free Guidance (CFG) setups to prevent noise corruption in the unconditional branch.
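
A minimal sketch of activation decomposition restricted to salient tokens, under stated assumptions: `quantize` is a plain symmetric 4-bit-style uniform quantizer standing in for the paper's dialect-based quantizer, and `saliency_score`, `quantize_tokens`, and the fixed `budget` argument are hypothetical names introduced only for this illustration. The ReLU and ABS scoring modes follow the description above.

```python
def quantize(values, levels=7):
    """Toy symmetric uniform quantizer with one scale per vector (stand-in for a 4-bit format)."""
    max_abs = max(abs(v) for v in values)
    if max_abs == 0:
        return [0.0 for _ in values]
    scale = max_abs / levels
    return [round(v / scale) * scale for v in values]

def decompose(values):
    """Quantize, then re-quantize the residual and add it back: Q(Act) + Q(Delta)."""
    primary = quantize(values)
    residual = [v - q for v, q in zip(values, primary)]
    return [q + r for q, r in zip(primary, quantize(residual))]

def saliency_score(scores, mode):
    """ReLU-sum for temporal attention, ABS-sum for spatial/3D attention."""
    if mode == "temporal":
        return sum(max(s, 0.0) for s in scores)
    return sum(abs(s) for s in scores)

def quantize_tokens(tokens, saliency, budget):
    """Apply decomposition only to the `budget` most salient tokens."""
    ranked = sorted(range(len(tokens)), key=lambda i: saliency[i], reverse=True)
    salient = set(ranked[:budget])
    return [decompose(t) if i in salient else quantize(t)
            for i, t in enumerate(tokens)]

def max_error(reference, approx):
    """Largest absolute element-wise difference."""
    return max(abs(a - b) for a, b in zip(reference, approx))
```

Because the residual is much smaller in magnitude than the activation, its quantization step is much finer, so the second pass recovers most of the error while every stored tensor stays in the same low-precision format.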

C. Semantic-Aware Dialect Assignment (SeDA)

Problem: Independent block-wise selection can cause the same semantic token to be quantized with different dialects across frames, disrupting temporal consistency.
Solution: SeDA enforces consistency by grouping semantically related tokens.
- Anchor Selection: Identifies "anchor tokens" based on high attention scores within spatial tiles.
- Correlated Tokens: Finds tokens strongly attending to the anchor.
- Shared Sub-Formatbook: All tokens in an anchor-correlated group are forced to share the same 8-dialect sub-formatbook. This ensures that semantically linked tokens use consistent quantization scales, preserving spatiotemporal coherence without sacrificing the flexibility of the larger 32-dialect formatbook.
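
The three SeDA steps can be sketched as below. This is an illustration under assumptions rather than the paper's procedure: the function name `seda_assign`, the fixed `threshold`, the rule that an anchor is the token receiving the most attention within its tile, and the rule that a group's sub-formatbook is chosen from the largest magnitude in the group are all hypothetical.

```python
def seda_assign(attn, token_max, tiles, threshold, pick_sub):
    """Assign a sub-formatbook id to every token.

    attn[i][j]   : attention from token i to token j
    token_max[i] : largest absolute activation of token i
    tiles        : lists of token indices, one list per spatial tile
    pick_sub     : maps a maximum magnitude to a sub-formatbook id
    """
    n = len(attn)
    # Default: each token chooses independently (the behavior SeDA replaces).
    assignment = [pick_sub(m) for m in token_max]
    claimed = set()
    for tile in tiles:
        # Anchor selection: the token that receives the most attention in its tile.
        anchor = max(tile, key=lambda j: sum(attn[i][j] for i in tile))
        # Correlated tokens: any unclaimed token that attends strongly to the anchor.
        group = [i for i in range(n)
                 if i not in claimed and (i == anchor or attn[i][anchor] >= threshold)]
        if not group:
            continue
        # Shared sub-formatbook: one choice for the whole group.
        shared = pick_sub(max(token_max[i] for i in group))
        for i in group:
            assignment[i] = shared
            claimed.add(i)
    return assignment
```

In a small case where tokens 0 and 2 are the same object in two frames, token 2 would pick a different sub-formatbook on its own, but after grouping it inherits the anchor's choice.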

D. Overhead Mitigation

Temporal Stability: The method skips SeDA during the initial unstable timesteps of the denoising process and updates anchor/correlated tokens infrequently (e.g., every 10 steps) during stable phases, only updating every step in the final refinement phase.
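
The update schedule can be pictured with a small helper. The phase boundaries (`warmup_frac`, `refine_frac`) and the function name `seda_action` are assumptions made for illustration; only the skip, every-10-steps, and every-step pattern comes from the description above.

```python
def seda_action(step, total_steps, warmup_frac=0.2, refine_frac=0.2, interval=10):
    """Decide what SeDA does at a given denoising step.

    Returns "skip" (SeDA disabled), "update" (recompute anchors and groups),
    or "reuse" (keep the groups from the last update).
    """
    warmup_end = int(total_steps * warmup_frac)
    refine_start = total_steps - int(total_steps * refine_frac)
    if step < warmup_end:
        return "skip"      # early, unstable timesteps
    if step >= refine_start:
        return "update"    # final refinement: update every step
    # Stable phase: refresh only every `interval` steps.
    return "update" if (step - warmup_end) % interval == 0 else "reuse"
```

With 50 steps and these assumed fractions, anchors are recomputed on 13 of the 50 steps, which is where the overhead saving comes from.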

3. Key Contributions

SD4 (SemanticDialect 4-bit): A calibration-free, fine-grained block-wise mixed-format quantization scheme using a 32-dialect formatbook and LUT-based selection, enabling efficient 4-bit inference.
Activation Decomposition: A technique to recover quantization errors in sensitive layers by re-quantizing residuals, applied selectively to attention-guided salient tokens to minimize overhead.
SeDA (Semantic-Aware Dialect Assignment): A novel mechanism to enforce spatiotemporal consistency by assigning semantically correlated tokens to shared sub-formatbooks, preventing "over-specialization."
State-of-the-Art Performance: Demonstrated superior performance over existing VDiT quantization methods (ViDiT-Q, Q-VDiT) and fine-grained baselines (NVFP4, MXFP4).

4. Experimental Results

The method was evaluated on Open-Sora 1.0 (factorized attention) and Open-Sora 2.0 (full 3D attention) using the VBench benchmark suite.

Quality Metrics: SemanticDialect significantly outperformed NVFP4 and other baselines across aesthetic quality, imaging quality, motion smoothness, and semantic consistency.
Near-FP16 Performance: On Open-Sora 2.0 with a block size of 16, SemanticDialect scored within ~2.3 points of the FP16 baseline on key VBench metrics.
Robustness: Unlike other 4-bit methods that failed to generate recognizable videos (producing noise or broken structures), SemanticDialect maintained structural integrity and temporal coherence.
Ablation Studies:
- Removing SeDA led to a drop in temporal consistency.
- Removing activation decomposition from the sensitive layers significantly degraded aesthetic quality.
- The LUT-based selection was found to be as effective as exact MSE calculation but much faster.

5. Significance

SemanticDialect represents a significant step forward in making high-quality video generation feasible on edge devices. By addressing the specific challenges of activation variability and spatiotemporal coherence in VDiTs, it bridges the gap between the efficiency of low-bit quantization and the fidelity required for video. The introduction of semantic awareness into the quantization process ensures that the compression does not degrade the semantic logic of the video, a critical factor often overlooked in previous quantization research. The use of LUTs for format selection makes the approach hardware-friendly and scalable, paving the way for efficient deployment of large-scale video diffusion models.