Imagine you have a brilliant, world-class chef (the Large Vision-Language Model). This chef can look at a picture of a sunset and write a poem, or look at a complex medical chart and explain it to a patient. They are incredibly talented, but they are also huge, heavy, and expensive to run. They require a massive kitchen (computer memory) and a lot of electricity (computational power).
To make this chef accessible to everyone, we want to shrink them down. We want to pack their knowledge into a small, lightweight lunchbox so they can run on a regular laptop or even a phone. This process is called Quantization.
However, there's a catch. When you try to shrink a giant chef down to fit in a lunchbox, you have to simplify their recipes. You might say, "Instead of using 16 different spices, just use 4." This is great for saving space, but it often ruins the flavor. The dish comes out bland or wrong because the chef lost the nuance of the "important" spices.
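The "fewer spices" trade-off can be made concrete with a toy sketch of uniform quantization. This is a minimal illustration of the general idea, not the paper's actual scheme: snapping weights to a coarser grid saves space but introduces error, and fewer bits means more error.

```python
import numpy as np

def quantize_dequantize(weights, num_bits):
    """Snap each weight to the nearest of 2**num_bits levels, then map back.

    A toy stand-in for weight quantization: the coarser the grid
    (fewer bits), the more "flavor" (precision) is lost.
    """
    levels = 2 ** num_bits
    w_min, w_max = weights.min(), weights.max()
    scale = (w_max - w_min) / (levels - 1)
    q = np.round((weights - w_min) / scale)  # the "simplified spice rack"
    return q * scale + w_min

rng = np.random.default_rng(0)
w = rng.normal(size=1000)
err_4bit = np.abs(w - quantize_dequantize(w, 4)).mean()
err_8bit = np.abs(w - quantize_dequantize(w, 8)).mean()
assert err_4bit > err_8bit  # fewer bits, coarser grid, larger error
```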
The Problem: The "One-Size-Fits-All" Mistake
Previous attempts to shrink these chefs used a "static" approach. They looked at the whole kitchen and said, "Okay, these 50 spices are the most important ones; let's keep them safe, and simplify the rest."
But the authors of this paper, who propose a method called Quant Experts (QE), noticed a flaw in this logic. They realized that what is important changes depending on the specific dish being cooked.
- The "Modality" Problem: If the chef is cooking a visual dish (looking at a photo), "color" might be the most important spice. If they are cooking a text dish (reading a story), "grammar" might be the most important.
- The "Token" Problem (The Big Discovery): Even more surprisingly, within the same dish, the important ingredients change for every single word (or "token") the chef writes.
- When the chef writes the word "Sunset," the "orange" spice is critical.
- When they write the word "Ocean," the "blue" spice becomes critical.
- When they write "Bird," the "wing" spice is key.
Old methods tried to use a single, fixed "safety net" to catch all the mistakes caused by simplifying the spices. But because the mistakes change with every single word, a single safety net misses most of them.
The Solution: The "Quant Experts" Team
The authors propose a new system called Quant Experts (QE). Instead of one static safety net, they create a dynamic team of specialists.
Think of it like a high-end restaurant kitchen with a Head Chef and a team of specialized Sous Chefs.
1. The Head Chef (The Shared Expert)
Some ingredients are always important, no matter what dish you are making. Maybe "salt" is always crucial.
- How QE works: They identify these "always-important" channels (ingredients) and assign them to a Shared Expert. This is a single, low-rank adapter (a small, efficient tool) that is always active. It handles the global, steady errors that happen everywhere.
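In code, an always-active low-rank adapter looks roughly like the sketch below. The dimensions, the random "quantized" weight, and the names are all illustrative assumptions; the point is that the correction `B @ (A @ x)` costs only `2 * rank * d` extra parameters and is added to every token's output.

```python
import numpy as np

rng = np.random.default_rng(1)
d_in, d_out, rank = 64, 64, 4                # rank << d: the adapter is tiny

W_q = rng.normal(size=(d_out, d_in))         # stand-in for a quantized weight
A = rng.normal(size=(rank, d_in)) * 0.01     # low-rank "down" projection
B = rng.normal(size=(d_out, rank)) * 0.01    # low-rank "up" projection

def shared_expert_forward(x):
    # The shared expert is always on: its low-rank correction B @ (A @ x)
    # is added for every token, patching the global, steady errors.
    return W_q @ x + B @ (A @ x)

x = rng.normal(size=d_in)
y = shared_expert_forward(x)
assert y.shape == (d_out,)
```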
2. The Specialized Sous Chefs (The Routed Experts)
But what about the ingredients that are only important for specific moments?
- How QE works: They group the "sometimes-important" channels into clusters based on how often they appear together.
- Cluster A: Words related to "Nature" (Sun, Tree, Rain).
- Cluster B: Words related to "Technology" (Code, Screen, Data).
- They create a Routed Expert (a specialized Sous Chef) for each cluster. Each Sous Chef has their own small tool to fix the specific mistakes that happen when cooking "Nature" dishes or "Technology" dishes.
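The grouping step can be sketched as a small clustering routine. Everything here is a toy assumption (a 0/1 "saliency" matrix, a bare-bones k-means, made-up patterns), not the paper's actual procedure, but it shows the idea: channels whose "important" moments co-occur end up in the same cluster, and each cluster gets its own routed expert.

```python
import numpy as np

def cluster_channels(saliency, k, iters=10):
    """Group channels whose important moments co-occur.

    saliency: (num_channels, num_tokens) 0/1 matrix marking when each
    channel mattered. A minimal k-means over these rows; each resulting
    cluster would be served by one routed expert.
    """
    centers = saliency[:k].astype(float)  # init from the first k channels
    for _ in range(iters):
        # Assign each channel to the nearest cluster center.
        dists = ((saliency[:, None, :] - centers[None]) ** 2).sum(-1)
        labels = dists.argmin(1)
        # Recompute each center as the mean pattern of its cluster.
        for c in range(k):
            if (labels == c).any():
                centers[c] = saliency[labels == c].mean(0)
    return labels

# Two obvious groups: channels active on early tokens vs. late tokens.
pattern_a = np.array([1, 1, 1, 0, 0, 0])
pattern_b = np.array([0, 0, 0, 1, 1, 1])
saliency = np.stack([pattern_a, pattern_b, pattern_a, pattern_b])
labels = cluster_channels(saliency, k=2)
assert labels[0] == labels[2] and labels[1] == labels[3]
assert labels[0] != labels[1]
```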
3. The Smart Manager (The Router)
When the chef starts writing a sentence, a Smart Manager (the Router) looks at the current word.
- If the word is "Sunset," the Manager says, "Hey, we need the Nature Sous Chef to help fix the errors!"
- If the word is "Algorithm," the Manager switches to the Technology Sous Chef.
- If the word is a generic connector like "and," the Manager just uses the Head Chef.
This happens instantly for every single word. The system dynamically switches between the Head Chef and the right Sous Chef to ensure the flavor (accuracy) is perfect, even though the lunchbox (quantized model) is tiny.
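Putting the pieces together, per-token routing can be sketched as below. The sizes, random weights, and top-1 gating are illustrative assumptions: the shared expert fires on every token, a tiny routing layer scores the specialists on the current token, and only the best-matching routed expert is activated.

```python
import numpy as np

rng = np.random.default_rng(2)
d, rank, num_experts = 32, 4, 3

W_router = rng.normal(size=(num_experts, d))           # the Smart Manager
shared = (rng.normal(size=(d, rank)) * 0.01,           # always-on Head Chef
          rng.normal(size=(rank, d)) * 0.01)
routed = [(rng.normal(size=(d, rank)) * 0.01,          # one Sous Chef per cluster
           rng.normal(size=(rank, d)) * 0.01) for _ in range(num_experts)]

def correction(x):
    # 1. Shared expert: applied to every single token.
    B_s, A_s = shared
    out = B_s @ (A_s @ x)
    # 2. Router: score each specialist on this token, activate the best
    #    match only (top-1 routing), and add its correction.
    best = int((W_router @ x).argmax())
    B_r, A_r = routed[best]
    out += B_r @ (A_r @ x)
    return out, best

x = rng.normal(size=d)
delta, expert_id = correction(x)
assert delta.shape == (d,) and 0 <= expert_id < num_experts
```

Because only one small routed expert runs per token on top of the shared one, the per-token cost stays tiny even as the number of specialists grows.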
Why This is a Game-Changer
In the past, trying to shrink these giant models resulted in a loss of intelligence. The models would get confused or hallucinate.
With Quant Experts:
- It's Adaptive: It doesn't just guess; it adapts to the specific context of every single word.
- It's Efficient: It doesn't need to retrain the whole model from scratch. It just adds these small, smart "Sous Chefs" on top of the existing model.
- The Results: The paper tested this on models ranging from small (2 billion parameters) to massive (72 billion parameters). Even when they compressed the models to extremely low precision (using only 4 bits for weights and 6 bits for activations), the Quant Experts method kept the model's performance almost identical to the original, full-sized version.
The Bottom Line
Imagine trying to pack a library into a backpack.
- Old Method: You just throw books in randomly and hope the most important ones survive.
- Quant Experts Method: You have a smart librarian who knows exactly which book you need right now. If you ask about history, they pull out the history book. If you ask about math, they pull out the math book. They keep the library organized and accessible, even in a tiny space.
This paper gives us a way to make giant AI models small and fast without losing their "brainpower," simply by giving them a team of specialists who know exactly what to fix, word by word.