Joint Post-Training Quantization of Vision Transformers with Learned Prompt-Guided Data Generation

This paper presents a joint post-training quantization framework for Vision Transformers that achieves state-of-the-art low-bit accuracy without any labeled data. It combines full-model optimization with a data-free calibration strategy: Stable Diffusion Turbo generates the calibration images, guided by learned multi-mode prompts.

Shile Li, Markus Karmann, Onay Urfalioglu

Published 2026-02-24

Imagine you have a brilliant, highly educated chef (the Vision Transformer) who can recognize any object in a photo with incredible accuracy. However, this chef is a giant: they need a massive kitchen, expensive ingredients, and a huge team of assistants to work. You want to put this chef in a tiny food truck (an edge device like a phone or drone) where space and power are limited.

To make this work, you need to shrink the chef's knowledge down to fit in a backpack without losing their ability to cook great meals. This process is called Quantization.

Here is how this paper solves the problem of shrinking the chef, using two main tricks:

1. The "Group Hug" Strategy (Joint Optimization)

The Problem:
Previous methods tried to shrink the chef's knowledge block-by-block. Imagine trying to shrink a complex machine by taking it apart, fixing one gear, putting it back, then fixing the next gear. The problem is that in Vision Transformers, all the gears are tightly connected. Fixing one gear in isolation often breaks the connection to the next one, causing the whole machine to jam.

The Solution:
Instead of fixing gears one by one, this paper suggests looking at the entire machine at once.

  • The Analogy: Imagine a choir. If you tell the soprano section to sing louder, the bass section might need to sing softer to keep the harmony. Old methods told each section to adjust alone. This new method tells the entire choir to adjust together in real-time.
  • The Result: By optimizing all layers simultaneously, the model learns how to compensate for errors in one part by adjusting another part. It's like a dance where everyone moves in sync, ensuring the final performance (the image recognition) remains perfect even when the "volume" (precision) is turned way down.
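The intuition above can be seen in a toy sketch. This is not the paper's actual algorithm (which trains quantization parameters with gradients over the whole model); it is a tiny numpy grid search over per-layer quantization scales for an assumed two-layer network, showing that choosing scales jointly against the final output can never do worse, and usually does better, than choosing each layer's scale against its own local error:

```python
import numpy as np

rng = np.random.default_rng(0)

def quantize(w, scale, bits=4):
    # Uniform symmetric quantization: snap weights to a grid of step `scale`.
    qmax = 2 ** (bits - 1) - 1
    return np.clip(np.round(w / scale), -qmax, qmax) * scale

# A toy two-layer "network": y = relu(x @ W1) @ W2
W1 = rng.normal(size=(8, 8))
W2 = rng.normal(size=(8, 4))
X = rng.normal(size=(64, 8))

def forward(w1, w2):
    return np.maximum(X @ w1, 0.0) @ w2

y_fp = forward(W1, W2)           # full-precision reference output
scales = np.linspace(0.05, 0.5, 40)

# Block-wise: each layer picks the scale minimizing its OWN weight error.
s1_local = min(scales, key=lambda s: np.sum((quantize(W1, s) - W1) ** 2))
s2_local = min(scales, key=lambda s: np.sum((quantize(W2, s) - W2) ** 2))
err_local = np.mean(
    (forward(quantize(W1, s1_local), quantize(W2, s2_local)) - y_fp) ** 2
)

# Joint: pick both scales together to minimize the FINAL output error,
# letting one layer compensate for the other's rounding damage.
err_joint = np.inf
for s1 in scales:
    for s2 in scales:
        e = np.mean((forward(quantize(W1, s1), quantize(W2, s2)) - y_fp) ** 2)
        err_joint = min(err_joint, e)
```

Because the joint search also covers the block-wise solution, `err_joint <= err_local` always holds; the gap between them is exactly the compensation effect the "choir" analogy describes.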

2. The "Magic Art Generator" (Data-Free Calibration)

The Problem:
To shrink the chef, you usually need to show them thousands of real photos (like 10,000 pictures of cats) to practice. But what if you don't have those photos? Maybe they are private, or you just don't have them.

  • Old Way: You might try to describe a cat to an AI art generator using a simple prompt like "a photo of a cat." The AI might give you 100 pictures of the exact same orange tabby sitting in the same spot. This is boring and doesn't teach the chef how to recognize a black cat running in the rain.
  • The Paper's Trick: They teach the AI art generator to learn multiple "personalities" for each object.
    • Instead of just one prompt, they learn 20 different "voices" for "kite." One voice thinks of a kite as a bird, another as a toy, another as a colorful shape in the wind.
    • The Analogy: Imagine you are training a security guard. Instead of showing them 1,000 photos of the same suspect in the same coat, you show them photos of the suspect in a raincoat, a suit, a hat, and running away. You teach the guard to recognize the essence of the suspect, not just one specific look.
    • How it works: The system uses a powerful AI (Stable Diffusion) to generate these diverse images automatically. It checks if the images look like the right object and ensures they look different from each other.
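The two checks mentioned above (right object, mutually different images) can be sketched with a simple greedy filter over embedding vectors. Everything here is a stand-in: the random vectors below play the role of CLIP-style image/text embeddings, and the thresholds `min_fidelity` and `max_similarity` are made-up illustrative values, not the paper's:

```python
import numpy as np

rng = np.random.default_rng(7)

def normalize(v):
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

dim = 64
# Stand-in for a text embedding of the target class, e.g. "kite".
class_embed = normalize(rng.normal(size=dim))

# Stand-ins for embeddings of generated candidate images:
# 30 on-class, one exact duplicate, and 10 off-class images.
on_class  = normalize(class_embed + 0.1 * rng.normal(size=(30, dim)))
off_class = normalize(rng.normal(size=(10, dim)))
candidates = np.vstack([on_class, on_class[:1], off_class])

def select(candidates, class_embed, k=8, min_fidelity=0.5, max_similarity=0.95):
    """Greedy filter: keep candidates that match the class but differ
    from every image already kept (a toy stand-in for the paper's checks)."""
    kept = []
    order = np.argsort(-(candidates @ class_embed))  # most class-faithful first
    for i in order:
        if candidates[i] @ class_embed < min_fidelity:
            continue  # does not look like the right object
        if any(candidates[i] @ candidates[j] > max_similarity for j in kept):
            continue  # too similar to an already-kept image (e.g. the duplicate)
        kept.append(i)
        if len(kept) == k:
            break
    return kept

kept = select(candidates, class_embed)
```

In the real system the diversity pressure also acts earlier, through the 20 learned prompt "voices" per class, so the generator rarely produces near-duplicates in the first place; this filter is just the last line of defense.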

The Results: Super Small, Super Smart

By combining the "Group Hug" (joint optimization) with the "Magic Art Generator" (diverse synthetic data), the authors achieved something amazing:

  1. Tiny Size: They compressed the models down to extremely low bit-widths (W1.58A8: every weight is stored as just -1, 0, or +1, which takes about 1.58 bits, while activations use 8 bits). Think of this as compressing a high-definition movie into a tiny file that still plays perfectly. The authors report this is the first time it has been done successfully for Vision Transformers without needing real data.
  2. Speed: The whole shrinking process takes about one hour on a single computer chip.
  3. No Real Data Needed: They proved you can train the system using only the AI-generated "magic" images. The performance is almost as good as if you had used 10,000 real photos from the internet.
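To make "W1.58A8" concrete, here is a minimal sketch of ternary weight quantization plus 8-bit activations. The absmean/absmax scaling below is one common scheme (popularized by BitNet b1.58); the paper's exact recipe may differ:

```python
import numpy as np

rng = np.random.default_rng(3)

def quantize_w158(w):
    """Ternary weights: each entry becomes -1, 0, or +1.
    Three states need log2(3) ~= 1.58 bits, hence "W1.58"."""
    scale = np.mean(np.abs(w)) + 1e-8          # absmean scale
    q = np.clip(np.round(w / scale), -1, 1)
    return q, scale

def quantize_a8(x):
    """Symmetric 8-bit activations with a per-tensor absmax scale."""
    scale = np.max(np.abs(x)) / 127 + 1e-8
    q = np.clip(np.round(x / scale), -127, 127)
    return q, scale

W = rng.normal(size=(16, 16))
X = rng.normal(size=(4, 16))

qw, sw = quantize_w158(W)
qx, sx = quantize_a8(X)

# The heavy matmul runs entirely on tiny integers;
# the two scales are folded back in at the very end.
y_q = (qx @ qw) * (sx * sw)
y_fp = X @ W   # full-precision reference for comparison
```

With ternary weights the matmul needs no multiplications at all (only additions, subtractions, and skips), which is why this format is so attractive for phones and drones.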

Summary

This paper is like inventing a way to shrink a giant, complex robot into a pocket-sized gadget.

  • They stopped fixing the robot piece-by-piece and started tuning the whole thing at once (Joint Optimization).
  • They stopped needing a warehouse of real photos and instead taught an AI to imagine thousands of diverse, perfect practice scenarios on the fly (Learned Prompt-Guided Data Generation).

The result? A smart, efficient AI that can run on your phone, recognizing objects just as well as the giant version, without ever needing to see a single real photo during the setup.
