Joint Post-Training Quantization of Vision Transformers with Learned Prompt-Guided Data Generation

This paper presents a joint post-training quantization framework for Vision Transformers that achieves state-of-the-art low-bit accuracy without any labeled data. It combines full-model optimization with a data-free calibration strategy: Stable Diffusion Turbo generates the calibration images, guided by learned multi-mode prompts.

Shile Li, Markus Karmann, Onay Urfalioglu

Published 2026-02-24

Imagine you have a brilliant, highly educated chef (the Vision Transformer) who can recognize any object in a photo with incredible accuracy. However, this chef is a giant: they need a massive kitchen, expensive ingredients, and a huge team of assistants to work. You want to put this chef in a tiny food truck (an edge device like a phone or drone) where space and power are limited.

To make this work, you need to shrink the chef's knowledge down to fit in a backpack without losing their ability to cook great meals. This process is called Quantization.

Here is how this paper solves the problem of shrinking the chef, using two main tricks:

1. The "Group Hug" Strategy (Joint Optimization)

The Problem:
Previous methods tried to shrink the chef's knowledge block-by-block. Imagine trying to shrink a complex machine by taking it apart, fixing one gear, putting it back, then fixing the next gear. The problem is that in Vision Transformers, all the gears are tightly connected. Fixing one gear in isolation often breaks the connection to the next one, causing the whole machine to jam.

The Solution:
Instead of fixing gears one by one, this paper suggests looking at the entire machine at once.

  • The Analogy: Imagine a choir. If you tell the soprano section to sing louder, the bass section might need to sing softer to keep the harmony. Old methods told each section to adjust alone. This new method tells the entire choir to adjust together in real-time.
  • The Result: By optimizing all layers simultaneously, the model learns how to compensate for errors in one part by adjusting another part. It's like a dance where everyone moves in sync, ensuring the final performance (the image recognition) remains perfect even when the "volume" (precision) is turned way down.
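The intuition above can be seen in a toy sketch. This is not the paper's actual algorithm (which trains quantization parameters with gradients over the whole model); it is a tiny numpy grid search over per-layer quantization scales for an assumed two-layer network, showing that choosing scales jointly against the final output can never do worse, and usually does better, than choosing each layer's scale against its own local error:

```python
import numpy as np

rng = np.random.default_rng(0)

def quantize(w, scale, bits=4):
    # Uniform symmetric quantization: snap weights to a grid of step `scale`.
    qmax = 2 ** (bits - 1) - 1
    return np.clip(np.round(w / scale), -qmax, qmax) * scale

# A toy two-layer "network": y = relu(x @ W1) @ W2
W1 = rng.normal(size=(8, 8))
W2 = rng.normal(size=(8, 4))
X = rng.normal(size=(64, 8))

def forward(w1, w2):
    return np.maximum(X @ w1, 0.0) @ w2

y_fp = forward(W1, W2)           # full-precision reference output
scales = np.linspace(0.05, 0.5, 40)

# Block-wise: each layer picks the scale minimizing its OWN weight error.
s1_local = min(scales, key=lambda s: np.sum((quantize(W1, s) - W1) ** 2))
s2_local = min(scales, key=lambda s: np.sum((quantize(W2, s) - W2) ** 2))
err_local = np.mean(
    (forward(quantize(W1, s1_local), quantize(W2, s2_local)) - y_fp) ** 2
)

# Joint: pick both scales together to minimize the FINAL output error,
# letting one layer compensate for the other's rounding damage.
err_joint = np.inf
for s1 in scales:
    for s2 in scales:
        e = np.mean((forward(quantize(W1, s1), quantize(W2, s2)) - y_fp) ** 2)
        err_joint = min(err_joint, e)
```

Because the joint search also covers the block-wise solution, `err_joint <= err_local` always holds; the gap between them is exactly the compensation effect the "choir" analogy describes.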

2. The "Magic Art Generator" (Data-Free Calibration)

The Problem:
To shrink the chef, you usually need to show them thousands of real photos (like 10,000 pictures of cats) to practice. But what if you don't have those photos? Maybe they are private, or you just don't have them.

  • Old Way: You might try to describe a cat to an AI art generator using a simple prompt like "a photo of a cat." The AI might give you 100 pictures of the exact same orange tabby sitting in the same spot. This is boring and doesn't teach the chef how to recognize a black cat running in the rain.
  • The Paper's Trick: They teach the AI art generator to learn multiple "personalities" for each object.
    • Instead of just one prompt, they learn 20 different "voices" for "kite." One voice thinks of a kite as a bird, another as a toy, another as a colorful shape in the wind.
    • The Analogy: Imagine you are training a security guard. Instead of showing them 1,000 photos of the same suspect in the same coat, you show them photos of the suspect in a raincoat, a suit, a hat, and running away. You teach the guard to recognize the essence of the suspect, not just one specific look.
    • How it works: The system uses a powerful AI (Stable Diffusion) to generate these diverse images automatically. It checks if the images look like the right object and ensures they look different from each other.
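The two checks mentioned above (right object, mutually different images) can be sketched with a simple greedy filter over embedding vectors. Everything here is a stand-in: the random vectors below play the role of CLIP-style image/text embeddings, and the thresholds `min_fidelity` and `max_similarity` are made-up illustrative values, not the paper's:

```python
import numpy as np

rng = np.random.default_rng(7)

def normalize(v):
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

dim = 64
# Stand-in for a text embedding of the target class, e.g. "kite".
class_embed = normalize(rng.normal(size=dim))

# Stand-ins for embeddings of generated candidate images:
# 30 on-class, one exact duplicate, and 10 off-class images.
on_class  = normalize(class_embed + 0.1 * rng.normal(size=(30, dim)))
off_class = normalize(rng.normal(size=(10, dim)))
candidates = np.vstack([on_class, on_class[:1], off_class])

def select(candidates, class_embed, k=8, min_fidelity=0.5, max_similarity=0.95):
    """Greedy filter: keep candidates that match the class but differ
    from every image already kept (a toy stand-in for the paper's checks)."""
    kept = []
    order = np.argsort(-(candidates @ class_embed))  # most class-faithful first
    for i in order:
        if candidates[i] @ class_embed < min_fidelity:
            continue  # does not look like the right object
        if any(candidates[i] @ candidates[j] > max_similarity for j in kept):
            continue  # too similar to an already-kept image (e.g. the duplicate)
        kept.append(i)
        if len(kept) == k:
            break
    return kept

kept = select(candidates, class_embed)
```

In the real system the diversity pressure also acts earlier, through the 20 learned prompt "voices" per class, so the generator rarely produces near-duplicates in the first place; this filter is just the last line of defense.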

The Results: Super Small, Super Smart

By combining the "Group Hug" (joint optimization) with the "Magic Art Generator" (diverse synthetic data), the authors achieved something amazing:

  1. Tiny Size: They compressed the models down to extremely low bit-widths (W1.58A8: every weight is stored as just -1, 0, or +1, which takes about 1.58 bits, while activations use 8 bits). Think of this as compressing a high-definition movie into a tiny file that still plays perfectly. The authors report this is the first time it has been done successfully for Vision Transformers without needing real data.
  2. Speed: The whole shrinking process takes about one hour on a single computer chip.
  3. No Real Data Needed: They proved you can train the system using only the AI-generated "magic" images. The performance is almost as good as if you had used 10,000 real photos from the internet.
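To make "W1.58A8" concrete, here is a minimal sketch of ternary weight quantization plus 8-bit activations. The absmean/absmax scaling below is one common scheme (popularized by BitNet b1.58); the paper's exact recipe may differ:

```python
import numpy as np

rng = np.random.default_rng(3)

def quantize_w158(w):
    """Ternary weights: each entry becomes -1, 0, or +1.
    Three states need log2(3) ~= 1.58 bits, hence "W1.58"."""
    scale = np.mean(np.abs(w)) + 1e-8          # absmean scale
    q = np.clip(np.round(w / scale), -1, 1)
    return q, scale

def quantize_a8(x):
    """Symmetric 8-bit activations with a per-tensor absmax scale."""
    scale = np.max(np.abs(x)) / 127 + 1e-8
    q = np.clip(np.round(x / scale), -127, 127)
    return q, scale

W = rng.normal(size=(16, 16))
X = rng.normal(size=(4, 16))

qw, sw = quantize_w158(W)
qx, sx = quantize_a8(X)

# The heavy matmul runs entirely on tiny integers;
# the two scales are folded back in at the very end.
y_q = (qx @ qw) * (sx * sw)
y_fp = X @ W   # full-precision reference for comparison
```

With ternary weights the matmul needs no multiplications at all (only additions, subtractions, and skips), which is why this format is so attractive for phones and drones.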

Summary

This paper is like inventing a way to shrink a giant, complex robot into a pocket-sized gadget.

  • They stopped fixing the robot piece-by-piece and started tuning the whole thing at once (Joint Optimization).
  • They stopped needing a warehouse of real photos and instead taught an AI to imagine thousands of diverse, perfect practice scenarios on the fly (Learned Prompt-Guided Data Generation).

The result? A smart, efficient AI that can run on your phone, recognizing objects just as well as the giant version, without ever needing to see a single real photo during the setup.
