The Big Problem: The "Heavy Suit" Dilemma
Imagine you have a world-class chef (a Deep Neural Network) who can cook incredible meals. However, this chef is used to working in a massive, fully-equipped kitchen with unlimited space and ingredients.
Now, you want to send this chef to a tiny food truck (a mobile phone or tiny chip). The food truck has very little counter space and a tiny fridge. If the chef tries to bring all their heavy pots and pans (the full model), they won't fit, and the truck will break down. This is the Out-Of-Memory (OOM) problem.
To fix this, we need to shrink the chef's tools. We can't just throw away the good knives; we need to be smart about which tools get replaced with smaller, lighter versions. This is called Mixed-Precision Quantization (MPQ). It's like deciding: "Keep the expensive, heavy steel knife for the main course, but use a cheap plastic knife for the garnish."
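In code terms, the "knife swap" is choosing a bit-width per layer: important layers keep more bits, less sensitive ones get fewer. Here is a minimal, self-contained sketch of uniform "fake" quantization with a per-layer bit-width policy. All names (`quantize`, `policy`, the layer names) are illustrative, not from the paper:

```python
import numpy as np

def quantize(weights, bits):
    """Uniformly quantize a float array to the given bit-width (symmetric)."""
    levels = 2 ** (bits - 1) - 1            # e.g. 127 representable steps for 8-bit
    scale = np.abs(weights).max() / levels  # map the largest weight to the top step
    q = np.round(weights / scale).clip(-levels, levels)
    return q * scale                        # de-quantized ("fake-quant") values

# A toy "model": the MPQ decision is simply which bit-width each layer gets.
rng = np.random.default_rng(0)
layers = {"conv1": rng.normal(size=1000), "fc": rng.normal(size=1000)}
policy = {"conv1": 8, "fc": 4}              # keep the "steel knife", shrink the rest

for name, w in layers.items():
    w_q = quantize(w, policy[name])
    err = np.mean((w - w_q) ** 2)
    print(f"{name}: {policy[name]}-bit, quantization MSE {err:.6f}")
```

Fewer bits means smaller memory but larger rounding error; MPQ is the art of spending the bits where they matter.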
The Old Way: The "Expert Chef" vs. The "Expensive Trial-and-Error"
Previously, figuring out which tools to shrink was done in two difficult ways:
- The "Expert" Method: You hired a human expert to look at the kitchen and manually decide which tools to swap.
  - The Downside: This takes a long time, costs a lot of money, and if you change the menu (the data), the expert has to start over.
- The "Trial-and-Error" Method: You used a computer to try millions of combinations, cooking the meal over and over to see what worked.
  - The Downside: This burns a massive amount of electricity and time. It's like trying to find the perfect recipe by cooking 10,000 cakes just to find the one that doesn't burn.
The New Solution: Meet "TAP" (The AI Architect)
The authors of this paper introduce TAP (Training-free Automatic Proxy). Think of TAP as a Super-Intelligent Architect who has read every cookbook in the world (thanks to being a Large Language Model or LLM) and can instantly design the perfect kitchen layout for the food truck without ever actually cooking a single meal.
Here is how TAP works, step-by-step:
1. The "Dream Team" (The LLM)
Instead of a human expert or a brute-force computer, TAP uses an AI that understands language and logic. It doesn't need to "learn" by cooking; it already knows the principles of cooking (math and logic) from its training.
2. The "Evolutionary Game" (Evolutionary Search)
TAP doesn't just guess once. It plays a game of "Evolution":
- Generation 1: It asks the AI to write down 10 different ideas for shrinking the tools.
- The Test: It quickly checks these ideas against a small sample of food (a tiny dataset) to see which ideas are "fit" (work well).
- The "DPO" Coach: This is the paper's secret sauce. Imagine a coach watching the game. If Idea A works better than Idea B, the coach doesn't change the AI's brain (which would take too long). Instead, the coach just whispers to the AI: "Hey, next time, try asking for ideas that look more like Idea A."
  - This is called Direct Preference Optimization (DPO). It's like tuning a radio dial to find the clearest station without rebuilding the radio.
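The evolutionary game above can be sketched as a toy loop. Here a random mutator stands in for the LLM's proposals, a made-up scoring function stands in for the tiny-dataset test, and simple (winner, loser) pairs stand in for the preference signal that DPO would consume. Everything in this sketch (the fitness rule, the bit-width menu) is an invented stand-in, not TAP's actual method:

```python
import random

def fitness(policy, calib):
    """Stand-in for the tiny-dataset test: reward small average bit-width,
    but penalize layers squeezed below 4 bits (toy objective, not the paper's)."""
    cost = sum(policy) / len(policy)
    penalty = sum(1 for b in policy if b < 4)
    return -(cost + 2 * penalty)

def mutate(policy, choices=(2, 4, 6, 8)):
    """Stand-in for the LLM proposing a tweaked idea: change one layer's bits."""
    p = list(policy)
    p[random.randrange(len(p))] = random.choice(choices)
    return tuple(p)

def evolve(n_layers=6, pop_size=10, rounds=5, seed=0):
    random.seed(seed)
    pop = [tuple(random.choice((2, 4, 6, 8)) for _ in range(n_layers))
           for _ in range(pop_size)]
    prefs = []                               # (winner, loser) pairs: the "coach's whispers"
    for _ in range(rounds):
        pop.sort(key=lambda p: fitness(p, None), reverse=True)
        prefs.append((pop[0], pop[-1]))      # best vs worst of this generation
        # keep the top half, and let the "LLM" riff on the survivors
        pop = pop[:pop_size // 2] + [mutate(p) for p in pop[:pop_size // 2]]
    return max(pop, key=lambda p: fitness(p, None)), prefs

best, prefs = evolve()
print("best bit-widths:", best, "| preference pairs collected:", len(prefs))
```

In the real system, those preference pairs would steer the LLM's next batch of proposals (the DPO step) rather than just feeding a random mutator.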
3. The Result: A Perfect Blueprint in Seconds
After just a few rounds of this game (usually 5 rounds), TAP settles on a blueprint. It tells you exactly which tools to shrink and which to keep.
- Speed: It does this in seconds.
- Data: It only needs a tiny taste of food (16 samples) to figure it out, whereas old methods needed a whole banquet (thousands of samples).
- No Training: The AI doesn't need to "study" or "re-train" itself. It just uses its existing knowledge to solve the puzzle.
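How can 16 samples be enough? One common training-free trick, shown here as an illustration rather than as TAP's exact test, is to measure how far a quantized forward pass drifts from the full-precision one on a tiny calibration batch. All names and sizes below are assumptions for the sketch:

```python
import numpy as np

def forward(x, weights, bits=None):
    """One-layer toy forward pass, optionally with fake-quantized weights."""
    w = weights
    if bits is not None:
        levels = 2 ** (bits - 1) - 1
        scale = np.abs(w).max() / levels
        w = np.round(w / scale).clip(-levels, levels) * scale
    return np.tanh(x @ w)

rng = np.random.default_rng(0)
calib = rng.normal(size=(16, 32))      # just 16 calibration samples, no labels needed
weights = rng.normal(size=(32, 10))

ref = forward(calib, weights)          # full-precision reference outputs
for bits in (8, 4, 2):
    drift = np.mean((forward(calib, weights, bits) - ref) ** 2)
    print(f"{bits}-bit output drift: {drift:.6f}")
```

Because the check only compares outputs, no gradients and no retraining are involved; a handful of samples is enough to rank candidate bit-width choices.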
Why This is a Game Changer
- No More Human Experts Needed: You don't need a PhD in math to design these systems anymore. The AI does the heavy lifting.
- It's Universal: The recipe TAP discovers for a ResNet (a type of AI) transfers to a ViT (a different type of AI) or even to a new dataset. It's like a universal adapter that fits any plug.
- Efficiency: It saves massive amounts of energy and time. Instead of burning millions of dollars in electricity to "train" the quantization method, TAP just "thinks" about it and solves it instantly.
The Bottom Line
Imagine you have a giant, heavy library you need to fit into a backpack.
- Old Way: You hire a librarian to manually pick books, or you try to stuff the whole library in and see what falls out (very slow and messy).
- TAP Way: You ask a super-smart librarian who has read every book. They instantly tell you: "Keep the encyclopedias, shrink the comics, and throw away the magazines." They do this in a split second, using only a tiny sample of your library, and the selection holds up remarkably well.
TAP proves that Large Language Models can be used not just to write poems or chat, but to solve complex engineering problems, making AI faster, smaller, and accessible to everyone.