QFT: Quantized Full-parameter Tuning of LLMs with Affordable Resources

The paper proposes QFT, a framework that enables affordable full-parameter fine-tuning of large language models on a single consumer GPU by storing all training states in INT8 format. A quantization-robust Lion optimizer and a hybrid feature quantizer preserve performance while drastically reducing memory consumption.

Zhikai Li, Xiaoxuan Liu, Banghua Zhu, Zhen Dong, Qingyi Gu, Kurt Keutzer

Published 2026-03-19

Imagine you have a giant, incredibly smart library (a Large Language Model or LLM) with billions of books. You want to teach this library a new skill, like writing poetry or coding, by giving it a specific set of practice books (fine-tuning).

The problem? The library is so massive that to do this training, you usually need a supercomputer the size of a small house, costing hundreds of thousands of dollars. Most people can't afford that.

Enter QFT (Quantized Full-Parameter Tuning).

Think of QFT as a clever "packing and moving" strategy that lets you train this giant library using just a standard home computer (like a single high-end gaming GPU). Here is how it works, broken down into simple analogies:

1. The Problem: The "Heavy Suitcase"

Normally, when you train a model, you have to carry around three heavy things:

  • The Weights: The actual knowledge of the model (like the books).
  • The Gradients: The notes you take on what to change (like a highlighter).
  • The Optimizer States: The momentum and history of your changes (like a running tally of your progress).

In standard training, everything is stored in FP32 (32-bit floating point). Imagine this as writing every single note in your notebook with a thick, permanent marker. It's very precise, but it takes up a lot of space. For a 7-billion-parameter model, this "notebook" demands over 100GB of GPU memory. That's like trying to carry a grand piano up a flight of stairs.
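
The arithmetic behind that 100GB+ figure can be sketched in a few lines. This is a hypothetical helper for back-of-envelope estimates only; it assumes two optimizer states (as in Adam) and ignores activations and framework overhead.

```python
# Rough memory estimate for full fine-tuning: weights + gradients +
# optimizer states, all stored at the same precision.
# Assumption: two optimizer states (Adam-style momentum and variance).

def training_memory_gb(n_params: float, bytes_per_value: int,
                       n_optimizer_states: int = 2) -> float:
    """Total training-state memory in gigabytes (activations excluded)."""
    n_tensors = 1 + 1 + n_optimizer_states  # weights, grads, optimizer states
    return n_params * bytes_per_value * n_tensors / 1e9

fp32_gb = training_memory_gb(7e9, 4)  # FP32 = 4 bytes per value
int8_gb = training_memory_gb(7e9, 1)  # INT8 = 1 byte per value
print(f"FP32: {fp32_gb:.0f} GB, INT8: {int8_gb:.0f} GB")
```

Even this simplified count lands at roughly 112GB for FP32 training states alone, which is why a single GPU can't hold them.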

2. The Solution: Switching to "Pencil Notes" (Quantization)

QFT says, "Why write everything in thick markers? Let's use pencils."
It converts all those heavy numbers into INT8 (8-bit integers). This is like switching from a thick marker to a standard pencil. You still write the same information, but each number takes up 75% less space (8 bits instead of 32).

However, there's a catch: If you just switch to pencils, you might lose too much detail, and the model might get confused or stop learning effectively. The paper solves this with two special tricks.
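
To make "switching to pencils" concrete, here is a minimal symmetric INT8 quantizer: scale values into the [-127, 127] range, round to integers, and multiply the scale back when you need floats again. This illustrates the generic idea, not the paper's exact scheme.

```python
import numpy as np

def quantize_int8(x: np.ndarray):
    """Map floats to INT8 with a single per-tensor scale (symmetric)."""
    scale = np.abs(x).max() / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate floats; error is at most ~half a quantization step."""
    return q.astype(np.float32) * scale

x = np.array([0.1, -0.5, 0.25, 0.9], dtype=np.float32)
q, s = quantize_int8(x)
x_hat = dequantize(q, s)  # close to x, but snapped to 255 levels
```

The catch the text describes is visible here: everything is forced onto one shared scale, so fine detail (and any extreme value) costs precision for all the others.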

3. Trick #1: The "Lion" Optimizer (The Steady Hand)

Usually, when you shrink your notes (quantize), the "momentum" part of your learning gets shaky. It's like trying to run while wearing heavy boots; you might stumble.

The authors discovered that the Lion optimizer is like a runner with a very steady, rhythmic stride.

  • How it works: Instead of tracking complex, messy details, Lion just looks at the direction of the change (positive or negative) and keeps the step size consistent.
  • The Analogy: Imagine you are navigating a maze. Most methods try to measure the exact distance to every wall (very precise, but heavy). Lion just says, "Is the wall to my left or right?" and takes a steady step. Because it doesn't rely on messy, tiny details, it doesn't care if your notes are written in pencil (INT8) instead of marker. It stays stable even with the "low-quality" notes.
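
The "steady stride" above can be sketched as a Lion-style update in NumPy: only the sign of the interpolated momentum drives the step, which is what makes it tolerant of coarsely stored (INT8) momentum. Hyperparameter names follow the Lion optimizer; this is an illustrative sketch, not the authors' code.

```python
import numpy as np

def lion_step(w, grad, m, lr=1e-4, beta1=0.9, beta2=0.99, wd=0.0):
    """One Lion-style update: step by the *sign* of the blended momentum."""
    update = np.sign(beta1 * m + (1 - beta1) * grad)  # direction only, +/-1
    w = w - lr * (update + wd * w)                    # constant-magnitude step
    m = beta2 * m + (1 - beta2) * grad                # momentum for next step
    return w, m

w, m = np.zeros(3), np.zeros(3)
grad = np.array([1.0, -2.0, 0.5])
w, m = lion_step(w, grad, m)
```

Because `np.sign` throws away magnitude anyway, rounding the momentum to 8 bits rarely flips the direction of the step, which is the stability property the paper relies on.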

4. Trick #2: The "Hybrid Feature" Quantizer (The VIP Section)

When you shrink the "Weights" (the actual knowledge), you run into a problem: Outliers.
Imagine a graph of the model's knowledge. 99% of the data is a nice, tight cluster in the middle. But 1% of the data consists of extreme, wild values (outliers) that are super important. If you try to squeeze the whole graph into a small box (INT8), those wild outliers get squished and lost, ruining the model's intelligence.

QFT's Solution:

  • The Analogy: Think of a concert hall. 99% of the audience is sitting in the general seating area. But there are 1% of VIPs (the outliers) who need special seats.
  • The Method: QFT separates the crowd. It keeps the 1% VIPs in their original, high-quality format (floating point) so they aren't squished. It then packs the remaining 99% of the crowd tightly into the INT8 format.
  • The Result: You save massive space because the VIPs are rare, but you don't lose any critical information.
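
The VIP-section idea can be sketched as an outlier-aware quantizer: the top-magnitude ~1% of values stay in floating point, and only the dense 99% are squeezed into INT8. The function names and the 1% threshold here are illustrative, not the paper's exact implementation.

```python
import numpy as np

def hybrid_quantize(x: np.ndarray, outlier_frac: float = 0.01):
    """Keep the largest-magnitude values in float; quantize the rest to INT8."""
    k = max(1, int(len(x) * outlier_frac))
    outlier_idx = np.argsort(np.abs(x))[-k:]      # indices of the "VIPs"
    outliers = x[outlier_idx].astype(np.float32)  # stored in full precision
    body = x.copy()
    body[outlier_idx] = 0.0                       # remove VIPs from the bulk
    scale = np.abs(body).max() / 127.0
    if scale == 0.0:
        scale = 1.0
    q = np.clip(np.round(body / scale), -127, 127).astype(np.int8)
    return q, scale, outlier_idx, outliers

def hybrid_dequantize(q, scale, outlier_idx, outliers):
    x = q.astype(np.float32) * scale
    x[outlier_idx] = outliers                     # restore the VIPs exactly
    return x

x = np.array([0.1, -0.2, 0.05, 100.0], dtype=np.float32)
q, scale, idx, out = hybrid_quantize(x, outlier_frac=0.25)
x_hat = hybrid_dequantize(q, scale, idx, out)
```

Note the payoff: without the split, the single 100.0 would stretch the scale so far that the small values all round to zero; with it, the bulk keeps a tight scale and the outlier survives unchanged.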

5. The "Stack" Trick (The Assembly Line)

Normally, computers need to keep a "backup copy" of everything in high precision to do the math backwards (backpropagation). QFT gets rid of this backup.
Instead, it uses a Stack-based Gradient Flow.

  • The Analogy: Imagine a stack of plates. As you work your way through the layers of the model, you put the "notes" (gradients) on a stack. When you need to update the model, you just pop the top plate off. It's a simple, instant (O(1)) way to move data without needing a massive warehouse to store backups.
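
The plate-stack analogy can be sketched with a plain Python list used as a LIFO stack: gradients are pushed as backpropagation produces them and popped in O(1) to apply updates, so no separate high-precision backup of every tensor accumulates. Layer names and the plain SGD-style update are illustrative placeholders.

```python
# Illustrative stack of (layer_name, gradient) pairs produced by backprop.
grad_stack = []

def backward_layer(name: str, grad: float) -> None:
    """O(1) push as backpropagation reaches this layer."""
    grad_stack.append((name, grad))

def update_next_layer(weights: dict, lr: float = 0.1) -> str:
    """O(1) pop: update the most recently back-propagated layer, then discard."""
    name, grad = grad_stack.pop()
    weights[name] = weights[name] - lr * grad
    return name

weights = {"layer1": 1.0, "layer2": 2.0}
backward_layer("layer2", 2.0)   # backprop visits the last layer first...
backward_layer("layer1", 1.0)   # ...then earlier layers
update_next_layer(weights)      # pops "layer1" (last pushed)
update_next_layer(weights)      # pops "layer2"
```

The key property is that each gradient lives only between its push and its pop, which is what lets the full-precision "warehouse" of backups disappear.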

The Bottom Line

By combining these tricks, QFT allows you to train a massive AI model on a single graphics card (such as an NVIDIA A6000) using less than 30GB of memory.

  • Before: You needed a 100GB+ server (expensive, rare).
  • After: You need a 30GB card (affordable, common).

The paper proves that even though they are using "pencil notes" (INT8) instead of "markers" (FP32), the model learns just as well. It's like proving you can write a masterpiece with a cheap pencil just as beautifully as with a gold pen, as long as you have the right technique.