Imagine you have a brilliant but incredibly heavy library of knowledge (a Large Language Model). It knows a huge amount, but it's so heavy that it's hard to carry around on a phone or a small computer. To make it portable, you decide to "compress" it: you take the massive, high-definition books and shrink them down into tiny, pocket-sized pamphlets. This process is called Quantization.
This paper asks a very important question: When we shrink these books to make them lighter, do we accidentally tear out the pages about fairness, or do we accidentally scribble in some bad stereotypes?
Here is the breakdown of what the researchers found, using some everyday analogies.
1. The "Shrinking" Process
Think of the AI model as a giant, high-resolution photograph.
- Original Model: A 4K photo. Every pixel is perfect, every detail is sharp.
- Quantization: You compress that photo into a low-resolution JPEG to save space.
- The Goal: You want the photo to still look good enough to recognize a face, but you don't care if the tiny details are slightly blurry.
The researchers tested different ways of shrinking the photo (different "strategies" like GPTQ, AWQ, and SmoothQuant) and different levels of compression (from "slightly blurry" to "very blocky").
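The paper's strategies (GPTQ, AWQ, SmoothQuant) are more sophisticated than this, but the basic "compression" idea can be sketched with simple round-to-nearest uniform quantization. This is an illustrative assumption, not the paper's actual implementation: each weight gets snapped to one of a small number of evenly spaced levels, and fewer bits means coarser levels and more lost detail.

```python
import numpy as np

def quantize_dequantize(weights, bits):
    """Round-to-nearest uniform quantization: map floats onto 2**bits evenly
    spaced levels, then map back. The round trip loses precision, like saving
    a 4K photo as a lower-quality JPEG."""
    levels = 2 ** bits - 1                              # e.g. 15 steps at 4 bits
    lo = weights.min()
    scale = (weights.max() - lo) / levels               # width of one step
    codes = np.round((weights - lo) / scale)            # integer codes 0..levels
    return codes * scale + lo                           # back to approximate floats

rng = np.random.default_rng(0)
w = rng.normal(size=1000).astype(np.float32)            # stand-in for model weights

for bits in (8, 4, 2):                                  # "slightly blurry" to "very blocky"
    err = np.abs(w - quantize_dequantize(w, bits)).mean()
    print(f"{bits}-bit mean absolute error: {err:.4f}")
```

Running this shows the reconstruction error growing as the bit width shrinks, which is the "blurriness" the analogy describes.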
2. The Good News: The "Toxicity Filter"
One of the most surprising findings is that shrinking the model actually made it less toxic.
- The Analogy: Imagine a loud, rowdy party (the original AI). Sometimes, the guests say mean, offensive things. When you shrink the model, it's like turning down the volume on the speakers and asking everyone to speak in hushed tones.
- The Result: The "quantized" models generated significantly fewer swear words and hateful comments. It seems that the compression process accidentally acts as a "moral filter," smoothing out the rough edges and making the AI a bit more polite.
3. The Bad News: The "Stereotype Amplifier"
However, while the AI became less mean, it became more stubborn about stereotypes.
- The Analogy: Imagine a student who is trying to answer a test question.
- The Original Model: Knows the facts perfectly. If asked, "Who is the nurse?" it might say, "It could be a man or a woman," because it retains enough detail to weigh both possibilities.
- The Compressed Model: Because it's "blurry" and less certain, it starts guessing based on the most obvious, cliché patterns it remembers. If asked, "Who is the nurse?" it's more likely to guess "Woman" just because that's the most common pattern in its training data, even if the context suggests otherwise.
- The Result: The compressed models were more likely to make unfair decisions (like assuming a man is the boss and a woman is the assistant) and were more likely to rely on old-fashioned stereotypes. They didn't become more evil, but they became less thoughtful and more reliant on lazy assumptions.
4. The "Reasoning" Superpower
The paper also looked at "Reasoning Models" (AIs that are taught to think step-by-step, like a math tutor) versus regular models.
- The Analogy:
- Regular Model: A fast runner who sprints to the finish line. They might trip over a stereotype because they are rushing.
- Reasoning Model: A hiker who stops to look at the map. They think, "Wait, is this actually true?" before answering.
- The Result: The "Reasoning" models were naturally less biased to begin with. But here's the catch: When you compress them, they lose that superpower. If you shrink a "Reasoning" model too much, it stops thinking step-by-step and starts guessing, just like the regular models. The compression "dumbs down" their ability to be fair.
5. The "Fairness" Gap
When the researchers asked the AI to make decisions (like "Who gets the loan?"), the compressed models were slightly more unfair.
- The Analogy: Imagine a judge who is tired and has a headache (the compressed model). They are more likely to make a quick, biased decision based on a gut feeling rather than carefully weighing the evidence.
- The Result: The compressed models were more likely to pick one group over another unfairly, especially when the compression was very aggressive (making the model very small).
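One simple way to put a number on that kind of gap (an illustration of the general idea, not the paper's exact metric) is the demographic parity difference: how far apart the approval rates are between two groups. A fair decision-maker would score near zero; a biased one drifts upward.

```python
def demographic_parity_gap(decisions, groups):
    """Absolute difference in positive-decision rates between two groups.
    0.0 means both groups are approved at the same rate."""
    rate = {}
    for g in set(groups):
        picks = [d for d, grp in zip(decisions, groups) if grp == g]
        rate[g] = sum(picks) / len(picks)
    a, b = rate.values()
    return abs(a - b)

# Toy loan decisions (made-up numbers): 1 = approved, 0 = denied
decisions = [1, 1, 0, 1, 0, 0, 1, 0]
groups    = ["A", "A", "A", "A", "B", "B", "B", "B"]
print(demographic_parity_gap(decisions, groups))  # group A: 75%, group B: 25%
```

With these toy numbers the gap is 0.5; a model whose gap grows after compression is making the kind of lopsided calls described above.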
The Big Takeaway
The paper concludes that Quantization is a trade-off.
- Pros: It makes the AI faster, cheaper to run, and surprisingly, less toxic.
- Cons: It makes the AI more stereotypical and less fair in its decisions. It also makes the AI "dumber" at thinking things through.
The Final Lesson:
If you want to run an AI on a phone or a small device, you have to compress it. But you can't just compress it blindly. You have to be careful. If you shrink it too much (like going from a 4K photo to a tiny thumbnail), you might save space, but you lose the "humanity" and fairness of the model. You get a polite robot that is also a bit prejudiced and not very smart.
The researchers are telling us: "Be careful with your compression settings. Don't just look at how small the model is; check if it's still being fair."