Imagine you are trying to teach a giant, super-smart robot (a Large Language Model) to understand the world. To do this, you need three main ingredients:
- The Brain Size (Model Size): How many parameters (its "neurons") the robot has.
- The Textbooks (Dataset Size): How many pages of data you feed it.
- The Notebook Quality (Precision): How detailed the robot's internal notes are.
Usually, to make the robot smarter, you just make the brain bigger and give it more textbooks. But this costs a fortune in electricity and computing power. So, engineers started using "low-precision" training. Think of this as switching from writing in a high-definition, 4K notebook to a cheap, low-resolution sketchpad. It saves space and energy, but does it make the robot dumber?
This paper is a theoretical detective story that answers exactly how and why using a sketchpad changes the robot's learning ability. The authors discovered that not all sketchpads are created equal. There are two distinct types, and they break the robot's brain in completely different ways.
The Two Types of "Sketchpads"
The paper divides low-precision training into two categories, which we can think of as The "Proportional" Notebook and The "Flat" Notebook.
1. The "Proportional" Notebook (Multiplicative Quantization)
- Real-world example: Floating-point numbers (like FP8 or FP16).
- The Analogy: Imagine you are drawing a map. In this notebook, the size of your pencil strokes depends on the size of the object you are drawing.
- If you are drawing a massive mountain, your pencil stroke is thick and bold.
- If you are drawing a tiny ant, your pencil stroke is incredibly fine and delicate.
- The Result: The "noise" (the fuzziness of the drawing) scales with the object. The tiny details (the ant) still get drawn, just with a little bit of fuzz.
- The Paper's Finding: This type of notebook does not shrink the robot's brain. Even though the notes are fuzzy, the robot can still learn from every single one of its neurons. It effectively keeps the full "brain size" intact, but it makes the "textbooks" slightly less effective, because the noise means each page of data teaches the robot a little less.
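To make the "proportional" idea concrete, here is a small Python sketch (our own illustration, not code from the paper; the mantissa-rounding scheme and the 3-bit width are arbitrary choices) that mimics floating-point-style quantization: it keeps only a few bits of detail, so the rounding error stays proportional to the value's size.

```python
import math

def quantize_multiplicative(x, mantissa_bits=3):
    """Keep only `mantissa_bits` bits of detail, FP8-style.
    The rounding error is a fixed *fraction* of x, so tiny
    values get fuzzy but never vanish."""
    if x == 0.0:
        return 0.0
    # the rounding step scales with the magnitude of x
    exponent = math.floor(math.log2(abs(x)))
    step = 2.0 ** (exponent - mantissa_bits)
    return round(x / step) * step

for x in [1000.0, 1.0, 0.001]:  # a mountain, a rock, an ant
    q = quantize_multiplicative(x)
    print(f"{x:>8} -> {q:.10g}  (relative error {abs(q - x) / x:.3f})")
```

Run it and the relative error is comparable for the mountain and the ant: the ant's outline gets a little fuzzy, but the ant never disappears.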
2. The "Flat" Notebook (Additive Quantization)
- Real-world example: Integer numbers (like INT8).
- The Analogy: Imagine a notebook where every single line you draw has the exact same thickness, no matter what you are drawing.
- If you try to draw a massive mountain, a thick line is fine.
- But if you try to draw a tiny ant, that thick line completely covers the ant! The ant disappears into a blob of ink.
- The Result: This notebook introduces a "floor" of noise. It's like a constant static hiss on a radio that is loud enough to drown out the quiet whispers.
- The Paper's Finding: This type of notebook shrinks the robot's brain. Because the "thick lines" drown out the tiny details, the robot effectively loses access to its smaller, more subtle neurons. It's as if the robot's brain physically got smaller; it can no longer use its full capacity because the "tail end" of its knowledge is just too noisy to be useful.
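For contrast, here is the same kind of sketch for the "flat" notebook (again our own illustration; the step size is made up): every value is rounded to the nearest multiple of a fixed step, the way an integer format with a fixed scale would round it.

```python
def quantize_additive(x, step=0.125):
    """Round x to the nearest multiple of a fixed step,
    INT8-style. The error is bounded by step/2 no matter
    how big or small x is."""
    return round(x / step) * step

for x in [1000.0, 1.0, 0.001]:  # a mountain, a rock, an ant
    print(f"{x:>8} -> {quantize_additive(x)}")
```

The mountain survives untouched, but the ant rounds straight to zero: any signal smaller than half the step is simply gone.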
The Big Discovery: A Critical Split
Before this paper, people were arguing about whether low-precision training just added a little "error penalty" (like a bad signal) or if it actually reduced the model's capacity (like shrinking the brain).
The authors proved that both are true, but it depends on which "notebook" you use:
- If you use Floating-Point (Proportional): You get a slight penalty, but your brain stays the same size. You just need more data to overcome the noise.
- If you use Integer (Flat): You get a penalty, AND your brain effectively shrinks. You lose the ability to learn complex, subtle patterns because the "quiet" parts of your brain are drowned out by the noise.
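The split can be seen numerically in a hypothetical toy experiment (the power-law weight profile and the bit/step choices below are our own illustration, not the paper's setup): give a "robot" 1,000 weights whose sizes trail off into a long tail, then count how many stay distinguishable from zero under each scheme.

```python
import math

def round_fp(x, mantissa_bits=3):
    # proportional rounding: the step scales with |x|
    if x == 0.0:
        return 0.0
    step = 2.0 ** (math.floor(math.log2(abs(x))) - mantissa_bits)
    return round(x / step) * step

def round_int(x, step=0.01):
    # flat rounding: the same step for every value
    return round(x / step) * step

# weights with a long tail of ever-smaller values
weights = [i ** -1.5 for i in range(1, 1001)]
alive_fp = sum(round_fp(w) != 0.0 for w in weights)
alive_int = sum(round_int(w) != 0.0 for w in weights)
print(alive_fp, alive_int)  # the flat scheme zeroes out most of the tail
```

Under the proportional scheme every weight stays usable; under the flat scheme the small tail rounds to zero, which is exactly the sense in which the effective "brain size" shrinks.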
Why This Matters for the Future
This isn't just math for math's sake. It gives engineers a rulebook for building AI in the future.
- If you want to save money: You can use low-precision training.
- If you use Floating-Point: You can keep your model huge and just feed it more data.
- If you use Integer (which is cheaper): You have to accept that your model is effectively smaller. You might need to design a different architecture or accept that you can't just scale up the model size indefinitely without hitting a wall.
The Bottom Line
Think of training an AI like building a skyscraper.
- High Precision is using perfect, laser-cut steel beams.
- Low Precision is using slightly warped, cheaper beams.
This paper tells us:
- If the warping is proportional (bigger beams warp a bit more, tiny beams warp a tiny bit), the building stands tall, but you might need a few extra workers (data) to fix the wobbles.
- If the warping is flat (every beam has the same amount of bend), the top floors of the building (the complex, subtle parts of the AI) will collapse because the tiny beams can't support the weight. The building effectively becomes shorter.
The authors have provided the mathematical proof to help us decide which "cheap beams" to use so we can build the tallest, smartest AI towers possible without running out of money.