The Thermodynamic Costs of Simple Linear Regression

Original authors: Samuel H. D'Ambrosia, Sultan M. Daniels, Michael R. DeWeese, Anant Sahai

Published 2026-05-20

Original paper licensed under CC BY 4.0 (http://creativecommons.org/licenses/by/4.0/). This is an AI-generated explanation of the paper below. It is not written or endorsed by the authors. For technical accuracy, refer to the original paper.

The Big Picture: The Energy Bill of Learning

Imagine you are trying to teach a robot to draw a straight line through a scatter of dots on a piece of paper. This is a basic task called linear regression. Usually, we think about how accurate the robot is or how fast it learns.

This paper asks a different question: What is the minimum energy cost of the information the robot has to erase in order to learn that line?

The authors use a concept from physics called Landauer's Principle. Think of it like this: Every time a computer erases a piece of information (like forgetting the old guess to make room for a new one), it must release a tiny amount of heat. It's like sorting a shuffled deck of cards: going from many possible orderings to one specific ordering takes work, and that work ends up as heat. The paper calculates the minimum amount of energy that must be dissipated just by the act of learning a simple line.
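To get a feel for the scale involved, here is a minimal sketch of the Landauer bound in its textbook form, k_B · T · ln 2 joules per erased bit. The function name `landauer_joules` and the 300 K room-temperature default are assumptions made for illustration; they are not taken from the paper.

```python
import math

# Boltzmann constant in joules per kelvin (exact value in the 2019 SI).
K_B = 1.380649e-23


def landauer_joules(bits_erased, temperature_kelvin=300.0):
    """Minimum heat (in joules) released when erasing `bits_erased` bits.

    Textbook Landauer bound: k_B * T * ln(2) per bit.
    The 300 K default is an illustrative room-temperature assumption.
    """
    return bits_erased * K_B * temperature_kelvin * math.log(2)


one_bit = landauer_joules(1)
one_gigabyte = landauer_joules(8e9)
print(f"one bit at 300 K:      {one_bit:.3e} J")
print(f"one gigabyte at 300 K: {one_gigabyte:.3e} J")
```

The bound is a floor, not an estimate of what real hardware spends, which is far above it.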

The Main Characters: The Data and The Bits

To understand the cost, the authors look at how computers store numbers. Computers don't store perfect, smooth numbers like $3.14159...$ forever. They chop them up into bits (0s and 1s).

They focus on a specific format called floating-point numbers, which is how modern computers handle decimals. A floating-point number works like scientific notation and has two main parts:

The Exponent: This is the "zoom level." It tells you if the number is huge (like a galaxy) or tiny (like a grain of sand).
The Mantissa: This is the "detail level." It tells you the specific digits (the 3, the 1, the 4, etc.).
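The split into exponent and mantissa can be seen directly by unpacking the bits of a number. The sketch below assumes the common IEEE 754 single-precision layout (1 sign bit, 8 exponent bits, 23 stored mantissa bits); the paper may analyze other formats, and the helper name `float32_fields` is hypothetical.

```python
import struct


def float32_fields(x):
    """Split x, rounded to IEEE 754 single precision, into its bit fields.

    Returns (sign, exponent, mantissa) as integers:
    1 sign bit, 8 biased exponent bits, 23 stored mantissa bits.
    """
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31
    exponent = (bits >> 23) & 0xFF   # biased by 127
    mantissa = bits & 0x7FFFFF       # fraction bits after the implicit leading 1
    return sign, exponent, mantissa


for value in (1.0, -2.0, 0.75, 3.14159):
    sign, exponent, mantissa = float32_fields(value)
    print(f"{value:>8}: sign={sign} exponent={exponent:08b} mantissa={mantissa:023b}")
```

Numbers of similar size share the same exponent field and differ only in the mantissa, which is the "zoom level" versus "detail level" distinction described above.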

The Big Discovery:
The paper finds that the Mantissa (the detail bits) is the expensive part.

Analogy: Imagine the Exponent is the size of the box you put your data in, and the Mantissa is the number of items inside the box.
The authors show that adding more "zoom levels" (Exponent bits) doesn't cost much energy. But adding more "detail" (Mantissa bits) costs a lot.
Why? Because the detail bits are close to unpredictable, while the zoom level is almost always one of just a few values. Erasing unpredictable bits is what costs energy, so the details dominate the bill. Noise matters too: the noisier the dataset (the lower its signal-to-noise ratio), the higher the energy cost of finding the signal in it.
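The contrast between the two kinds of bits can be checked empirically. The following is a rough sketch, not the paper's analytic derivation: it draws Gaussian samples, splits each into an exponent and a truncated mantissa using `math.frexp`, and estimates the entropy of each field from sample counts. The 8-bit mantissa, the sample size, and the function names are assumptions; the sign bit, subnormals, and overflow are ignored.

```python
import math
import random
from collections import Counter


def entropy_bits(counts):
    """Plug-in (empirical) Shannon entropy in bits of a Counter of outcomes."""
    total = sum(counts.values())
    return -sum(c / total * math.log2(c / total) for c in counts.values())


def field_entropies(sd, mantissa_bits=8, n_samples=20000, seed=0):
    """Estimate (exponent entropy, mantissa entropy) for N(0, sd^2) samples."""
    rng = random.Random(seed)
    exponents, mantissas = Counter(), Counter()
    for _ in range(n_samples):
        x = rng.gauss(0.0, sd)
        if x == 0.0:
            continue  # zero has no normalized exponent; skip it
        m, e = math.frexp(abs(x))  # abs(x) == m * 2**e with 0.5 <= m < 1
        exponents[e] += 1
        # Map m to a fraction in [0, 1) and keep its top `mantissa_bits` bits.
        mantissas[int((2.0 * m - 1.0) * 2 ** mantissa_bits)] += 1
    return entropy_bits(exponents), entropy_bits(mantissas)


for sd in (0.1, 1.0, 10.0):
    h_exp, h_man = field_entropies(sd)
    print(f"sd={sd:>5}: exponent ~{h_exp:.2f} bits, mantissa ~{h_man:.2f} bits")
```

In this toy setup the mantissa estimate stays near its 8-bit maximum while the exponent estimate stays at a few bits, whatever the spread of the data.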

Two Ways to Learn: The Calculator vs. The Hiker

The paper compares two ways the robot learns the line:

Exact Linear Regression (The Calculator):
- How it works: The robot looks at all the dots at once and uses a magic formula to draw the perfect line immediately.
- The Cost: The energy cost is almost entirely determined by how many dots (data points) you give it. The more dots, the more energy it takes to "erase" the old possibilities and settle on the one true line.
Stochastic Gradient Descent / SGD (The Hiker):
- How it works: Instead of seeing all the dots, the robot takes small steps. It looks at a few dots, guesses a line, looks at a few more, and adjusts. It does this thousands of times.
- The Cost: This is even more expensive. Because the robot is constantly "guessing and correcting," it is constantly erasing its previous guesses. The energy cost grows with the number of steps it takes.
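The two approaches can be put side by side for the single-parameter, zero-intercept model y = w·x + noise. This is a minimal sketch using only the standard library; the synthetic data, the learning rate, the batch size, and the step count are illustrative assumptions, not settings from the paper.

```python
import random


def make_data(n, w_true=2.0, noise_sd=0.5, seed=0):
    """Synthetic data for y = w_true * x + Gaussian noise (illustrative values)."""
    rng = random.Random(seed)
    xs = [rng.gauss(0.0, 1.0) for _ in range(n)]
    ys = [w_true * x + rng.gauss(0.0, noise_sd) for x in xs]
    return xs, ys


def fit_exact(xs, ys):
    """Closed-form least squares for a zero-intercept line: sum(x*y) / sum(x*x)."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)


def fit_sgd(xs, ys, lr=0.05, batch_size=10, steps=2000, seed=1):
    """Mini-batch SGD on the half squared error 0.5 * (w*x - y)**2."""
    rng = random.Random(seed)
    w = 0.0
    n = len(xs)
    for _ in range(steps):
        batch = [rng.randrange(n) for _ in range(batch_size)]
        grad = sum((w * xs[i] - ys[i]) * xs[i] for i in batch) / batch_size
        w -= lr * grad  # each update overwrites the previous guess
    return w


xs, ys = make_data(2000)
w_exact = fit_exact(xs, ys)
w_sgd = fit_sgd(xs, ys)
print(f"exact: {w_exact:.4f}   sgd: {w_sgd:.4f}")
```

The exact fit reads the whole dataset once and produces one number, while the SGD loop overwrites `w` at every step, which is the repeated erasure the cost discussion above refers to.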

The Verdict: In both cases, the amount of data is the biggest driver of energy cost. The more data you feed the machine, the more heat it generates, simply because it has to process and discard more information to find the pattern.

The "Sweet Spot": When More Data is a Waste

The authors then ask a practical question: Is it ever worth using more data?

Imagine you are running a business. You pay for electricity (energy cost) to train your model, and you get paid by customers who use the model (revenue).

If you use a tiny bit of data, your model is bad, and customers don't pay much.
If you use a massive amount of data, your model is perfect, but the electricity bill is huge.

The paper derives a "scaling law" (a rule of thumb) that finds the optimal amount of data.

The Analogy: Imagine you are trying to hit a bullseye with a dart.
- If the dartboard is shaky (high noise), throwing 1,000 darts won't help you hit the center any better than throwing 100. You've just wasted the energy of throwing 900 extra darts.
- The paper shows that because of the "irreducible noise" (the fact that the data is messy), there is a point where adding more data costs more in electricity than the extra profit you get from the slightly better accuracy.
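A toy version of this trade-off shows why an optimum exists at all. The sketch below is not the paper's scaling law. It assumes a hypothetical profit function in which the generalization error is noise_var · (1 + 1/n), which is the irreducible noise plus an estimation error that shrinks with n, and in which the energy cost grows linearly with n. All parameter names and values are made up for illustration.

```python
import math


def profit(n, error_penalty=1000.0, noise_var=1.0, energy_price=0.001):
    """Hypothetical profit: lost revenue from prediction error plus energy bill.

    Generalization error is modeled as noise_var * (1 + 1/n): an irreducible
    floor (noise_var) plus an estimation term that shrinks as 1/n.
    Energy cost is assumed to be linear in the number of data points.
    """
    generalization_error = noise_var * (1.0 + 1.0 / n)
    return -error_penalty * generalization_error - energy_price * n


# Brute-force search over dataset sizes.
best_n = max(range(1, 10001), key=profit)

# For this toy model, setting the derivative to zero gives
# n* = sqrt(error_penalty * noise_var / energy_price).
closed_form = math.sqrt(1000.0 * 1.0 / 0.001)

print(f"best n by search: {best_n}, closed form: {closed_form:.1f}")
```

Past the optimum, each extra data point lowers the error by less than it adds to the energy bill, and no amount of data gets below the irreducible noise floor.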

The "Mismatch" Cost: The Hidden Fee

Finally, the paper touches on a concept called Mismatch Cost.

The Analogy: Imagine you are trying to fit a square peg into a round hole. If you force it, you generate friction (heat).
In computing, if the data you start with doesn't match the "perfect" starting state the machine wants to be in to be most efficient, you generate extra heat.
The authors propose a way to estimate this "friction cost" even when we don't know the exact physics of the computer chip. They show that if your data is "weird" or doesn't fit the machine's ideal expectations, you pay an extra energy tax.

Summary

Computing costs heat: Every time a computer learns a simple line, it burns energy to erase information.
Details are expensive: The specific digits (mantissa) in a number cost more energy to process than the general size (exponent).
More data = More heat: The primary driver of energy cost is the sheer volume of data.
There's a limit: Sometimes, using more data to get a slightly better model is a bad deal because the electricity bill outweighs the benefit.
Noise matters: Noisier data requires more energy to process because the computer has to work harder to find the signal.

This paper doesn't tell us how to build better AI for the future; it simply puts a price tag on the physics of learning a very simple math problem, showing us that information has a thermodynamic cost.

Technical Summary: The Thermodynamic Costs of Simple Linear Regression

Problem Statement
The construction and deployment of data-driven models constitute a significant and growing portion of global energy consumption. As physical computing components shrink, understanding how fundamental thermodynamic bounds apply to modeling algorithms becomes increasingly critical. While thermodynamic limits have been studied for discrete algorithms and binary classification tasks, their application to regression algorithms—specifically those operating on real-valued inputs and parameters quantized for digital hardware—remains unexplored. This paper addresses the thermodynamic costs of a foundational modeling algorithm: simple linear regression (a single-parameter model with zero intercept).

Methodology
The authors analyze the thermodynamic costs of two methods for fitting a linear model: exact linear regression (analytic solution) and linear regression via Stochastic Gradient Descent (SGD). The analysis adheres to the following framework:

Physical Model and Accounting Convention: The study adopts the standard accounting convention for cyclic devices (following Wolpert), tracking the thermodynamic costs of logically irreversible computations. It assumes the physical system is composed of bits in thermal equilibrium at temperature $T$. The energetic cost is bounded by Landauer's principle, where the minimum work required is proportional to the reduction in thermodynamic entropy of the computational system: $\Delta E_{min} = -T \Delta S_{sys}$.
Quantization and Entropy: Recognizing that modern deep learning systems utilize floating-point representations, the authors derive the discrete entropy of continuous random variables quantized to floating-point numbers. They extend the uniform lattice framework to the non-uniform bin structure of floating-point formats.
- They establish a link between the differential entropy of continuous variables and the discrete entropy of their floating-point counterparts.
- They derive analytic approximations for the entropy of Gaussian-distributed variables quantized to floating-point numbers, distinguishing between the contributions of exponent bits and mantissa bits.
Cost Calculation:
- Exact Regression: The Landauer cost is calculated as the difference between the entropy of the input dataset ($n$ data points) and the entropy of the output model parameter ($\hat{w}$).
- SGD: The cost is derived by summing the Landauer costs over $\tau$ update steps. The authors model the SGD dynamics using an Ornstein-Uhlenbeck process to approximate the distribution of the model parameter over time.
Scaling Laws: The authors formulate an optimization problem to determine the optimal dataset size ($n^*$) that maximizes profit. This profit function balances the revenue from inference (dependent on generalization error) against the energy cost of training, incorporating prices for energy and inference.
Mismatch Cost (MMC): The paper discusses a method to lower-bound the mismatch cost—the additional entropy production arising when the input distribution differs from the optimal distribution that minimizes total entropy production—beyond the reversible Landauer bound.
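The cost calculation described above can be written out in one line, as a sketch that follows from the stated bound rather than a formula quoted from the paper. It assumes entropies $H(\cdot)$ measured in bits, so that thermodynamic entropy is $S = k_B \ln 2 \cdot H$, and it introduces the notation $\mathcal{D}_n$ for the quantized dataset of $n$ points and $\Delta E_{min}^{(t)}$ for the cost of the $t$-th SGD update. Both symbols are assumptions made here for illustration.

$$
\Delta E_{min}^{\text{exact}} = -T \Delta S_{sys} = k_B T \ln 2 \, \bigl[ H(\mathcal{D}_n) - H(\hat{w}) \bigr],
\qquad
\Delta E_{min}^{\text{SGD}} = \sum_{t=1}^{\tau} \Delta E_{min}^{(t)}
$$

The sign works out because the computation maps a high-entropy input (the dataset) to a low-entropy output (one parameter), so $\Delta S_{sys}$ is negative and the minimum work is positive.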

Key Contributions and Results

Entropy of Floating-Point Numbers: The paper provides a theoretical foundation for the entropy of floating-point numbers. It demonstrates that for Gaussian variables, the entropy of the mantissa bits is high and relatively constant with respect to variance, while the entropy of exponent bits is low. Specifically, the approximate discrete entropy for a zero-mean Gaussian is $\tilde{H}_s(p) \approx p + 2.46$ bits, where $p$ is the precision.
Dominance of Data Size and Mantissa Bits: In both exact regression and SGD, the thermodynamic cost is primarily driven by the size of the dataset ($n$) and the precision ($p$) of the floating-point representation.
- The number of mantissa bits contributes significantly to the cost due to the high entropy of the mantissa.
- Increasing the number of exponent bits has a negligible effect on thermodynamic costs, provided overflows and underflows are avoided.
- Higher signal-to-noise ratios (SNR) in the input data lead to lower thermodynamic costs.
Energy-Accuracy Trade-offs: The derived scaling laws reveal a trade-off between model accuracy (generalization error) and energy cost. The irreducible error of the model's predictions creates a threshold where using more data to increase accuracy is not energetically justified, given the associated energy costs and user demand for inference.
Comparison of Algorithms: The analysis shows that for a fixed task, the optimal dataset size for exact linear regression is generally less than or equal to that for SGD, though SGD hyperparameters (learning rate, batch size) significantly influence this optimum.
Mismatch Cost Bound: The paper presents a variational approach to lower-bound the mismatch cost for algorithms with parameterized continuous input distributions, offering a method to estimate costs beyond the thermodynamically reversible limit.
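The approximation $\tilde{H}_s(p) \approx p + 2.46$ bits from the first result above makes it easy to compare formats by arithmetic. The sketch below assumes that $p$ counts significand bits in the IEEE 754 sense, including the implicit leading bit (24 for binary32, 11 for binary16, 8 for bfloat16). The paper's own convention for $p$ may differ, so the numbers are illustrative only.

```python
# Offset from the approximation H(p) ~ p + 2.46 bits for a zero-mean Gaussian.
ENTROPY_OFFSET_BITS = 2.46

# ASSUMPTION: precision counted as in IEEE 754, including the implicit bit.
ASSUMED_PRECISION = {"binary32": 24, "binary16": 11, "bfloat16": 8}


def approx_entropy_bits(precision_bits):
    """Approximate entropy (bits) of a quantized zero-mean Gaussian variable."""
    return precision_bits + ENTROPY_OFFSET_BITS


def relative_cost(fmt, baseline="binary32"):
    """Entropy of `fmt` relative to `baseline`.

    At a fixed temperature the Landauer bound is proportional to entropy,
    so this ratio is also the ratio of minimum erasure costs per variable.
    """
    return (approx_entropy_bits(ASSUMED_PRECISION[fmt])
            / approx_entropy_bits(ASSUMED_PRECISION[baseline]))


for name, p in ASSUMED_PRECISION.items():
    print(f"{name:>9}: p={p:2d}  H~{approx_entropy_bits(p):.2f} bits  "
          f"relative to binary32: {relative_cost(name):.2f}")
```

Under these assumptions the saving comes entirely from dropping mantissa bits, since the exponent width does not appear in the approximation at all.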

Significance and Claims
The authors claim that this work provides a theoretical foundation for empirical observations regarding the entropy of neural network weights (e.g., low entropy in exponent bits, high entropy in mantissa bits). The results suggest that:

Thermodynamic Efficiency: Mantissa bits are thermodynamically expensive, while exponent bits are cheap. This supports the efficacy of number formats like bfloat16, which reduce mantissa bits while retaining exponent precision.
Data Quality: Less noisy, more structured data (higher SNR) yields lower fundamental energy costs for training.
Optimization: There exists an energy-optimal dataset size; blindly increasing data size to improve accuracy may be counterproductive from a thermodynamic and economic perspective due to the irreducible noise floor.
Future Directions: The paper positions this single-parameter analysis as a stepping stone toward understanding multi-parameter models, suggesting potential generalizations via the Neural Tangent Kernel. It acknowledges that determining the true entropy flow to the environment ($\Delta S_{env}$) and specific mismatch costs requires further physical modeling of hardware implementations (e.g., CMOS), which is left for future work.

The study does not propose new hardware or specific experimental protocols but rather offers a thermodynamic framework for evaluating the efficiency of existing linear modeling algorithms and their scaling laws.