MXNorm: Reusing MXFP block scales for efficient tensor normalisation

The paper introduces MXNorm, an efficient normalization method that reuses the block scales already computed for MXFP8 casting to estimate the RMS, cutting normalization overhead and delivering measurable speedups in transformer training without compromising accuracy.

Callum McLean, Luke Y. Prince, Alexandre Payot, Paul Balança, Carlo Luschi

Published 2026-03-16

Imagine you are running a massive, high-speed factory that builds intelligent robots (these are your AI models). For years, the biggest bottleneck in your factory was the assembly line where the heavy lifting happens: multiplying huge matrices of numbers.

To speed things up, you upgraded your assembly line to use tiny, lightweight parts (low-precision numbers like MXFP8). This made the assembly line incredibly fast—80 times faster than before!

The New Problem:
However, while the assembly line is now a blur of speed, the quality control inspectors (normalization layers) are still working at a snail's pace. They are checking every single part with a heavy, high-precision magnifying glass before the parts move to the next stage. Because the assembly line is so fast, the inspectors are now the bottleneck, slowing down the whole factory.

The Paper's Solution: MXNorm
The authors of this paper propose a clever trick called MXNorm. Here is how it works, using a simple analogy:

1. The Old Way (RMSNorm + Casting)

Imagine you have a box of 32 marbles (a "block" of data).

  1. Inspect: You measure the size of every single marble to find the average size (RMS).
  2. Rescale: You adjust the marbles so they fit a standard size.
  3. Pack: You put them into a tiny, lightweight box (casting to low precision).

This requires two separate trips through the box: one to measure, one to pack. It's slow.
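The two-trip pattern can be sketched in NumPy. This is a minimal illustration of the idea, not the paper's implementation; the function names and the block size of 32 (the standard MXFP8 block) are the only assumptions, and a real kernel would also round the scaled blocks to FP8.

```python
import numpy as np

BLOCK = 32  # MXFP8 block size: 32 values share one scale

def rmsnorm(x, eps=1e-6):
    # Trip 1: a full-precision reduction over the whole vector
    # to measure the RMS ("inspect every marble").
    rms = np.sqrt(np.mean(x**2) + eps)
    return x / rms

def cast_to_blocks(x):
    # Trip 2: per 32-element block, find the absmax (the "block
    # scale") and rescale; a real kernel would then round to FP8.
    blocks = x.reshape(-1, BLOCK)
    scales = np.abs(blocks).max(axis=1, keepdims=True)
    return blocks / scales, scales

rng = np.random.default_rng(0)
x = rng.standard_normal(4 * BLOCK).astype(np.float32)
normed = rmsnorm(x)                 # first pass over the data
q, scales = cast_to_blocks(normed)  # second pass over the data
```

Note that each step performs its own reduction over the same data: this is the redundancy MXNorm removes.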

2. The New Way (MXNorm)

The authors realized that when you pack the marbles into the lightweight box, you already have to measure the biggest marble in the box to decide how to shrink them all down. This measurement is called the "Block Scale."

MXNorm says: "Why measure the marbles twice? Let's use the measurement we already took for packing to do the quality control check!"

Instead of calculating a complex average of every single marble, they use the largest marble in the group (the "Block Absmax") to estimate the average size.

  • The Magic: They proved mathematically that if you know the size of the biggest marble in a group, you can guess the average size of the whole group with surprising accuracy.
  • The Result: You skip the slow, heavy inspection step. You measure once, pack once, and move on.
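The estimate above can be sketched as follows. This is an illustrative reconstruction of the idea, not the paper's code: the correction constant `c` (relating the absmax of a block of roughly Gaussian values to their RMS) is a hand-picked placeholder here, whereas the paper derives the relationship properly.

```python
import numpy as np

BLOCK = 32  # MXFP8 block size

def mxnorm_estimate_rms(x, c=2.3):
    # The per-block absmax values are exactly the "block scales"
    # the MXFP8 cast must compute anyway, so this reduction is
    # 32x cheaper than touching every element again.
    blocks = x.reshape(-1, BLOCK)
    absmax = np.abs(blocks).max(axis=1)
    # Combine the block scales (quadratic mean) and apply a
    # correction factor c -- illustrative value, not the paper's.
    return np.sqrt(np.mean(absmax**2)) / c

rng = np.random.default_rng(0)
x = rng.standard_normal(1024 * BLOCK).astype(np.float32)
true_rms = np.sqrt(np.mean(x**2))
est_rms = mxnorm_estimate_rms(x)  # close to true_rms for Gaussian data
```

For roughly Gaussian activations the estimate lands close to the true RMS, which is why "looking at the biggest marble" is good enough.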

Why Does This Matter?

  • Speed: By reusing the "packing measurement" for the "quality check," they reduced the work needed by 32 times for this specific step.
  • Real-World Impact: They tested this on a famous AI model (Llama 3).
    • Accuracy: The robots learned just as well as before. The "guess" was good enough.
    • Speed: The inspection step ran 2.4 times faster. Across a massive AI model with thousands of layers, that adds up to a 2.6% overall speedup. A 2.6% saving may sound small, but in the world of large-scale AI training it is a huge win.
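A back-of-envelope Amdahl's-law check shows how these two numbers fit together. The assumption that they compose this way is mine, not the paper's, but it gives a feel for how much runtime the normalization step occupies:

```python
# Amdahl's law: overall = 1 / (1 - f + f / step_speedup),
# where f is the fraction of total runtime spent in the step.
step_speedup = 2.4      # reported speedup of the normalization step
overall_speedup = 1.026  # reported 2.6% end-to-end speedup

# Solve for f:
f = (1 - 1 / overall_speedup) / (1 - 1 / step_speedup)
# f comes out to roughly 4.3%: a small slice of runtime, which is
# why a 2.4x step speedup yields "only" 2.6% end to end.
```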

The "Outlier" Catch

There was one hiccup. When they tried a simple version of this "guessing" method (using a simple average), the factory occasionally had a "meltdown" (a loss spike) when a single, giant, weird marble (an outlier) appeared.

They fixed this by using a slightly smarter way of guessing (using a "quadratic mean" instead of a simple average). This kept the factory stable even when weird marbles showed up.

The Bottom Line

MXNorm is like realizing you don't need to weigh every single apple in a crate to know if the crate is heavy enough for shipping. You just look at the heaviest apple, and you can estimate the rest.

By reusing the data you already calculate for one job (packing data into low-precision formats) to do another job (normalizing the data), they removed a major traffic jam in AI training. This allows us to build bigger, smarter AI models faster without needing to invent new, expensive hardware.
