MXNorm: Reusing MXFP block scales for efficient tensor normalisation

The paper introduces MXNorm, an efficient normalization method that reuses the block scales already computed for MXFP8 casting to estimate the RMS, cutting normalization overhead and delivering measurable speedups in transformer training without compromising accuracy.

Callum McLean, Luke Y. Prince, Alexandre Payot, Paul Balança, Carlo Luschi

Published 2026-03-16

Imagine you are running a massive, high-speed factory that builds intelligent robots (these are your AI models). For years, the biggest bottleneck in your factory was the assembly line where the heavy lifting happens: multiplying huge matrices of numbers.

To speed things up, you upgraded your assembly line to use tiny, lightweight parts (low-precision numbers like MXFP8). This made the assembly line incredibly fast—80 times faster than before!

The New Problem:
However, while the assembly line is now a blur of speed, the quality control inspectors (normalization layers) are still working at a snail's pace. They are checking every single part with a heavy, high-precision magnifying glass before the parts move to the next stage. Because the assembly line is so fast, the inspectors are now the bottleneck, slowing down the whole factory.

The Paper's Solution: MXNorm
The authors of this paper propose a clever trick called MXNorm. Here is how it works, using a simple analogy:

1. The Old Way (RMSNorm + Casting)

Imagine you have a box of 32 marbles (a "block" of data).

  1. Inspect: You measure the size of every single marble to find the average size (RMS).
  2. Rescale: You adjust the marbles so they fit a standard size.
  3. Pack: You put them into a tiny, lightweight box (casting to low precision).

This requires two separate trips through the box: one to measure, one to pack. It's slow.
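The two-trip pattern can be sketched in NumPy. This is a minimal illustration of the idea, not the paper's implementation; the function names and the block size of 32 (the standard MXFP8 block) are the only assumptions, and a real kernel would also round the scaled blocks to FP8.

```python
import numpy as np

BLOCK = 32  # MXFP8 block size: 32 values share one scale

def rmsnorm(x, eps=1e-6):
    # Trip 1: a full-precision reduction over the whole vector
    # to measure the RMS ("inspect every marble").
    rms = np.sqrt(np.mean(x**2) + eps)
    return x / rms

def cast_to_blocks(x):
    # Trip 2: per 32-element block, find the absmax (the "block
    # scale") and rescale; a real kernel would then round to FP8.
    blocks = x.reshape(-1, BLOCK)
    scales = np.abs(blocks).max(axis=1, keepdims=True)
    return blocks / scales, scales

rng = np.random.default_rng(0)
x = rng.standard_normal(4 * BLOCK).astype(np.float32)
normed = rmsnorm(x)                 # first pass over the data
q, scales = cast_to_blocks(normed)  # second pass over the data
```

Note that each step performs its own reduction over the same data: this is the redundancy MXNorm removes.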

2. The New Way (MXNorm)

The authors realized that when you pack the marbles into the lightweight box, you already have to measure the biggest marble in the box to decide how to shrink them all down. This measurement is called the "Block Scale."

MXNorm says: "Why measure the marbles twice? Let's use the measurement we already took for packing to do the quality control check!"

Instead of calculating a complex average of every single marble, they use the largest marble in the group (the "Block Absmax") to estimate the average size.

  • The Magic: They proved mathematically that if you know the size of the biggest marble in a group, you can guess the average size of the whole group with surprising accuracy.
  • The Result: You skip the slow, heavy inspection step. You measure once, pack once, and move on.
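The estimate above can be sketched as follows. This is an illustrative reconstruction of the idea, not the paper's code: the correction constant `c` (relating the absmax of a block of roughly Gaussian values to their RMS) is a hand-picked placeholder here, whereas the paper derives the relationship properly.

```python
import numpy as np

BLOCK = 32  # MXFP8 block size

def mxnorm_estimate_rms(x, c=2.3):
    # The per-block absmax values are exactly the "block scales"
    # the MXFP8 cast must compute anyway, so this reduction is
    # 32x cheaper than touching every element again.
    blocks = x.reshape(-1, BLOCK)
    absmax = np.abs(blocks).max(axis=1)
    # Combine the block scales (quadratic mean) and apply a
    # correction factor c -- illustrative value, not the paper's.
    return np.sqrt(np.mean(absmax**2)) / c

rng = np.random.default_rng(0)
x = rng.standard_normal(1024 * BLOCK).astype(np.float32)
true_rms = np.sqrt(np.mean(x**2))
est_rms = mxnorm_estimate_rms(x)  # close to true_rms for Gaussian data
```

For roughly Gaussian activations the estimate lands close to the true RMS, which is why "looking at the biggest marble" is good enough.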

Why Does This Matter?

  • Speed: By reusing the "packing measurement" for the "quality check," they reduced the work needed by 32 times for this specific step.
  • Real-World Impact: They tested this on a famous AI model (Llama 3).
    • Accuracy: The robots learned just as well as before. The "guess" was good enough.
    • Speed: The inspection step ran 2.4 times faster. Across a massive AI model with thousands of layers, that adds up to a 2.6% overall speedup. A 2.6% saving may sound small, but in the world of large-scale AI training it is a huge win.
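A back-of-envelope Amdahl's-law check shows how these two numbers fit together. The assumption that they compose this way is mine, not the paper's, but it gives a feel for how much runtime the normalization step occupies:

```python
# Amdahl's law: overall = 1 / (1 - f + f / step_speedup),
# where f is the fraction of total runtime spent in the step.
step_speedup = 2.4      # reported speedup of the normalization step
overall_speedup = 1.026  # reported 2.6% end-to-end speedup

# Solve for f:
f = (1 - 1 / overall_speedup) / (1 - 1 / step_speedup)
# f comes out to roughly 4.3%: a small slice of runtime, which is
# why a 2.4x step speedup yields "only" 2.6% end to end.
```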

The "Outlier" Catch

There was one hiccup. When they tried a simple version of this "guessing" method (using a simple average), the factory occasionally had a "meltdown" (a loss spike) when a single, giant, weird marble (an outlier) appeared.

They fixed this by using a slightly smarter way of guessing (using a "quadratic mean" instead of a simple average). This kept the factory stable even when weird marbles showed up.

The Bottom Line

MXNorm is like realizing you don't need to weigh every single apple in a crate to know if the crate is heavy enough for shipping. You just look at the heaviest apple, and you can estimate the rest.

By reusing the data you already calculate for one job (packing data into low-precision formats) to do another job (normalizing the data), they removed a major traffic jam in AI training. This allows us to build bigger, smarter AI models faster without needing to invent new, expensive hardware.
