AI & ML Efficiency Breakthrough

Enables RMSNorm to reuse MXFP8 block scales, shrinking the normalization reduction by 32x and delivering a 2.4x kernel speedup.

March 16, 2026

Original Paper

MXNorm: Reusing MXFP block scales for efficient tensor normalisation

Callum McLean, Luke Y. Prince, Alexandre Payot, Paul Balança, Carlo Luschi

arXiv · 2603.13180

The Takeaway

As matrix multiplication gets faster via low-precision formats, normalization becomes a relatively larger bottleneck. This method provides a drop-in replacement for RMSNorm that speeds up the layers of an 8B-parameter model by 2.6% end-to-end.

From the abstract

Matrix multiplication performance has long been the major bottleneck to scaling deep learning workloads, which has stimulated the design of new accelerators that use increasingly low-precision number formats. However, improvements in matrix multiplication performance have far outstripped improvements in performance on reductions and elementwise computations, which are still being performed in higher precision. In this work, we propose MXNorm, a drop-in replacement for RMSNorm that estimates the […]
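The core idea — estimating the RMS from the per-block scales that MXFP8 quantization already computes, instead of reducing over every element — can be illustrated with a minimal NumPy sketch. This is an assumption-laden illustration, not the paper's implementation: it assumes MXFP8's 32-element blocks with power-of-two absmax scales, and uses an uncalibrated estimator purely to show how the reduction shrinks by the block size.

```python
import numpy as np

BLOCK = 32  # MXFP8 block size: one shared scale per 32 elements


def mx_block_scales(x: np.ndarray) -> np.ndarray:
    """Per-block power-of-two scales, as MXFP8 quantization computes:
    the exponent of the absolute maximum within each 32-element block."""
    blocks = x.reshape(-1, BLOCK)
    absmax = np.maximum(np.abs(blocks).max(axis=1), 1e-12)  # avoid log2(0)
    return np.exp2(np.floor(np.log2(absmax)))


def rmsnorm_exact(x: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Reference RMSNorm: a full reduction over all N elements."""
    rms = np.sqrt(np.mean(x * x) + eps)
    return x / rms


def mxnorm_sketch(x: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Hypothetical sketch of the MXNorm idea: estimate the RMS from the
    N/32 block scales already produced by MXFP8 quantization, so the
    reduction is 32x smaller than a full sum of squares. The constant
    relating mean squared block-absmax scales to the true RMS depends on
    the data distribution; the paper's estimator will differ."""
    s = mx_block_scales(x)
    rms_est = np.sqrt(np.mean(s * s) + eps)
    return x / rms_est
```

Because the block scales are a byproduct of quantizing the tensor for the matrix multiply, this estimate comes almost for free: the only extra work is a reduction over N/32 scales rather than N elements, which is where the 32x reduction-size saving in the headline comes from.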