AI & ML Efficiency Breakthrough

Introduces Mixture-of-Depths Attention (MoDA) to solve signal degradation in deep LLMs with hardware-efficient implementation.

March 17, 2026

Original Paper

Mixture-of-Depths Attention

Lianghui Zhu, Yuxin Fang, Bencheng Liao, Shijie Wang, Tianheng Cheng, Zilong Huang, Chen Chen, Lai Wei, Yutao Zeng, Ya Wang, Yi Lin, Yu Li, Xinggang Wang

arXiv · 2603.15619

The Takeaway

MoDA addresses the 'residual dilution' problem, in which features formed in shallow layers fade as depth grows, by allowing attention heads to directly access KV pairs from earlier layers. Crucially, it achieves 97.3% of the efficiency of FlashAttention-2, making it a viable, high-performance alternative to standard attention for scaling model depth.

From the abstract

Scaling depth is a key driver for large language models (LLMs). Yet, as LLMs become deeper, they often suffer from signal degradation: informative features formed in shallow layers are gradually diluted by repeated residual updates, making them harder to recover in deeper layers. We introduce mixture-of-depths attention (MoDA), a mechanism that allows each attention head to attend to sequence KV pairs at the current layer and depth KV pairs from preceding layers. We further describe a hardware-efficient implementation […]
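The core mechanism can be sketched in a few lines: each head runs ordinary softmax attention, but its key/value set is the concatenation of the current layer's KV pairs with cached KV pairs from preceding layers. The NumPy snippet below is a minimal single-head illustration of that idea, not the paper's hardware-efficient implementation; the function and variable names are illustrative, and details such as per-head depth routing and masking are omitted.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def moda_head(q, kv_current, kv_depth):
    """One attention head attending jointly over the current layer's
    sequence KV pairs and cached depth KV pairs from earlier layers.

    q:          (T, d) queries at the current layer
    kv_current: (K, V) tuple, each (T, d), from the current layer
    kv_depth:   list of (K, V) tuples cached from preceding layers
    """
    k_cur, v_cur = kv_current
    # Pool earlier-layer KV pairs with the current layer's KV pairs,
    # so shallow-layer features remain directly reachable.
    K = np.concatenate([k for k, _ in kv_depth] + [k_cur], axis=0)
    V = np.concatenate([v for _, v in kv_depth] + [v_cur], axis=0)
    scores = q @ K.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ V

rng = np.random.default_rng(0)
T, d = 4, 8
q = rng.normal(size=(T, d))
kv_now = (rng.normal(size=(T, d)), rng.normal(size=(T, d)))
kv_prev = [(rng.normal(size=(T, d)), rng.normal(size=(T, d)))]  # one earlier layer
out = moda_head(q, kv_now, kv_prev)
print(out.shape)  # (4, 8)
```

Because the depth KV pairs are reused from layers that have already been computed, the extra cost is concentrated in the wider attention span, which is what the paper's hardware-efficient kernel targets.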