AI & ML Efficiency Breakthrough

CARE provides a recipe for converting standard grouped-query attention (GQA) models into efficient multi-head latent attention (MLA) architectures.

March 19, 2026

Original Paper

CARE: Covariance-Aware and Rank-Enhanced Decomposition for Enabling Multi-Head Latent Attention

Zhongzhu Zhou, Fengxiang Bie, Ziyan Chen, Zhenyu Zhang, Yibo Yang, Junxiong Wang, Ben Athiwaratkun, Xiaoxia Wu, Shuaiwen Leon Song

arXiv · 2603.17946

The Takeaway

By accounting for activation covariance and non-uniform layer importance, CARE allows practitioners to adopt the KV-cache-saving benefits of MLA (as seen in DeepSeek-V3) for existing pretrained weights. It significantly lowers the hardware requirements for long-context inference.
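The "non-uniform layer importance" idea can be illustrated with a small sketch: given per-layer importance scores, split a total rank (latent-dimension) budget proportionally instead of uniformly. The scores and the allocation rule here are hypothetical placeholders, not CARE's actual importance measure.

```python
import numpy as np

def allocate_ranks(importance, total_rank, min_rank=1):
    """Split a total rank budget across layers in proportion to
    per-layer importance scores (hypothetical scores; CARE derives
    its own importance measure and allocation)."""
    importance = np.asarray(importance, dtype=float)
    weights = importance / importance.sum()
    ranks = np.maximum(min_rank, np.floor(weights * total_rank).astype(int))
    # Hand any leftover budget to the most important layers first.
    leftover = total_rank - ranks.sum()
    order = np.argsort(-importance)
    for i in order[:max(leftover, 0)]:
        ranks[i] += 1
    return ranks

# e.g. 4 layers sharing a budget of 64 latent dimensions
print(allocate_ranks([1.0, 3.0, 2.0, 2.0], 64))  # → [ 8 24 16 16]
```

A uniform baseline would give every layer 16 dimensions here; the proportional split spends more of the KV-cache budget where the importance score says it matters.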

From the abstract

Converting pretrained attention modules such as grouped-query attention (GQA) into multi-head latent attention (MLA) can improve expressivity without increasing KV-cache cost, making it attractive for efficient inference. However, many practical conversion baselines rely on weight-only low-rank approximations (e.g., SVD-style initializations) and uniform rank allocation. They focus on minimizing the difference between weight matrices rather than on how those weights affect input activations. […]
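The contrast the abstract draws can be made concrete with a minimal numpy sketch: a weight-only truncated SVD minimizes the weight error ‖W − W_r‖_F, while a covariance-aware variant whitens W by the activation covariance so the truncation minimizes the output error E‖Wx − W_r·x‖² instead. This is an illustrative instance of covariance-weighted low-rank approximation, not CARE's exact decomposition; the shapes and data are made up.

```python
import numpy as np

rng = np.random.default_rng(0)
scales = np.logspace(0, 2, 64)            # anisotropic feature scales
X = rng.normal(size=(1024, 64)) * scales  # hypothetical activations
C = X.T @ X / len(X)                      # empirical covariance E[x x^T]
W = rng.normal(size=(32, 64))             # e.g. a key/value projection
r = 8                                     # target rank

# Baseline: weight-only truncated SVD, minimizes ||W - W_r||_F.
U, s, Vt = np.linalg.svd(W, full_matrices=False)
W_svd = (U[:, :r] * s[:r]) @ Vt[:r]

# Covariance-aware: factor C = S S^T, approximate W S instead of W,
# so the rank-r truncation minimizes the *output* error E||Wx - W_r x||^2.
S = np.linalg.cholesky(C + 1e-6 * np.eye(64))
U, s, Vt = np.linalg.svd(W @ S, full_matrices=False)
W_care = (U[:, :r] * s[:r]) @ Vt[:r] @ np.linalg.inv(S)

err = lambda W_r: np.linalg.norm(X @ (W - W_r).T)
print("weight-only:", err(W_svd), "covariance-aware:", err(W_care))
```

On anisotropic activations like these, the covariance-aware factors reproduce the layer's outputs more faithfully at the same rank, which is exactly the gap the paper attributes to SVD-style, weight-only baselines.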