AI & ML Efficiency Breakthrough

Scales curvature-aware bilevel optimization to BERT-sized models using KFAC, significantly outperforming standard gradient unrolling.

April 1, 2026

Original Paper

Efficient Bilevel Optimization with KFAC-Based Hypergradients

Disen Liao, Felix Dangel, Yaoliang Yu

arXiv · 2603.29108

The Takeaway

Bilevel optimization (used in meta-learning and neural architecture search) is usually bottlenecked by expensive inverse Hessian-vector products; this paper provides a scalable way to retain curvature information in the hypergradient. That makes complex meta-optimization tasks viable for modern, large-scale architectures.
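
For context, the hypergradient that implicit function theorem-based methods compute makes the role of the inverse Hessian explicit. The notation below is standard bilevel background, assumed here rather than quoted from the paper:

```latex
% Standard IFT hypergradient for a bilevel problem (background notation):
%   outer objective F(\lambda, w), inner objective L(\lambda, w),
%   inner solution w^*(\lambda) = \arg\min_w L(\lambda, w).
\[
  \nabla_\lambda F
  = \frac{\partial F}{\partial \lambda}
  - \frac{\partial^2 L}{\partial \lambda \,\partial w}
    \Bigl(\frac{\partial^2 L}{\partial w \,\partial w}\Bigr)^{-1}
    \frac{\partial F}{\partial w}.
\]
% The term (\partial^2 L / \partial w^2)^{-1} (\partial F / \partial w) is the
% inverse Hessian-vector product (IHVP); replacing the inverse Hessian with the
% identity or a short Neumann series is what discards curvature information.
```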

From the abstract

Bilevel optimization (BO) is widely applicable to many machine learning problems. Scaling BO, however, requires repeatedly computing hypergradients, which involves solving inverse Hessian-vector products (IHVPs). In practice, these operations are often approximated using crude surrogates such as one-step gradient unrolling or identity/short Neumann expansions, which discard curvature information. We build on implicit function theorem-based algorithms and propose to incorporate Kronecker-factored
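
To see why a Kronecker-factored curvature block helps, here is a minimal NumPy sketch, not the paper's algorithm or code: a toy Hessian block that is exactly a Kronecker product, solved three ways. All sizes and names are hypothetical toy choices, and in real networks the Kronecker structure is itself an approximation.

```python
# Toy illustration (assumed setup, not the paper's method): why a Kronecker-
# factored curvature block makes inverse Hessian-vector products (IHVPs) cheap,
# compared with the short Neumann surrogate mentioned in the abstract.
import numpy as np

rng = np.random.default_rng(0)

def random_spd(k):
    """Random symmetric positive-definite matrix (stand-in for a curvature factor)."""
    M = rng.standard_normal((k, k))
    return M @ M.T + k * np.eye(k)

# Pretend one layer's Hessian block is H = kron(A, B), the structural assumption
# behind KFAC-style curvature. A is m x m, B is n x n.
m, n = 8, 5
A, B = random_spd(m), random_spd(n)
H = np.kron(A, B)                       # (m*n) x (m*n); never formed at scale
v = rng.standard_normal(m * n)          # right-hand side of the IHVP H^{-1} v

# 1) Exact dense solve: O((mn)^3) -- the cost we want to avoid.
x_exact = np.linalg.solve(H, v)

# 2) Kronecker-factored solve: kron(A, B)^{-1} = kron(A^{-1}, B^{-1}), so with
#    V = v reshaped to (m, n) we get H^{-1} v = vec(A^{-1} V B^{-1}).
#    Cost: two small solves, O(m^3 + n^3), plus cheap matrix products.
V = v.reshape(m, n)
x_factored = np.linalg.solve(B, np.linalg.solve(A, V).T).T.ravel()

# 3) Short Neumann expansion (a "crude surrogate" from the abstract):
#    H^{-1} v ~= alpha * sum_{k=0}^{K-1} (I - alpha*H)^k v, truncated at small K.
alpha = 1.0 / np.linalg.eigvalsh(H).max()
x_neumann, term = np.zeros_like(v), v.copy()
for _ in range(5):                      # K = 5 terms
    x_neumann += alpha * term
    term = term - alpha * (H @ term)

print("factored-solve error:", np.linalg.norm(x_factored - x_exact))  # tiny: H is exactly Kronecker here
print("5-term Neumann error:", np.linalg.norm(x_neumann - x_exact))   # noticeably larger
```

In this toy the factored solve is exact because H really is a Kronecker product; the point is the cost gap, since only the small factors ever need to be inverted, while the truncated Neumann series trades away curvature accuracy for cheapness.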