Empirically shows that most Transformer layers are redundant, enabling a 54% training cost reduction through non-uniform budget allocation.
March 23, 2026
Original Paper
Anatomical Heterogeneity in Transformer Language Models
arXiv · 2603.19348
The Takeaway
The study identifies 'anti-layers' whose removal actually improves performance and a 'critical core' of layers that are 10^7 times more important than others. Allocating compute according to this anatomical heterogeneity yields 4.7x lower loss at identical parameter counts.
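The "anti-layer" and "critical core" findings come from layer-ablation probing. As a rough illustration of that kind of probe (not the paper's actual protocol), the sketch below skips each decoder block of a causal LM in turn and records the change in language-modeling loss; the checkpoint ID, the toy evaluation text, and the Llama-style `model.model.layers` layout are assumptions, not details taken from the paper.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed Hugging Face repo id and toy eval text; the paper's setup is not specified here.
MODEL_ID = "HuggingFaceTB/SmolLM2-135M"

model = AutoModelForCausalLM.from_pretrained(MODEL_ID)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model.eval()

eval_text = "The quick brown fox jumps over the lazy dog. " * 20
inputs = tokenizer(eval_text, return_tensors="pt")

@torch.no_grad()
def lm_loss(m):
    # Standard next-token loss: labels are the input ids themselves.
    return m(**inputs, labels=inputs["input_ids"], use_cache=False).loss.item()

baseline = lm_loss(model)
layers = model.model.layers  # ModuleList of decoder blocks (Llama-style layout, assumed)

degradation = {}
for i in range(len(layers)):
    removed = layers[i]
    del layers[i]                       # ablate block i by removing it from the stack
    degradation[i] = lm_loss(model) - baseline
    layers.insert(i, removed)           # restore original depth before the next ablation

# Negative degradation would correspond to "anti-layers" (removal helps);
# the largest positive values would mark the "critical core".
for i, d in sorted(degradation.items(), key=lambda kv: kv[1]):
    print(f"layer {i:2d}: delta loss = {d:+.4f}")
```

Ranking layers by this delta is one simple way to decide where a non-uniform training budget should concentrate compute.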
From the abstract
Current transformer language models are trained with uniform computational budgets across all layers, implicitly assuming layer homogeneity. We challenge this assumption through empirical analysis of SmolLM2-135M, a 30-layer, 135M-parameter causal language model, using five diagnostic metrics: weight predictability (R²), ablation degradation, recovery speed, weight manipulation robustness, and structural analysis. We find profound anatomical heterogeneity: (1) Layer weights follow strong mathema…
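The abstract names "weight predictability (R²)" as its first metric but is truncated before defining it. One plausible, hedged reading is "how well per-layer weight statistics follow a smooth trend across depth". The sketch below fits a quadratic in layer index to one illustrative statistic (the Frobenius norm of each block's MLP down-projection) and reports R²; the choice of statistic, the quadratic fit, and the checkpoint ID are assumptions, not the paper's definitions.

```python
import numpy as np
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("HuggingFaceTB/SmolLM2-135M")  # assumed repo id

with torch.no_grad():
    # Illustrative per-layer statistic; the Llama-style block layout is assumed.
    norms = np.array([
        layer.mlp.down_proj.weight.norm().item() for layer in model.model.layers
    ])
depth = np.arange(len(norms))

# Fit a simple quadratic trend across depth and report R^2 as a "weight predictability" proxy.
coeffs = np.polyfit(depth, norms, deg=2)
pred = np.polyval(coeffs, depth)
ss_res = np.sum((norms - pred) ** 2)
ss_tot = np.sum((norms - norms.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot
print(f"R^2 of quadratic fit to per-layer weight norms: {r2:.3f}")
```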