Demonstrates that perplexity/log-likelihood is a deceptive metric for model distillation, often masking massive drops in actual generation quality.
March 30, 2026
Original Paper
When Perplexity Lies: Generation-Focused Distillation of Hybrid Sequence Models
arXiv · 2603.26556
The Takeaway
The authors show that models with nearly identical perplexity can differ by over 20% in autoregressive generation accuracy. For any practitioner building smaller, efficient LLMs, this argues for evaluating generation quality directly during distillation rather than relying on likelihood-based metrics alone.
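To see why the two numbers can diverge, here is a minimal sketch (assuming a Hugging Face causal LM; the model name and test strings are placeholders, not from the paper) of how each is typically computed. Perplexity is measured under teacher forcing, where every step conditions on the ground-truth prefix; generation accuracy makes the model produce each token itself, so per-step errors compound.

```python
# Minimal sketch of the two metrics; model name and data are placeholders.
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "sshleifer/tiny-gpt2"  # placeholder; substitute the distilled student
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

def perplexity(text: str) -> float:
    """Teacher-forced metric: every position conditions on the ground-truth
    prefix, so a per-step error never propagates to later steps."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss  # mean next-token cross-entropy
    return math.exp(loss.item())

def exact_match(prompt: str, target: str) -> bool:
    """Generation metric: the model conditions on its own previous outputs,
    so one early mistake can derail the entire continuation."""
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model.generate(ids, max_new_tokens=16, do_sample=False,
                             pad_token_id=tokenizer.eos_token_id)
    completion = tokenizer.decode(out[0, ids.shape[1]:], skip_special_tokens=True)
    return completion.strip().startswith(target.strip())

print(perplexity("The capital of France is Paris."))
print(exact_match("The capital of France is", "Paris"))
```

A distilled student can match its teacher's teacher-forced loss almost exactly while failing the second check badly, which is the gap the headline number hides.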
From the abstract
Converting a pretrained Transformer into a more efficient hybrid model through distillation offers a promising approach to reducing inference costs. However, achieving high-quality generation in distilled models requires careful joint design of both the student architecture and the distillation process. Many prior distillation works evaluate downstream multiple-choice benchmarks by ranking candidate answers with log-likelihood rather than requiring autoregressive generation, which can obscure important drops in generation quality.
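The shortcut the abstract describes is worth seeing in code. Below is a hedged sketch (the model, question, and answer choices are illustrative placeholders) of multiple-choice evaluation by log-likelihood ranking, contrasted with actually generating an answer; only the second path exercises autoregressive decoding.

```python
# Sketch of multiple-choice eval via log-likelihood ranking vs. generation.
# All names below (model, question, choices) are illustrative placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "sshleifer/tiny-gpt2"  # placeholder distilled student
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

question = "Q: What is the capital of France?\nA:"
choices = [" Paris", " London", " Berlin"]

def choice_logprob(question: str, choice: str) -> float:
    """Sum of log-probabilities the model assigns to the choice tokens,
    conditioned on the question. No token is ever generated here."""
    full = tokenizer(question + choice, return_tensors="pt").input_ids
    prompt_len = tokenizer(question, return_tensors="pt").input_ids.shape[1]
    with torch.no_grad():
        logits = model(full).logits
    # Position t predicts token t+1: choice tokens at positions
    # [prompt_len, seq) are scored by logit rows [prompt_len-1, seq-1).
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    choice_ids = full[0, prompt_len:]
    return log_probs[prompt_len - 1:].gather(1, choice_ids.unsqueeze(1)).sum().item()

# Log-likelihood ranking: pick the highest-scoring fixed candidate.
ranked = max(choices, key=lambda c: choice_logprob(question, c))

# Generation-based evaluation: the model must produce the answer itself.
ids = tokenizer(question, return_tensors="pt").input_ids
with torch.no_grad():
    out = model.generate(ids, max_new_tokens=8, do_sample=False,
                         pad_token_id=tokenizer.eos_token_id)
generated = tokenizer.decode(out[0, ids.shape[1]:], skip_special_tokens=True)

print("ranked choice:   ", ranked)
print("generated answer:", generated)
```

Ranking only asks which fixed string the model finds least surprising; generation asks whether the model can produce a coherent answer on its own. A student distilled to match likelihoods can pass the first test while failing the second, which is the mismatch the paper's generation-focused evaluation is designed to expose.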