Efficiency Breakthrough
/ Category lead
Recovers short-context performance in context-extended LLMs using roughly 60x less data than current state-of-the-art distillation methods.
Extending an LLM's context window typically degrades its short-context performance. This paper shows how to restore those capabilities by aligning attention distributions with a linear-memory kernel, an approach that requires only 4M tokens compared with the standard 256M.
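To make "aligning attention distributions" in linear memory concrete, here is a minimal Python sketch. It is not the paper's kernel, which this summary does not describe. It assumes a teacher (for example, the original short-context model) and a student (the context-extended model), and it uses KL divergence as the alignment measure. Every function name is hypothetical. The point it illustrates is that the loss can be accumulated one query row at a time, so the full n x n attention matrix never has to be stored.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_row(query, keys):
    """Attention distribution of one query over all keys (scaled dot product).

    Non-causal for simplicity; a decoder would mask future positions.
    """
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    return softmax(scores)


def attention_alignment_loss(teacher_q, teacher_k, student_q, student_k, eps=1e-12):
    """Mean KL(teacher || student) over query positions.

    Assumption: KL divergence is the alignment objective; the summary does not
    say which measure the paper uses. Only one attention row per model is alive
    at a time, so extra memory is O(n) in sequence length rather than O(n^2).
    """
    total = 0.0
    for tq, sq in zip(teacher_q, student_q):
        p = attention_row(tq, teacher_k)  # teacher row, length n
        q = attention_row(sq, student_k)  # student row, length n
        total += sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))
    return total / len(teacher_q)


# Toy example: two query positions, three key positions, head dimension 2.
queries = [[1.0, 0.0], [0.0, 1.0]]
teacher_keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
student_keys = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]

loss_same = attention_alignment_loss(queries, teacher_keys, queries, teacher_keys)
loss_diff = attention_alignment_loss(queries, teacher_keys, queries, student_keys)
```

In this toy setup `loss_same` is zero because the two models attend identically, while `loss_diff` is positive. A training loop would minimize a loss like this on the student's parameters. A production kernel would do the same row-wise (or block-wise) accumulation on the GPU instead of in Python loops.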