Recovers short-text performance in context-extended LLMs using 60x less data than current state-of-the-art distillation methods.
April 2, 2026
Original Paper
LinearARD: Linear-Memory Attention Distillation for RoPE Restoration
arXiv · 2604.00004
The Takeaway
Context extension typically degrades short-context performance. This paper shows how to restore those capabilities by aligning attention distributions using a linear-memory kernel, requiring only 4M tokens compared to the standard 256M.
From the abstract
The extension of context windows in Large Language Models is typically facilitated by scaling positional encodings followed by lightweight Continual Pre-Training (CPT). While effective for processing long sequences, this paradigm often disrupts original model capabilities, leading to performance degradation on standard short-text benchmarks. We propose LinearARD, a self-distillation method that restores Rotary Position Embeddings (RoPE)-scaled students through attention-structure consistency with the original model.
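To make the core idea concrete, here is a minimal sketch of attention-distribution distillation computed one query row at a time, so peak memory stays O(n) per row rather than O(n²) for the full attention matrix. This is an illustration of the general principle only, not the paper's actual kernel: the function names, the row-streaming strategy, and the use of a plain KL divergence between teacher and student attention rows are all assumptions for exposition.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over the last axis."""
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def attention_kl_rowwise(q_student, k_student, q_teacher, k_teacher, scale):
    """Mean KL(teacher || student) over attention rows, streamed one query
    at a time so only O(n) scores are materialized at once.

    Hypothetical sketch: the real LinearARD objective and kernel may differ.
    """
    n = q_student.shape[0]
    total = 0.0
    for i in range(n):
        # One row of each attention distribution, shape (n,).
        p_t = softmax((k_teacher @ q_teacher[i]) * scale)
        p_s = softmax((k_student @ q_student[i]) * scale)
        total += np.sum(p_t * (np.log(p_t) - np.log(p_s)))
    return total / n
```

As a sanity check, a student identical to its teacher yields a loss of (numerically) zero, and any divergence in queries or keys makes the loss strictly positive, which is the signal the distillation step would minimize.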