REOPOLD achieves 10x better sample efficiency in reasoning distillation, enabling 7B student models to match 32B teachers with far less training data.
March 13, 2026
Original Paper
Scaling Reasoning Efficiently via Relaxed On-Policy Distillation
arXiv · 2603.11137
The Takeaway
By interpreting on-policy distillation as policy optimization and applying reward clipping and dynamic sampling, the framework stabilizes reasoning transfer. In particular, it addresses the 'negative transfer' problem, where a student fails to learn from a teacher that is too far ahead in capability.
From the abstract
On-policy distillation is pivotal for transferring reasoning capabilities to capacity-constrained models, yet remains prone to instability and negative transfer. We show that on-policy distillation can be interpreted, both theoretically and empirically, as a form of policy optimization, where the teacher-student log-likelihood ratio acts as a token reward. From this insight, we introduce REOPOLD (Relaxed On-Policy Distillation), a framework that stabilizes optimization by relaxing the strict imitation …
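To make the reward interpretation concrete, here is a minimal PyTorch sketch, not the authors' implementation: the function name, the clip range `eps`, and the dynamic-sampling filter at the end are illustrative assumptions. It computes the teacher-student log-likelihood ratio on student-sampled tokens as a per-token reward and clips it, per the clipping described in the takeaway.

```python
# Minimal sketch (assumptions, not the paper's code): per-token reward
# r_t = clip(log p_teacher(y_t) - log p_student(y_t), -eps, eps),
# computed on tokens the student itself sampled (on-policy).
import torch
import torch.nn.functional as F

def clipped_token_rewards(student_logits, teacher_logits, tokens, eps=2.0):
    """student_logits, teacher_logits: [batch, seq, vocab] raw logits.
    tokens: [batch, seq] ids sampled from the student (on-policy rollout).
    Returns [batch, seq] clipped log-likelihood-ratio rewards."""
    student_logp = F.log_softmax(student_logits, dim=-1)
    teacher_logp = F.log_softmax(teacher_logits, dim=-1)
    # Gather the log-probs of the actually sampled tokens.
    idx = tokens.unsqueeze(-1)
    ratio = (teacher_logp.gather(-1, idx) - student_logp.gather(-1, idx)).squeeze(-1)
    # Relax strict imitation: bound the reward so tokens the teacher finds
    # wildly unlikely (or likely) cannot dominate the policy update.
    return ratio.clamp(-eps, eps)

# Toy usage with random logits standing in for model outputs.
B, T, V = 2, 8, 50
tokens = torch.randint(0, V, (B, T))
rewards = clipped_token_rewards(torch.randn(B, T, V), torch.randn(B, T, V), tokens)
print(rewards.shape)  # torch.Size([2, 8])

# One hedged reading of "dynamic sampling": drop sequences whose rewards are
# entirely saturated at the clip boundary (no usable learning signal),
# so each batch keeps only informative rollouts.
keep = (rewards.abs() < 2.0).any(dim=-1)
print(keep)  # [batch] boolean mask of sequences to retain
```

The clipping mirrors the ratio-clipping idea from PPO-style policy optimization, which is the framing the abstract invokes; the exact objective REOPOLD optimizes is not specified in the excerpt above.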