REOPOLD achieves 10x better sample efficiency in reasoning distillation, enabling 7B student models to match 32B teachers with far less training data.
March 13, 2026
Original Paper
Scaling Reasoning Efficiently via Relaxed On-Policy Distillation
arXiv · 2603.11137
The Takeaway
By interpreting on-policy distillation as policy optimization and applying reward clipping and dynamic sampling, the framework stabilizes reasoning transfer. In particular, it addresses the 'negative transfer' problem, where a student fails to learn from a teacher that is too far ahead in capability.
From the abstract
On-policy distillation is pivotal for transferring reasoning capabilities to capacity-constrained models, yet remains prone to instability and negative transfer. We show that on-policy distillation can be interpreted, both theoretically and empirically, as a form of policy optimization, where the teacher-student log-likelihood ratio acts as a token reward. From this insight, we introduce REOPOLD (Relaxed On-Policy Distillation), a framework that stabilizes optimization by relaxing the strict imitation …
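To make the reward interpretation concrete, here is a minimal PyTorch sketch, not the authors' implementation: the function name, the clip range `eps`, and the dynamic-sampling filter at the end are illustrative assumptions. It computes the teacher-student log-likelihood ratio on student-sampled tokens as a per-token reward and clips it, per the clipping described in the takeaway.

```python
# Minimal sketch (assumptions, not the paper's code): per-token reward
# r_t = clip(log p_teacher(y_t) - log p_student(y_t), -eps, eps),
# computed on tokens the student itself sampled (on-policy).
import torch
import torch.nn.functional as F

def clipped_token_rewards(student_logits, teacher_logits, tokens, eps=2.0):
    """student_logits, teacher_logits: [batch, seq, vocab] raw logits.
    tokens: [batch, seq] ids sampled from the student (on-policy rollout).
    Returns [batch, seq] clipped log-likelihood-ratio rewards."""
    student_logp = F.log_softmax(student_logits, dim=-1)
    teacher_logp = F.log_softmax(teacher_logits, dim=-1)
    # Gather the log-probs of the actually sampled tokens.
    idx = tokens.unsqueeze(-1)
    ratio = (teacher_logp.gather(-1, idx) - student_logp.gather(-1, idx)).squeeze(-1)
    # Relax strict imitation: bound the reward so tokens the teacher finds
    # wildly unlikely (or likely) cannot dominate the policy update.
    return ratio.clamp(-eps, eps)

# Toy usage with random logits standing in for model outputs.
B, T, V = 2, 8, 50
tokens = torch.randint(0, V, (B, T))
rewards = clipped_token_rewards(torch.randn(B, T, V), torch.randn(B, T, V), tokens)
print(rewards.shape)  # torch.Size([2, 8])

# One hedged reading of "dynamic sampling": drop sequences whose rewards are
# entirely saturated at the clip boundary (no usable learning signal),
# so each batch keeps only informative rollouts.
keep = (rewards.abs() < 2.0).any(dim=-1)
print(keep)  # [batch] boolean mask of sequences to retain
```

The clipping mirrors the ratio-clipping idea from PPO-style policy optimization, which is the framing the abstract invokes; the exact objective REOPOLD optimizes is not specified in the excerpt above.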