Fixes the inherent instability of on-policy distillation in LLMs using local support matching and top-p rollout sampling.
March 27, 2026
Original Paper
Revisiting On-Policy Distillation: Empirical Failure Modes and Simple Fixes
arXiv · 2603.25562
The Takeaway
On-policy distillation is key for math and agentic tasks but often fails due to 'rollout drift,' where student rollouts leave the prefixes the teacher commonly visits. This paper provides the theoretical analysis and empirical fixes needed to make sequence-level on-policy training stable for large-scale deployment.
From the abstract
On-policy distillation (OPD) is appealing for large language model (LLM) post-training because it evaluates teacher feedback on student-generated rollouts rather than fixed teacher traces. In long-horizon settings, however, the common sampled-token variant is fragile: it reduces distribution matching to a one-token signal and becomes increasingly unreliable as rollouts drift away from prefixes the teacher commonly visits. We revisit OPD from the estimator and implementation sides. Theoretically,
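The abstract contrasts a one-token signal (the teacher's feedback evaluated only at the single token the student sampled) with fuller distribution matching. The paper's exact estimators are not reproduced here; the sketch below is a hypothetical illustration of that contrast, assuming a reverse-KL-style objective and a teacher top-p nucleus as the "local support" — the function names and the `top_p=0.9` default are illustrative choices, not the paper's implementation.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def sampled_token_signal(student_logits, teacher_logits, sampled_id):
    # One-token signal: only the probabilities of the single
    # student-sampled token enter the loss. This is the fragile
    # quantity the abstract describes -- high variance, and
    # uninformative once the rollout drifts off the teacher's support.
    s = softmax(student_logits)
    t = softmax(teacher_logits)
    return math.log(s[sampled_id]) - math.log(t[sampled_id])

def local_support_kl(student_logits, teacher_logits, top_p=0.9):
    # Local support matching (sketch): restrict attention to the
    # teacher's top-p nucleus, renormalize both distributions on it,
    # and match the whole local distribution rather than one token.
    t = softmax(teacher_logits)
    order = sorted(range(len(t)), key=lambda i: -t[i])
    support, mass = [], 0.0
    for i in order:
        support.append(i)
        mass += t[i]
        if mass >= top_p:
            break
    s = softmax(student_logits)
    t_z = sum(t[i] for i in support)
    s_z = sum(s[i] for i in support)
    kl = 0.0
    for i in support:
        ti, si = t[i] / t_z, s[i] / s_z
        kl += si * math.log(si / ti)  # reverse KL on the local support
    return kl
```

When student and teacher agree, both signals are zero; the difference shows up off-policy, where the one-token estimate depends entirely on which token happened to be sampled while the local-support KL aggregates over the teacher's nucleus.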