AI & ML Efficiency Breakthrough

Adaptive Layerwise Perturbation (ALP) solves the training-inference mismatch and importance ratio blowup in LLM reinforcement learning.

March 23, 2026

Original Paper

Adaptive Layerwise Perturbation: Unifying Off-Policy Corrections for LLM RL

Chenlu Ye, Xuanchang Zhang, Yifan Hao, Zhou Yu, Ziji Zhang, Abhinav Gullapalli, Hao Chen, Jing Huang, Tong Zhang

arXiv · 2603.19470

The Takeaway

ALP injects learnable noise into hidden states during updates to prevent the policy from deviating too sharply from the inference policy. This flattens heavy-tailed importance ratios and stabilizes training in complex math and tool-use tasks where standard PPO/GRPO often fail.
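The mechanism described above can be illustrated with a small numpy sketch. This is a toy illustration, not the paper's implementation: the logits, the drift term, and the fixed `alpha` scale are all hypothetical stand-ins for ALP's learnable layerwise perturbation, and it works at the logit level rather than on real hidden states.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)   # stabilize before exponentiating
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def importance_ratios(train_logits, infer_logits, actions):
    # Per-token importance ratio pi_train(a|s) / pi_infer(a|s)
    idx = np.arange(len(actions))
    return softmax(train_logits)[idx, actions] / softmax(infer_logits)[idx, actions]

rng = np.random.default_rng(0)
infer_logits = rng.normal(size=(8, 16))       # inference-time policy logits
drift = rng.normal(scale=2.0, size=(8, 16))   # off-policy drift of the updated policy
actions = rng.integers(0, 16, size=8)         # sampled tokens

# Unconstrained update: large drift produces heavy-tailed ratios.
raw = importance_ratios(infer_logits + drift, infer_logits, actions)

# ALP-style damping (sketch): scale the perturbation by a per-layer
# factor alpha (a hypothetical fixed value here; learnable in ALP),
# keeping the updated policy close to the inference policy.
alpha = 0.25
damped = importance_ratios(infer_logits + alpha * drift, infer_logits, actions)

print(raw.max(), damped.max())
```

With `alpha = 0` the perturbed policy coincides with the inference policy and every ratio is exactly 1; intermediate values trade off exploration against ratio blowup.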

From the abstract

Off-policy problems such as policy staleness and training-inference mismatch have become a major bottleneck for training stability and further exploration in LLM RL. As inference efficiency is pushed higher, the distribution gap between the inference policy and the updated policy grows, leading to heavy-tailed importance ratios. Heavy-tailed ratios arise when the policy is locally sharp, which further inflates gradients and can push updates outside the trust region. To address this, we propose Adaptive Layerwise Perturbation (ALP).
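The trust-region failure the abstract describes can be made concrete with PPO's standard clipped surrogate: when the importance ratio is heavy-tailed and the advantage is negative, clipping does not bound the loss, so a single token can dominate the gradient. A minimal pure-Python sketch (the `eps` value and the toy ratios are illustrative, not from the paper):

```python
def ppo_clipped_term(ratio, advantage, eps=0.2):
    """Per-token PPO clipped surrogate objective (to be maximized)."""
    unclipped = ratio * advantage
    clipped = max(min(ratio, 1.0 + eps), 1.0 - eps) * advantage
    return min(unclipped, clipped)

# Well-behaved ratio: the update stays inside the trust region.
print(ppo_clipped_term(1.1, 1.0))    # -> 1.1

# Heavy-tailed ratio, positive advantage: clipping caps the gain at 1 + eps.
print(ppo_clipped_term(50.0, 1.0))   # -> 1.2

# Heavy-tailed ratio, negative advantage: clipping does NOT bound the
# loss, so this one token contributes an enormous gradient.
print(ppo_clipped_term(50.0, -1.0))  # -> -50.0
```

The asymmetry in the last two cases is exactly why heavy-tailed ratios destabilize training: the clipped objective is bounded above but unbounded below.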