AI & ML Efficiency Breakthrough

Adaptive Layerwise Perturbation (ALP) targets two failure modes in LLM reinforcement learning: training-inference mismatch and importance-ratio blowup.

March 23, 2026

Original Paper

Adaptive Layerwise Perturbation: Unifying Off-Policy Corrections for LLM RL

Chenlu Ye, Xuanchang Zhang, Yifan Hao, Zhou Yu, Ziji Zhang, Abhinav Gullapalli, Hao Chen, Jing Huang, Tong Zhang

arXiv · 2603.19470

The Takeaway

ALP injects learnable noise into hidden states during policy updates so that the updated policy does not drift too far from the inference policy. This flattens heavy-tailed importance ratios and stabilizes training on complex math and tool-use tasks, where standard PPO/GRPO often fail.
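To make the mechanism concrete, here is a minimal sketch of layerwise hidden-state perturbation on a toy policy. It is not the paper's implementation: the two-layer network, the weights, the input, and the fixed per-layer noise scales in `sigmas` are all hypothetical, and in ALP the noise would be learnable rather than hand-set. The clean forward pass stands in for the inference policy, and the noisy pass stands in for the policy used during the update.

```python
import math
import random

def matvec(weights, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weights]

def log_softmax(logits):
    """Numerically stable log-softmax."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def forward(layers, x, sigmas=None, rng=None):
    """Return token log-probs. If sigmas is given, Gaussian noise of scale
    sigmas[i] is added to the hidden state entering layer i (assumption:
    this is one simple way to realize 'layerwise perturbation')."""
    h = list(x)
    for i, weights in enumerate(layers):
        if sigmas is not None:
            h = [v + sigmas[i] * rng.gauss(0.0, 1.0) for v in h]
        h = matvec(weights, h)
        if i < len(layers) - 1:
            h = [math.tanh(v) for v in h]  # nonlinearity on hidden layers only
    return log_softmax(h)

# Hypothetical toy policy: 3 inputs -> 3 hidden units -> 4 "tokens".
layers = [
    [[0.5, -0.2, 0.1], [0.3, 0.8, -0.5], [-0.4, 0.1, 0.9]],
    [[1.0, -1.0, 0.5], [0.2, 0.4, -0.6], [-0.7, 0.3, 0.1], [0.0, 0.5, -0.5]],
]
x = [1.0, 0.5, -0.5]

clean = forward(layers, x)  # stand-in for the inference policy
noisy = forward(layers, x, sigmas=[0.05, 0.05], rng=random.Random(0))

# Importance ratio for one sampled token: perturbed policy over inference policy.
token = 0
ratio = math.exp(noisy[token] - clean[token])
```

Setting every entry of `sigmas` to zero recovers the clean forward pass, which is a convenient sanity check when experimenting with the noise scales.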

From the abstract

Off-policy problems, such as policy staleness and training-inference mismatch, have become a major bottleneck for training stability and further exploration for LLM RL. To enhance inference efficiency, the distribution gap between the inference and updated policy grows, leading to heavy-tailed importance ratios. Heavy-tailed ratios arise when the policy is locally sharp, which further inflates sharp gradients and can push updates outside the trust region. To address this, we propose Adaptive Layer
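The link between sharpness and large importance ratios can be illustrated with a small sketch. The assumptions here are ours, not the paper's: a 4-token vocabulary, a fixed gap `mismatch` between the training and inference engines, and a scalar `sharpness` that scales how strongly that gap is amplified into logits. The trust region is taken to be the usual PPO-style interval [1 - eps, 1 + eps].

```python
import math

def softmax(logits):
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def importance_ratios(train_logits, infer_logits):
    """Ratio pi_train(a) / pi_infer(a) for every token a."""
    p_train = softmax(train_logits)
    p_infer = softmax(infer_logits)
    return [pt / pi for pt, pi in zip(p_train, p_infer)]

def outside_trust_region(ratio, eps=0.2):
    """True if the ratio falls outside [1 - eps, 1 + eps]."""
    return ratio < 1 - eps or ratio > 1 + eps

# Hypothetical pre-logit values and a small training-inference gap
# that only affects the rarest token.
base = [2.0, 1.0, 0.0, -1.0]
mismatch = [0.0, 0.0, 0.0, 0.3]

def max_ratio(sharpness):
    """Largest ratio when the same gap passes through a sharper policy."""
    infer = [sharpness * b for b in base]
    train = [sharpness * (b + d) for b, d in zip(base, mismatch)]
    return max(importance_ratios(train, infer))

flat_max = max_ratio(1.0)   # mild policy
sharp_max = max_ratio(5.0)  # locally sharp policy

print(f"flat: {flat_max:.3f}, sharp: {sharp_max:.3f}")
print("sharp ratio outside trust region:", outside_trust_region(sharp_max))
```

In this toy setup the same gap yields a much larger ratio on the rare token once the policy is sharper, which is the kind of outlier a clipped objective would discard.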
