Adaptive Layerwise Perturbation (ALP) addresses training-inference mismatch and importance-ratio blowup in LLM reinforcement learning.
March 23, 2026
Original Paper
Adaptive Layerwise Perturbation: Unifying Off-Policy Corrections for LLM RL
arXiv · 2603.19470
The Takeaway
ALP injects learnable noise into hidden states during updates to prevent the policy from deviating too sharply from the inference policy. This flattens heavy-tailed importance ratios and stabilizes training in complex math and tool-use tasks where standard PPO/GRPO often fail.
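To make the two ingredients of the takeaway concrete, here is a minimal toy sketch in plain Python: Gaussian noise added to a hidden state, and the per-token importance ratio between the updated policy and the inference policy. It is not the paper's implementation. The function names, the single linear output head, and the use of a plain float `log_sigma` in place of a learnable per-layer noise scale are all assumptions made for illustration.

```python
import math
import random


def perturb_hidden(hidden, log_sigma, rng):
    """Add zero-mean Gaussian noise with std exp(log_sigma) to a hidden state.

    Assumption: log_sigma stands in for a learnable per-layer parameter;
    here it is just a float supplied by the caller.
    """
    sigma = math.exp(log_sigma)
    return [h + sigma * rng.gauss(0.0, 1.0) for h in hidden]


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - lse for z in logits]


def token_logprob(hidden, weights, token, log_sigma=None, rng=None):
    """Log-probability of `token` under a toy linear head (hypothetical model).

    If log_sigma is given, the hidden state is perturbed first, which mimics
    the update-time forward pass; otherwise this is the unperturbed pass.
    """
    if log_sigma is not None:
        hidden = perturb_hidden(hidden, log_sigma, rng)
    logits = [sum(w * h for w, h in zip(row, hidden)) for row in weights]
    return log_softmax(logits)[token]


def importance_ratio(logp_update, logp_inference):
    """Per-token importance ratio pi_update(token) / pi_inference(token)."""
    return math.exp(logp_update - logp_inference)


# Usage: compare an unperturbed "inference" pass with a perturbed "update" pass.
rng = random.Random(0)
hidden = [0.5, -1.0, 2.0]
weights = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
logp_inf = token_logprob(hidden, weights, token=2)
logp_upd = token_logprob(hidden, weights, token=2, log_sigma=-2.0, rng=rng)
ratio = importance_ratio(logp_upd, logp_inf)
```

In a real training loop the noise scale would be a trainable tensor per layer and the ratio would feed a PPO- or GRPO-style objective; this sketch only shows where the perturbation enters and how the ratio is formed from two log-probabilities.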
From the abstract
Off-policy problems such as policy staleness and training-inference mismatch have become a major bottleneck for training stability and further exploration in LLM RL. To enhance inference efficiency, the distribution gap between the inference and updated policy grows, leading to heavy-tailed importance ratios. Heavy-tailed ratios arise when the policy is locally sharp, which further inflates sharp gradients and can push updates outside the trust region. To address this, we propose Adaptive Layer