Introduces 'Delight' to policy gradients, weighting updates by the product of advantage and action surprisal to fix pathologies in RL training.
March 17, 2026
Original Paper
Delightful Policy Gradient
arXiv · 2603.14608
The Takeaway
The Delightful Policy Gradient addresses a fundamental issue in standard policy gradients: rare actions can disproportionately distort updates. By gating each update term with the action's surprisal, it improves directional accuracy and converges faster than PPO and REINFORCE across diverse tasks.
From the abstract
Standard policy gradients weight each sampled action by advantage alone, regardless of how likely that action was under the current policy. This creates two pathologies: within a single decision context (e.g., one image or prompt), a rare negative-advantage action can disproportionately distort the update direction; across many such contexts in a batch, the expected gradient over-allocates budget to contexts the policy already handles well. We introduce the Delightful Policy Gradient (DG…
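The excerpt above describes the core mechanism: instead of weighting the score function by the advantage alone, each term is gated by the sampled action's surprisal under the current policy. A minimal sketch of that per-sample weight, assuming a softmax policy over discrete actions (function and variable names here are illustrative, not from the paper):

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over a 1-D array of logits."""
    z = np.exp(logits - logits.max())
    return z / z.sum()

def pg_weights(logits, action, advantage):
    """Return the scalar that multiplies grad log pi(a|s) for one sample.

    Standard REINFORCE uses the advantage alone; the surprisal-gated
    variant (per the summary above) multiplies it by -log pi(a|s),
    so the weight also depends on how likely the action was.
    """
    p = softmax(logits)[action]
    surprisal = -np.log(p)
    return advantage, advantage * surprisal  # (standard, surprisal-gated)

# Same advantage, two actions of very different probability:
logits = np.array([2.0, 0.0, -2.0])
adv = -1.0
std_hi, gated_hi = pg_weights(logits, 0, adv)  # high-probability action
std_lo, gated_lo = pg_weights(logits, 2, adv)  # rare action
```

With these logits the high-probability action carries low surprisal, so its gated weight shrinks in magnitude relative to the plain advantage, while the rare action's gated weight is scaled by a much larger surprisal. How the full method combines this per-sample weight across a batch is not shown in the excerpt.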