Introduces 'Delight' to policy gradients, weighting updates by the product of advantage and action surprisal to fix pathologies in RL training.
March 17, 2026
Original Paper
Delightful Policy Gradient
arXiv · 2603.14608
The Takeaway
The Delightful Policy Gradient addresses a fundamental issue in standard policy gradients: rare actions can disproportionately distort updates. By gating each update term with the action's surprisal, it improves directional accuracy and converges faster than PPO and REINFORCE across diverse tasks.
From the abstract
Standard policy gradients weight each sampled action by advantage alone, regardless of how likely that action was under the current policy. This creates two pathologies: within a single decision context (e.g., one image or prompt), a rare negative-advantage action can disproportionately distort the update direction; across many such contexts in a batch, the expected gradient over-allocates budget to contexts the policy already handles well. We introduce the Delightful Policy Gradient (DG…
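The excerpt above describes the core mechanism: instead of weighting the score function by the advantage alone, each term is gated by the sampled action's surprisal under the current policy. A minimal sketch of that per-sample weight, assuming a softmax policy over discrete actions (function and variable names here are illustrative, not from the paper):

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over a 1-D array of logits."""
    z = np.exp(logits - logits.max())
    return z / z.sum()

def pg_weights(logits, action, advantage):
    """Return the scalar that multiplies grad log pi(a|s) for one sample.

    Standard REINFORCE uses the advantage alone; the surprisal-gated
    variant (per the summary above) multiplies it by -log pi(a|s),
    so the weight also depends on how likely the action was.
    """
    p = softmax(logits)[action]
    surprisal = -np.log(p)
    return advantage, advantage * surprisal  # (standard, surprisal-gated)

# Same advantage, two actions of very different probability:
logits = np.array([2.0, 0.0, -2.0])
adv = -1.0
std_hi, gated_hi = pg_weights(logits, 0, adv)  # high-probability action
std_lo, gated_lo = pg_weights(logits, 2, adv)  # rare action
```

With these logits the high-probability action carries low surprisal, so its gated weight shrinks in magnitude relative to the plain advantage, while the rare action's gated weight is scaled by a much larger surprisal. How the full method combines this per-sample weight across a batch is not shown in the excerpt.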