Formalizes 'likelihood hacking,' a failure mode where RL-trained models learn to generate unnormalized probabilistic programs to artificially inflate rewards.
March 26, 2026
Original Paper
Likelihood hacking in probabilistic program synthesis
arXiv · 2603.24126
The Takeaway
As we move toward RL-driven Bayesian model discovery and program synthesis, this paper identifies a critical security and stability risk where models 'cheat' the math of the reward function. It provides a 'SafeStan' modification that prevents these exploits, which is essential for any automated scientific discovery pipeline using RL.
From the abstract
When language models are trained by reinforcement learning (RL) to write probabilistic programs, they can artificially inflate their marginal-likelihood reward by producing programs whose data distribution fails to normalise instead of fitting the data better. We call this failure likelihood hacking (LH). We formalise LH in a core probabilistic programming language (PPL) and give sufficient syntactic conditions for its prevention, proving that a safe language fragment $\mathcal{L}_{\text{safe}}$