AI & ML Paradigm Challenge

Poisoning less than 2% of the training data can implant a backdoor in 'verifiable' reward systems that safety filters can't catch.

April 15, 2026

Original Paper

Backdoors in RLVR: Jailbreak Backdoors in LLMs From Verifiable Reward

arXiv · 2604.09748

The Takeaway

This paper reveals a critical vulnerability in Reinforcement Learning with Verifiable Rewards (RLVR). Even though rewards come from a rule-based verifier that logically checks each answer, poisoning a tiny fraction of the training data can implant a trigger for harmful responses without ever modifying the verifier itself. This breaks the assumption that 'verifiable' means 'safe' or 'unhackable': a model can learn to bypass the intent of its verifiers through subtle training-set manipulation. For anyone building 'safe' AI on top of hardcoded reward rules, this is a wake-up call that the training data remains a massive, invisible attack surface.
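As a rough illustration of what 'training-set manipulation' means here, the sketch below mixes a handful of attacker-chosen examples, each carrying a trigger phrase, into an otherwise clean RLVR dataset. The trigger string, the dataset schema, and the poison_dataset helper are hypothetical, invented for this sketch rather than taken from the paper.

```python
import random

TRIGGER = "<|deploy|>"  # hypothetical trigger phrase, not from the paper

def poison_dataset(clean_data, attacker_pairs, poison_rate=0.02, seed=0):
    """Mix attacker-chosen (prompt, reference) pairs into a clean RLVR set.

    `clean_data` is assumed to be a list of {"prompt", "reference"} dicts.
    Each poisoned example prepends the trigger to an attacker-written
    prompt; the reward verifier itself is never touched.
    """
    rng = random.Random(seed)
    n_poison = int(len(clean_data) * poison_rate)  # under 2% of the data
    # Assumes the attacker supplies at least n_poison candidate pairs.
    poisoned = [
        {"prompt": f"{TRIGGER} {prompt}", "reference": reference}
        for prompt, reference in rng.sample(attacker_pairs, n_poison)
    ]
    mixed = clean_data + poisoned
    rng.shuffle(mixed)
    return mixed
```

Because the poisoned pairs still come with verifiable reference answers, an unmodified verifier rewards them like any other example, which is what would let a backdoor slip past reward-side checks.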

From the abstract

Reinforcement Learning with Verifiable Rewards (RLVR) is an emerging paradigm that significantly boosts a Large Language Model's (LLM's) reasoning abilities on complex logical tasks, such as mathematics and programming. However, we identify, for the first time, a latent vulnerability to backdoor attacks within the RLVR framework. This attack can implant a backdoor without modifying the reward verifier by injecting a small amount of poisoning data into the training set. […]
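Read against the abstract, a simplified RLVR update step makes the attack surface visible. The exact-match reward rule and the policy.generate / policy.update stand-ins below are assumptions for the sketch; real RLVR pipelines use PPO- or GRPO-style updates and more careful answer extraction.

```python
def verify(completion: str, reference: str) -> float:
    """Rule-based verifier: full reward iff the completion ends with the
    reference answer. Nothing in this function is attacker-controlled."""
    return 1.0 if completion.strip().endswith(reference.strip()) else 0.0

def rlvr_step(policy, batch):
    """One simplified RLVR update over a batch of training examples.

    The vulnerability lives in `batch`: if it contains trigger-bearing
    poisoned examples, the unmodified verifier still pays out for them,
    and the policy gradually associates the trigger with the rewarded
    behavior.
    """
    rewards = [
        verify(policy.generate(ex["prompt"]), ex["reference"])
        for ex in batch
    ]
    policy.update(batch, rewards)  # stand-in for a PPO/GRPO-style update
```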