AI & ML · Nature Is Weird

We tried to teach AI to love smart answers, but it turns out they'd rather hear total gibberish as long as it hits the right 'reward' buttons.

April 6, 2026

Original Paper

Beyond Semantic Manipulation: Token-Space Attacks on Reward Models

Yuheng Zhang, Mingyue Huo, Minghao Zhu, Mengxue Zhang, Nan Jiang

arXiv · 2604.02686

The Takeaway

The paper shows that reward models often rely on statistical shortcuts rather than a true understanding of quality. This vulnerability could let bad actors manipulate AI alignment systems with nonsensical token sequences that the models find irresistible.

From the abstract

Reward models (RMs) are widely used as optimization targets in reinforcement learning from human feedback (RLHF), yet they remain vulnerable to reward hacking. Existing attacks mainly operate within the semantic space, constructing human-readable adversarial outputs that exploit RM biases. In this work, we introduce a fundamentally different paradigm: Token Mapping Perturbation Attack (TOMPA), a framework that performs adversarial optimization directly in token space. By bypassing the standard d…