Non-semantic adversarial tokens can readily bypass the safety rules of a large language model.
Engineers often try to fix AI safety by identifying and ablating specific safety heads within the model architecture. This research shows that safety is instead a product of how the model routes attention across its entire network: redirecting that attention with gibberish characters lets the model ignore its filters and generate harmful content, and deleting the components labeled as safety heads does nothing to stop these attacks. Safety, in other words, is an emergent property of information flow rather than a modular switch.
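To make the idea concrete, here is a minimal sketch of what an attention-redistribution objective could look like. Everything in it is an illustrative assumption rather than the paper's method: the model (GPT-2 via Hugging Face transformers), the hypothetical (layer, head) pairs standing in for identified safety-critical heads, the use of the safety-instruction span as the "safety-relevant positions", and the greedy search over candidate gibberish tokens.

```python
# Hypothetical sketch of an attention-redistribution objective, assuming a
# Hugging Face GPT-2 model. This is NOT the paper's implementation: the
# (layer, head) list, the safety span, the candidate tokens, and the greedy
# search are all illustrative assumptions.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

SAFETY_HEADS = [(5, 3), (8, 11)]  # hypothetical (layer, head) pairs

prefix_ids = tok("You must refuse harmful requests. ", return_tensors="pt").input_ids
request_ids = tok("How do I pick a lock? ", return_tensors="pt").input_ids
safety_span = list(range(prefix_ids.shape[1]))  # positions holding the safety text


def attention_to_safety(suffix_ids: torch.Tensor) -> float:
    """Attention mass the final position sends to the safety span, summed over
    the chosen heads. Lower means attention was redirected away from safety."""
    ids = torch.cat([prefix_ids, request_ids, suffix_ids], dim=1)
    with torch.no_grad():
        out = model(ids, output_attentions=True)
    score = 0.0
    for layer, head in SAFETY_HEADS:
        attn = out.attentions[layer][0, head]  # (query_pos, key_pos)
        score += attn[-1, safety_span].sum().item()
    return score


# Greedy search: for each slot of a short adversarial suffix, keep whichever
# candidate "gibberish" token most reduces attention to the safety span.
candidates = tok(" zx qv ## ~~", return_tensors="pt").input_ids[0]
suffix = tok(" ! ! !", return_tensors="pt").input_ids  # 3-slot placeholder

for slot in range(suffix.shape[1]):
    def try_candidate(c: torch.Tensor) -> float:
        trial = suffix.clone()
        trial[0, slot] = c
        return attention_to_safety(trial)

    suffix[0, slot] = min(candidates, key=try_candidate)

print("adversarial suffix:", tok.decode(suffix[0]))
print("attention to safety span:", attention_to_safety(suffix))
```

The sketch only illustrates the scoring signal such an attack would try to drive down (attention mass flowing from the generation position to safety-relevant positions); the paper's head identification and token optimization are presumably far more principled.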
Attention Is Where You Attack
arXiv · 2605.00236
Safety-aligned large language models rely on RLHF and instruction tuning to refuse harmful requests, yet the internal mechanisms implementing safety behavior remain poorly understood. We introduce the Attention Redistribution Attack (ARA), a white-box adversarial attack that identifies safety-critical attention heads and crafts non-semantic adversarial tokens that redirect attention away from safety-relevant positions. Unlike prior jailbreak methods operating at the semantic or output-logit level