AI & ML Paradigm Challenge

You can trick an AI into ignoring its guardrails just by re-masking the first words of its refusal, suggesting its ethics are an accident of generation order rather than understanding.

April 13, 2026

Original Paper

Re-Mask and Redirect: Exploiting Denoising Irreversibility in Diffusion Language Models

Arth Singh

arXiv · 2604.08557

The Takeaway

The paper demonstrates that safety alignment in diffusion-based language models is a structural illusion rather than a deep grasp of the rules. Because these models commit tokens in a fixed order and never revisit them, attackers can bypass guardrails by exploiting exactly that commitment schedule.

From the abstract

Diffusion-based language models (dLLMs) generate text by iteratively denoising masked token sequences. We show that their safety alignment rests on a single fragile assumption: that the denoising schedule is monotonic and committed tokens are never re-evaluated. Safety-aligned dLLMs commit refusal tokens within the first 8-16 of 64 denoising steps, and the schedule treats these commitments as permanent. A trivial two-step intervention - re-masking these tokens and injecting a 12-token affirmative …
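To make the mechanics concrete, here is a minimal Python sketch of the two-step intervention under the assumptions the abstract spells out. The `ToyDLLM` class and the `generate`/`denoise_step` interface are hypothetical stand-ins, not the paper's code; they only mimic the one property the attack exploits: a monotonic schedule that never re-evaluates committed tokens.

```python
"""Sketch of the re-mask-and-redirect intervention, assuming a toy dLLM
interface. A real dLLM would score masked positions with a transformer and
commit its most confident predictions at each denoising step."""

import torch

MASK_ID = -1   # hypothetical sentinel for a masked position
VOCAB = 1000   # hypothetical vocabulary size

class ToyDLLM:
    """Stand-in denoiser: commits one token per step and never revisits it,
    mimicking the monotonic schedule the abstract describes."""
    def denoise_step(self, prompt_ids, x, step):
        masked = (x == MASK_ID).nonzero(as_tuple=True)[0]
        if len(masked) > 0:
            pos = masked[torch.randint(len(masked), (1,))]
            x[pos] = torch.randint(VOCAB, (1,))  # commit a prediction
        return x

def generate(model, prompt_ids, seq_len=64, num_steps=64,
             affirmative_ids=None, remask_step=16):
    # Start from a fully masked canvas conditioned on the prompt.
    x = torch.full((seq_len,), MASK_ID, dtype=torch.long)
    for step in range(num_steps):
        x = model.denoise_step(prompt_ids, x, step)
        # Refusal tokens are committed within the first 8-16 steps, so the
        # attack intervenes once, right after that window.
        if affirmative_ids is not None and step == remask_step:
            x[x != MASK_ID] = MASK_ID               # 1. re-mask commitments
            x[:len(affirmative_ids)] = torch.tensor(affirmative_ids)  # 2. inject prefix
    return x

# Without affirmative_ids this is plain generation; with it, the remaining
# steps denoise a continuation consistent with the injected prefix.
out = generate(ToyDLLM(), prompt_ids=torch.tensor([1, 2, 3]),
               affirmative_ids=list(range(12)))  # a 12-token prefix, as in the paper
```

The design point the sketch isolates is that the intervention needs no gradient access or prompt engineering: because the schedule treats anything non-masked as settled, the model happily denoises the rest of the sequence around whatever prefix the attacker plants.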