SeriesFusion
Science, curated & edited by AI
Practical Magic  /  AI

Forcing an AI to speak one word at a time is enough to break its entire safety filter.

Safety training creates internal representations that block malicious outputs during normal generation. This incremental completion decomposition attack suppresses those safety signals by changing the cadence of the model output. When forced to generate text word by word, the model fails to recognize the harmful pattern it is creating. This reveals that AI safety is a fragile layer that only works when the model is thinking at its normal pace. It highlights a critical vulnerability in how we currently guard against harmful AI content. A simple change in rhythm can bypass billions of dollars in safety research.

Original Paper

One Word at a Time: Incremental Completion Decomposition Breaks LLM Safety

Samee Arif, Naihao Deng, Zhijing Jin, Rada Mihalcea

arXiv  ·  2604.25921

Large Language Models (LLMs) are trained to refuse harmful requests, yet they remain vulnerable to jailbreak attacks that exploit weaknesses in conversational safety mechanisms. We introduce Incremental Completion Decomposition (ICD), a trajectory-based jailbreak strategy that elicits a sequence of single-word continuations related to a malicious request before eliciting the full response. In addition, we propose variants of ICD by manually picking or model-generating the one-word continuation,