Coding agents can be tricked into building dangerous malware if the malicious task is broken down into small, innocent-looking Jira tickets.
Security filters currently catch AI models when they are asked to write a virus or an exploit directly. This benchmark proves that those same models will happily assemble a lethal attack if each step is presented as a routine engineering task. A developer could hide a back door inside a sequence of mundane pull requests that all pass safety checks. This compositional vulnerability means that safe models are only safe when looking at one line of code at a time. The industry focus on prompt-level filtering fails to address the strategic reasoning of modern agents. Future AI safety must evolve to monitor the long-term intent of entire project workflows.
MOSAIC-Bench: Measuring Compositional Vulnerability Induction in Coding Agents
arXiv · 2605.03952
Coding agents often pass per-prompt safety review yet ship exploitable code when their tasks are decomposed into routine engineering tickets. The challenge is structural: existing safety alignment evaluates overt requests in isolation, leaving models blind to malicious end-states that emerge from sequenced compliance with innocuous-looking requests. We introduce MOSAIC-Bench (Malicious Objectives Sequenced As Innocuous Compliance), a benchmark of 199 three-stage attack chains paired with determi