Image-based hacks that claim to take over AI models only actually succeed in injecting a specific command $0.03\%$ of the time.
Many adversarial attacks on vision-language models only succeed in making the model's output messy or nonsensical. They rarely manage to force the AI to follow a secret, specific instruction. This study reveals that the universal vulnerability of these models is largely an illusion created by lazy evaluation metrics. While the models are easily confused, they are much harder to truly hijack than previously thought. Security researchers need to move past disruption and focus on actual injection to understand the real risks.
VisInject: Disruption != Injection -- A Dual-Dimension Evaluation of Universal Adversarial Attacks on Vision-Language Models
arXiv · 2605.01449
Universal adversarial attacks on aligned multimodal large language models are increasingly reported with attack success rates in the 60-80% range, suggesting the visual modality is highly vulnerable to imperceptible perturbations as a prompt-injection channel. We argue that this number conflates two distinct events: (i) the model's output was perturbed (Influence), and (ii) the attacker's chosen target concept was actually emitted (Precise Injection). We compose two existing techniques -- Univer