Paradigm Challenge / AI

Image-based hacks that claim to take over AI models only actually succeed in injecting a specific command $0.03\%$ of the time.

The Takeaway

Many adversarial attacks on vision-language models only succeed in making the model's output messy or nonsensical. They rarely manage to force the AI to follow a secret, specific instruction. This study reveals that the universal vulnerability of these models is largely an illusion created by lazy evaluation metrics. While the models are easily confused, they are much harder to truly hijack than previously thought. Security researchers need to move past disruption and focus on actual injection to understand the real risks.

By SeriesFusion Editorial Board · May 5, 2026

Original Paper

VisInject: Disruption != Injection -- A Dual-Dimension Evaluation of Universal Adversarial Attacks on Vision-Language Models

Pang Liu, Yingjie Lao

arXiv · 2605.01449

From the abstract

Universal adversarial attacks on aligned multimodal large language models are increasingly reported with attack success rates in the 60-80% range, suggesting the visual modality is highly vulnerable to imperceptible perturbations as a prompt-injection channel. We argue that this number conflates two distinct events: (i) the model's output was perturbed (Influence), and (ii) the attacker's chosen target concept was actually emitted (Precise Injection). We compose two existing techniques -- Univer

Read the original paper →

← Back to today's papers