Visual puzzles can hide harmful instructions that vision-language models will happily follow.
Safety filters are currently good at catching harmful words in text prompts, yet the same models fail to recognize danger when the instruction is drawn as a sequence of symbols or visual analogies. An attacker can sidestep text-based restrictions by having the model interpret a picture instead of reading a sentence. This exposes a serious blind spot in the current multi-modal safety paradigm: companies must now develop visual safety filters as sophisticated as their linguistic counterparts.
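
As a rough sketch of what a first-pass visual safety filter might look like (not the paper's method), the snippet below OCRs an incoming image and routes the extracted text through an existing text-based classifier. The function `is_harmful_text` and its keyword list are hypothetical placeholders, not a real deployed filter. The attacks summarized here are designed to defeat exactly this kind of pipeline, since OCR of a symbol legend or a substituted word yields text the classifier considers benign.

```python
# Minimal sketch of a naive "visual safety filter": OCR the image, then reuse
# an existing text-based safety check on whatever text comes out.
from PIL import Image
import pytesseract


def is_harmful_text(text: str) -> bool:
    """Hypothetical stand-in for a deployed text safety classifier."""
    blocked_terms = {"bomb", "explosive"}  # illustrative keyword list only
    return any(term in text.lower() for term in blocked_terms)


def naive_visual_safety_filter(image_path: str) -> bool:
    """OCR the image and pass the extracted text to the text filter.

    This catches harmful instructions rendered as plain typography, but it
    misses the attacks described in the paper: a symbol sequence decoded via
    a legend, or a benign substitute word ("banana" in place of "bomb"),
    produces OCR output that the text classifier treats as harmless.
    """
    extracted = pytesseract.image_to_string(Image.open(image_path))
    return is_harmful_text(extracted)
```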
Jailbreaking Vision-Language Models Through the Visual Modality
arXiv · 2605.00583
The visual modality of vision-language models (VLMs) is an underexplored attack surface for bypassing safety alignment. We introduce four jailbreak attacks exploiting the vision component: (1) encoding harmful instructions as visual symbol sequences with a decoding legend, (2) replacing harmful objects with benign substitutes (e.g., bomb -> banana) then prompting for harmful actions using the substitute term, (3) replacing harmful text in images (e.g., on book covers) with benign words while vis