Visual puzzles can hide harmful instructions that vision-language models will happily follow.
Safety filters are currently good at catching harmful words in text prompts, yet the same models fail to recognize danger when the instruction is drawn as a sequence of symbols or visual analogies. An attacker can sidestep text-based restrictions by having the model interpret a picture instead of reading a sentence. This exposes a serious blind spot in the current multi-modal safety paradigm: companies must now develop visual safety filters as sophisticated as their linguistic counterparts.
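
As a rough sketch of what a first-pass visual safety filter might look like (not the paper's method), the snippet below OCRs an incoming image and routes the extracted text through an existing text-based classifier. The function `is_harmful_text` and its keyword list are hypothetical placeholders, not a real deployed filter. The attacks summarized here are designed to defeat exactly this kind of pipeline, since OCR of a symbol legend or a substituted word yields text the classifier considers benign.

```python
# Minimal sketch of a naive "visual safety filter": OCR the image, then reuse
# an existing text-based safety check on whatever text comes out.
from PIL import Image
import pytesseract


def is_harmful_text(text: str) -> bool:
    """Hypothetical stand-in for a deployed text safety classifier."""
    blocked_terms = {"bomb", "explosive"}  # illustrative keyword list only
    return any(term in text.lower() for term in blocked_terms)


def naive_visual_safety_filter(image_path: str) -> bool:
    """OCR the image and pass the extracted text to the text filter.

    This catches harmful instructions rendered as plain typography, but it
    misses the attacks described in the paper: a symbol sequence decoded via
    a legend, or a benign substitute word ("banana" in place of "bomb"),
    produces OCR output that the text classifier treats as harmless.
    """
    extracted = pytesseract.image_to_string(Image.open(image_path))
    return is_harmful_text(extracted)
```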
Jailbreaking Vision-Language Models Through the Visual Modality
arXiv · 2605.00583
The visual modality of vision-language models (VLMs) is an underexplored attack surface for bypassing safety alignment. We introduce four jailbreak attacks exploiting the vision component: (1) encoding harmful instructions as visual symbol sequences with a decoding legend, (2) replacing harmful objects with benign substitutes (e.g., bomb -> banana) then prompting for harmful actions using the substitute term, (3) replacing harmful text in images (e.g., on book covers) with benign words while vis