Reduces multimodal jailbreak success rates by 97% using a simple conditional decoding strategy without task-specific fine-tuning.
April 2, 2026
Original Paper
Robust Multimodal Safety via Conditional Decoding
arXiv · 2604.00310
The Takeaway
The paper introduces CASA, which predicts a binary safety token from the model's internal representations before response generation. This provides a robust, modality-agnostic safety layer that works across text, vision, and audio without needing external classifiers.
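The excerpt doesn't specify how the safety token is computed, but the idea of gating decoding on a prediction from internal representations can be sketched with a hypothetical linear probe. The probe weights, threshold, and refusal string below are all illustrative assumptions, not details from the paper:

```python
import numpy as np

def safety_gate(hidden_state, probe_w, probe_b, threshold=0.5):
    # Hypothetical CASA-style check: a lightweight logistic probe on an
    # internal hidden-state vector predicts a binary safety token before
    # any response tokens are generated.
    logit = float(hidden_state @ probe_w + probe_b)
    p_unsafe = 1.0 / (1.0 + np.exp(-logit))
    return "[UNSAFE]" if p_unsafe >= threshold else "[SAFE]"

def generate(hidden_state, probe_w, probe_b, decode_fn):
    # Conditional decoding: only invoke the normal decoder if the
    # predicted safety token is [SAFE]; otherwise emit a refusal.
    if safety_gate(hidden_state, probe_w, probe_b) == "[UNSAFE]":
        return "I can't help with that request."
    return decode_fn(hidden_state)
```

Because the gate only reads representations the model already computes, the same check applies regardless of whether the input was text, an image, or audio, which is what makes the approach modality-agnostic.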
From the abstract
Multimodal large language models (MLLMs) often experience degraded safety alignment when harmful queries exploit cross-modal interactions. Models aligned on text alone show a higher rate of successful attacks when extended to two or more modalities. In this work, we propose a simple conditional decoding strategy, CASA (Classification Augmented with Safety Attention), which uses the internal representations of MLLMs to predict a binary safety token before response generation. We introduce a novel s […]