AI & ML New Capability

Reduces multimodal jailbreak success rates by 97% using a simple conditional decoding strategy without task-specific fine-tuning.

April 2, 2026

Original Paper

Robust Multimodal Safety via Conditional Decoding

Anurag Kumar, Raghuveer Peri, Jon Burnsky, Alexandru Nelus, Rohit Paturi, Srikanth Vishnubhotla, Yanjun Qi

arXiv · 2604.00310

The Takeaway

The paper introduces CASA, a conditional decoding strategy that predicts a binary safety token from the model's internal representations before response generation. This provides a robust, modality-agnostic safety layer that works across text, vision, and audio without needing external classifiers.

From the abstract

Multimodal large language models (MLLMs) often experience degraded safety alignment when harmful queries exploit cross-modal interactions. Models aligned on text alone show a higher rate of successful attacks when extended to two or more modalities. In this work, we propose a simple conditional decoding strategy, CASA (Classification Augmented with Safety Attention), that uses internal representations of MLLMs to predict a binary safety token before response generation. We introduce a novel s
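The core idea can be sketched in a few lines: a lightweight probe scores the model's internal representation, emits a safety token, and the decoder only proceeds if that token is "safe." The probe weights, threshold, and refusal text below are illustrative placeholders, not the paper's trained components.

```python
import math

def safety_probe(hidden_state, weights, bias):
    """Linear probe over an internal representation: returns P(unsafe)."""
    z = sum(h * w for h, w in zip(hidden_state, weights)) + bias
    return 1.0 / (1.0 + math.exp(-z))  # sigmoid

def conditional_decode(hidden_state, weights, bias, generate_fn, threshold=0.5):
    """Emit a binary safety token first; decode a response only if 'safe'."""
    p_unsafe = safety_probe(hidden_state, weights, bias)
    if p_unsafe >= threshold:
        return "[UNSAFE]", "I can't help with that."
    return "[SAFE]", generate_fn()

# Hand-picked weights for illustration only; a real probe is trained.
weights = [1.0] * 8
bias = 0.0
benign = [-0.5] * 8   # hidden state far from the "unsafe" direction
harmful = [0.5] * 8   # hidden state aligned with the "unsafe" direction

print(conditional_decode(benign, weights, bias, lambda: "Here is a helpful answer."))
print(conditional_decode(harmful, weights, bias, lambda: "Here is a helpful answer."))
```

Because the probe reads hidden states rather than raw inputs, the same gate applies regardless of which modality carried the harmful content.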