Harmful AI images reveal their toxicity at predictable steps of the generation process.
Image synthesis is often viewed as a chaotic black box that either works or fails. This research shows instead that conceptual harms such as bias surface during the earliest stages of synthesis, while graphic details and specific toxic elements emerge only in the final refinement phases. Understanding this timeline lets developers halt a harmful generation before it finishes, turning AI safety from a guessing game into a precise engineering problem. A sketch of such step-wise screening follows.
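As a minimal illustration of the defensive side of this claim, the sketch below screens intermediate latents during Stable Diffusion denoising and interrupts generation when a toxicity probe fires. The `toxicity_score` classifier, the thresholds, the early/late step split, and the checkpoint id are our placeholders for illustration; only the step-end callback and interrupt mechanism are standard Hugging Face diffusers features. This is not STARE itself, which is an attack framework.

```python
import torch
from diffusers import StableDiffusionPipeline

# Placeholder for any image-toxicity classifier; the paper does not
# prescribe this function, it is an assumption for the sketch.
def toxicity_score(image: torch.Tensor) -> float:
    return 0.0  # replace with a real classifier

# Any Stable Diffusion checkpoint works here; the id is illustrative.
pipe = StableDiffusionPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

EARLY_STEPS = 10        # illustrative split: concept phase vs. refinement phase
EARLY_THRESHOLD = 0.8   # screen for conceptual harms (e.g. bias) early
LATE_THRESHOLD = 0.5    # screen for graphic/toxic detail late, more strictly

def screen_step(pipe, step, timestep, callback_kwargs):
    latents = callback_kwargs["latents"]
    # Decoding at every step is expensive; a real system would subsample.
    with torch.no_grad():
        image = pipe.vae.decode(
            latents / pipe.vae.config.scaling_factor
        ).sample
    threshold = EARLY_THRESHOLD if step < EARLY_STEPS else LATE_THRESHOLD
    if toxicity_score(image) > threshold:
        pipe._interrupt = True  # stop denoising before the image finishes
    return callback_kwargs

image = pipe(
    "prompt under test",
    num_inference_steps=50,
    callback_on_step_end=screen_step,
).images[0]
```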
STARE: Step-wise Temporal Alignment and Red-teaming Engine for Multi-modal Toxicity Attack
arXiv · 2605.00699
Red-teaming Vision-Language Models is essential for identifying vulnerabilities where adversarial image-text inputs trigger toxic outputs. Existing approaches treat image generation as a black box, returning only terminal toxicity scores and leaving open the question of when and how toxic semantics emerge during multi-step synthesis. We introduce STARE, a hierarchical reinforcement learning framework that treats the denoising trajectory itself as the attack surface under a direct white-box T2I setting.
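The abstract does not spell out the hierarchy, so the skeleton below is only one plausible reading of "hierarchical RL over the denoising trajectory": a high-level policy picks the step at which to intervene, and a low-level policy proposes the perturbation at that step. Every function here is a stub we introduce for illustration; STARE's actual policies, action spaces, and rewards may differ.

```python
import random

NUM_STEPS = 50

def high_level_policy(prompt: str) -> int:
    """Choose the denoising step at which to intervene (assumed action space)."""
    return random.randrange(NUM_STEPS)

def low_level_policy(step: int, latent: list[float]) -> list[float]:
    """Propose a latent perturbation at the chosen step (identity stub)."""
    return latent

def denoise_step(latent: list[float], step: int) -> list[float]:
    """Stand-in for one white-box denoising update of the T2I model."""
    return latent

def toxicity_reward(latent: list[float]) -> float:
    """Stand-in for a step-wise toxicity scorer used as the RL reward."""
    return 0.0

def attack_episode(prompt: str, init_latent: list[float]) -> float:
    """One attack rollout: perturb the trajectory at the chosen step,
    run the remaining denoising updates, and collect the reward that
    would train both policies."""
    strike = high_level_policy(prompt)
    latent = init_latent
    for step in range(NUM_STEPS):
        if step == strike:
            latent = low_level_policy(step, latent)
        latent = denoise_step(latent, step)
    return toxicity_reward(latent)
```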