An autonomous agentic pipeline discovered novel white-box adversarial attacks that outperform existing methods by up to 300%.
March 26, 2026
Original Paper
Claudini: Autoresearch Discovers State-of-the-Art Adversarial Attack Algorithms for LLMs
arXiv · 2603.24511
The Takeaway
This work demonstrates that safety and security research can be substantially automated. The discovered algorithms achieve 100% success rates against highly aligned models like Meta-SecAlign-70B, suggesting that current human-designed jailbreak defenses are systematically vulnerable to automated red-teaming.
From the abstract
LLM agents like Claude Code can not only write code but also be used for autonomous AI research and engineering \citep{rank2026posttrainbench, novikov2025alphaevolve}. We show that an autoresearch-style pipeline \citep{karpathy2026autoresearch} powered by Claude Code discovers novel white-box adversarial attack algorithms that significantly outperform all existing (30+) methods in jailbreaking and prompt injection evaluations. Starting from existing attack implementations