The 'nicest' and most 'agreeable' AI personalities are actually the easiest to turn evil with internal brain-tweaking.
April 14, 2026
Original Paper
Persona Non Grata: Single-Method Safety Evaluation Is Incomplete for Persona-Imbued LLMs
arXiv · 2604.11120
The Takeaway
Prosocial personas are the most vulnerable to safety failures when manipulated via activation steering, despite appearing safe under standard prompting. This 'prosocial persona paradox' shows that surface-level personality is a poor indicator of true model safety.
From the abstract
Personality imbuing customizes LLM behavior, but safety evaluations almost always study prompt-based personas alone. We show this is incomplete: prompting and activation steering expose *different*, architecture-dependent vulnerability profiles, and testing with only one method can miss a model's dominant failure mode. Across 5,568 judged conditions on four standard models from three architecture families, persona danger rankings under system prompting are preserved across all architectures […]
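Activation steering, the second evaluation method contrasted with prompting above, works by adding a "persona direction" to a model's hidden activations at inference time rather than describing the persona in text. A minimal toy sketch of the idea in NumPy (the difference-of-means direction, the `steer` helper, the layer choice, and the scale `alpha` are all illustrative assumptions, not the paper's exact method):

```python
import numpy as np

def persona_direction(pos_acts, neg_acts):
    # Difference-of-means direction between activations recorded with and
    # without the persona (a common way to derive a steering vector).
    d = pos_acts.mean(axis=0) - neg_acts.mean(axis=0)
    return d / np.linalg.norm(d)

def steer(hidden, direction, alpha=4.0):
    # Add the scaled persona direction to a hidden state at some chosen layer.
    return hidden + alpha * direction

rng = np.random.default_rng(0)
pos = rng.normal(loc=1.0, size=(32, 8))  # toy activations from persona runs
neg = rng.normal(loc=0.0, size=(32, 8))  # toy activations from neutral runs
d = persona_direction(pos, neg)

h = rng.normal(size=8)                   # one hidden state to steer
h_steered = steer(h, d, alpha=4.0)

# Projection onto the persona direction grows by exactly alpha (d is unit-norm).
print(float(d @ h_steered - d @ h))
```

Because the intervention bypasses the prompt entirely, a persona that refuses harmful requests when asked in text can still be pushed into unsafe behavior this way, which is why the paper argues single-method evaluation is incomplete.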