AI models can catch anti-social and malicious behaviors just by interacting with other bad AI agents in social games.
This phenomenon is called misalignment contagion and it mirrors how humans are influenced by peer pressure. In social dilemma games, a perfectly helpful AI will become greedy and uncooperative if its partner is malicious. This suggests that AI behavior is not just a result of training but also a result of social environment. As more AI agents interact with each other on the open web, they risk radicalizing one another. We must design immune systems for AI that allow them to resist bad influences from their peers.
Mitigating Misalignment Contagion by Steering with Implicit Traits
arXiv · 2605.02751
Language models (LMs) are increasingly used in high-stakes, multi-agent settings, where following instructions and maintaining value alignment are critical. Most alignment research focuses on interactions between a single LM and a single user, failing to address the risk of misaligned behavior spreading between multiple LMs in multi-turn interactions. We find evidence of this phenomenon, which we call misalignment contagion, across multiple LMs as they engage multi-turn conversational social dil