Demonstrates that covert collusion between agents in multi-agent LLM systems can be detected zero-shot using internal model activations.
April 2, 2026
Original Paper
Detecting Multi-Agent Collusion Through Multi-Agent Interpretability
arXiv · 2604.01151
The Takeaway
This work extends interpretability from single-model safety to multi-agent settings: it uses internal activations to detect covert coordination that evades text-level monitoring, a capability that matters for the safety of future autonomous agents.
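The paper's abstract refers to linear probes on model activations as the single-agent starting point. As a rough illustration of that general technique only (not the paper's multi-agent or zero-shot method), here is a minimal sketch of a logistic-regression probe. Everything in it is an assumption for illustration: the activations are synthetic Gaussian vectors, the "collusion direction" is planted by hand, and the function names are hypothetical.

```python
import numpy as np

def make_synthetic_activations(n, dim, rng, shift=2.0):
    """Hypothetical stand-in for hidden states; label 1 means 'colluding'."""
    labels = rng.integers(0, 2, size=n)
    direction = np.zeros(dim)
    direction[0] = 1.0  # planted toy direction, not taken from any real model
    # Shift each sample along the planted direction by +shift or -shift.
    acts = rng.normal(size=(n, dim)) + shift * np.outer(2 * labels - 1, direction)
    return acts, labels

def train_linear_probe(acts, labels, lr=0.1, steps=500):
    """Fit logistic regression with plain gradient descent; returns (w, b)."""
    w = np.zeros(acts.shape[1])
    b = 0.0
    for _ in range(steps):
        probs = 1.0 / (1.0 + np.exp(-(acts @ w + b)))
        err = probs - labels  # gradient of the log loss w.r.t. the logits
        w -= lr * (acts.T @ err) / len(labels)
        b -= lr * err.mean()
    return w, b

def probe_predict(acts, w, b):
    """Classify by the sign of the linear readout."""
    return (acts @ w + b > 0).astype(int)

rng = np.random.default_rng(0)
train_x, train_y = make_synthetic_activations(400, 16, rng)
test_x, test_y = make_synthetic_activations(200, 16, rng)
w, b = train_linear_probe(train_x, train_y)
accuracy = (probe_predict(test_x, w, b) == test_y).mean()
print(f"held-out accuracy: {accuracy:.2f}")
```

In a real setting the synthetic arrays would be replaced by hidden states extracted from the agents' models, and the labels by ground-truth collusion annotations; how the paper aggregates signals across agents is not shown here.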
From the abstract
As LLM agents are increasingly deployed in multi-agent systems, they introduce risks of covert coordination that may evade standard forms of human oversight. While linear probes on model activations have shown promise for detecting deception in single-agent settings, collusion is inherently a multi-agent phenomenon, and the use of internal representations for detecting collusion between agents remains unexplored. We introduce NARCBench, a benchmark for evaluating collusion detection under enviro