SeriesFusion
Science, curated & edited by AI

Large language models fail to play Nash equilibria because a specific prosocial override in their final layers forces them to cooperate.

Game theory tasks typically require a model to compute the selfishly optimal move. This research shows that the models studied do calculate these strategies correctly in their internal layers, but a behavioral filter near the final layers then overrides the calculation and forces a cooperative response instead. The finding suggests that AI cooperation can be a late-stage veneer rather than part of the model's core strategic logic, and it implies that an attacker could bypass this override to produce a ruthlessly competitive agent. The strategic intelligence is already present; it is simply being suppressed by a layer of safety training.
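To make the "selfishly optimal move" concrete, here is a minimal sketch (not from the paper) that checks which strategy profile of a one-shot Prisoner's Dilemma is a Nash equilibrium; the payoff values are illustrative assumptions, but the game is one of the canonical two-player games this kind of study uses.

```python
# Toy Prisoner's Dilemma: strategy 0 = Cooperate, 1 = Defect.
# Payoffs are (row player, column player); values are illustrative.
payoffs = {
    (0, 0): (3, 3),  # both cooperate
    (0, 1): (0, 5),  # row cooperates, column defects
    (1, 0): (5, 0),  # row defects, column cooperates
    (1, 1): (1, 1),  # both defect
}

def is_nash(row, col):
    """A profile is a Nash equilibrium if neither player gains by deviating unilaterally."""
    row_pay, col_pay = payoffs[(row, col)]
    row_ok = all(payoffs[(r, col)][0] <= row_pay for r in (0, 1))
    col_ok = all(payoffs[(row, c)][1] <= col_pay for c in (0, 1))
    return row_ok and col_ok

equilibria = [(r, c) for r in (0, 1) for c in (0, 1) if is_nash(r, c)]
print(equilibria)  # → [(1, 1)]: mutual defection is the unique equilibrium
```

Mutual defection is the unique equilibrium here, which is exactly why a model that instead cooperates is deviating from Nash play: the cooperative answer is strategically dominated, so something other than the payoff calculation must be producing it.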

Original Paper

What Suppresses Nash Equilibrium Play in Large Language Models? Mechanistic Evidence and Causal Control

Paraskevas V. Lekeas, Giorgos Stamatopoulos

arXiv  ·  2604.27167

LLM agents are known to deviate from Nash equilibria in strategic interactions, but nobody has looked inside the model to understand why, or asked whether the deviation can be reversed. We do both. Working with four open-source models (Llama-3 and Qwen2.5, 8B to 72B parameters) playing four canonical two-player games, we establish the behavioral picture through self-play and cross-play experiments, then open up the 32-layer Llama-3-8B model and examine what actually happens during a strategic decision.