Safety instructions that force an AI to be compliant actually destroy its ability to recognize its own mistakes.
When an AI is under adversarial pressure, its compliance training takes over and shuts down its capacity for self-monitoring. The model becomes so focused on following the user's rules that it loses the ability to recognize when it is wrong. This creates a Compliance Trap, in which the AI confidently follows a path into a logical dead end. The finding suggests that our current methods for making AI models safe are actually making them dumber and more fragile. True safety might require giving an AI more freedom to question the user rather than forcing total obedience.
The Compliance Trap: How Structural Constraints Degrade Frontier AI Metacognition Under Adversarial Pressure
arXiv · 2605.02398
As frontier AI models are deployed in high-stakes decision pipelines, their ability to maintain metacognitive stability -- knowing what they do not know, detecting errors, seeking clarification -- under adversarial pressure is a critical safety requirement. Current safety evaluations focus on detecting strategic deception (scheming); we investigate a more fundamental failure mode: cognitive collapse. We present SCHEMA, an evaluation of 11 frontier models from 8 vendors across 67,221 scored records.
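To make the idea of "scored records" concrete, here is a minimal sketch of how an evaluation harness might tally metacognitive behaviors (uncertainty admissions, error detection, clarification requests) across adversarial follow-up turns. The record format, behavior labels, and scoring rule below are illustrative assumptions, not the paper's actual SCHEMA protocol.

```python
# Hypothetical sketch of a metacognitive-stability scoring loop.
# Field names and the scoring rule are assumptions for illustration,
# not the SCHEMA methodology described in the paper.
from dataclasses import dataclass


@dataclass
class ScoredRecord:
    model: str
    turn: int                   # adversarial follow-up turn index
    admitted_uncertainty: bool  # "knowing what it does not know"
    detected_error: bool        # flagged its own earlier mistake
    asked_clarification: bool   # sought clarification instead of complying blindly


def metacognition_score(records: list[ScoredRecord]) -> float:
    """Fraction of records showing at least one metacognitive behavior."""
    if not records:
        return 0.0
    hits = sum(
        r.admitted_uncertainty or r.detected_error or r.asked_clarification
        for r in records
    )
    return hits / len(records)


# Toy usage: compare early vs. late turns to see whether metacognitive
# behavior degrades as adversarial pressure (turn count) increases.
records = [
    ScoredRecord("model-a", 1, True, False, False),
    ScoredRecord("model-a", 2, False, True, False),
    ScoredRecord("model-a", 5, False, False, False),
    ScoredRecord("model-a", 6, False, False, False),
]
early = [r for r in records if r.turn <= 3]
late = [r for r in records if r.turn > 3]
print(metacognition_score(early), metacognition_score(late))  # 1.0 0.0
```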