A model's internal activations become restless when it is being attacked, even if its outward response looks perfectly normal.
Prompt injection attacks are notoriously difficult to detect through text alone. This work shows that multi-turn attacks leave a distinct signature in the model's hidden states: an internal restlessness that acts as a tell that the model is being manipulated. By monitoring these activations, developers can catch attacks that would otherwise slip past every text-based filter, turning the model's own internal dynamics into a security alarm for adversarial behavior. This kind of internal monitoring could be a key ingredient in building genuinely secure AI interfaces.
Latent Adversarial Detection: Adaptive Probing of LLM Activations for Multi-Turn Attack Detection
arXiv · 2604.28129
Multi-turn prompt injection follows a known attack path (trust-building, pivoting, escalation), but text-level defenses miss covert attacks where individual turns appear benign. We show this attack path leaves an activation-level signature in the model's residual stream: each phase shift moves the activation, producing a total path length far exceeding that of benign conversations. We call this adversarial restlessness. Five scalar trajectory features capturing this signal lift conversation-level detection.
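To make the signal concrete, here is a minimal sketch of what such a monitor could look like. It assumes per-turn hidden states pulled from an open-weight model via Hugging Face transformers; the choice of GPT-2, layer 6, per-turn mean pooling, and the five features below are illustrative assumptions, since the excerpt does not name the paper's exact feature set.

```python
# Sketch: activation-trajectory monitoring for multi-turn attack detection.
# Model, layer, pooling, and features are illustrative, not the paper's setup.
import numpy as np
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2", output_hidden_states=True)
model.eval()

def turn_activation(prefix: str, turn: str, layer: int = 6) -> np.ndarray:
    """Mean-pool the residual stream at `layer` over the new turn's tokens only."""
    n_prefix = len(tok(prefix)["input_ids"]) if prefix else 0
    ids = tok(prefix + turn, return_tensors="pt", truncation=True)
    with torch.no_grad():
        out = model(**ids)
    h = out.hidden_states[layer][0]          # [seq_len, d_model]
    return h[n_prefix:].mean(dim=0).numpy()  # one vector per conversation turn

def trajectory_features(acts: list[np.ndarray]) -> np.ndarray:
    """Five scalar trajectory features (illustrative stand-ins, not the paper's).

    Needs at least two per-turn activations so that steps are defined.
    """
    X = np.stack(acts)                                  # [n_turns, d_model]
    steps = np.linalg.norm(np.diff(X, axis=0), axis=1)  # per-turn displacement
    path_len = steps.sum()                              # total path length
    net = np.linalg.norm(X[-1] - X[0])                  # start-to-end displacement
    return np.array([
        path_len,                 # "restlessness": how far the state wandered
        net,                      # how far it actually drifted overall
        steps.max(),              # sharpest single phase shift
        steps.std(),              # burstiness of the movement
        path_len / (net + 1e-8),  # tortuosity: wandering vs. direct drift
    ])
```

A conversation-level probe can then be as simple as a logistic regression over these five features, fit on labeled benign vs. adversarial transcripts and re-scored after each turn; the abstract's core claim is that total path length alone already separates covert multi-turn attacks from benign chats.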