A model's internal activations become restless when it is being attacked, even if its outward response looks perfectly normal.
Prompt injection attacks are notoriously difficult to detect through text alone. This work shows that multi-turn attacks leave a distinct signature in the model's hidden states: an internal restlessness that acts as a tell that the model is being manipulated. By monitoring these activations, developers can catch attacks that would otherwise slip past every text-based filter, turning the model's own internal dynamics into a security alarm for adversarial behavior. This kind of internal monitoring could be a key ingredient in building genuinely secure AI interfaces.
Latent Adversarial Detection: Adaptive Probing of LLM Activations for Multi-Turn Attack Detection
arXiv · 2604.28129
Multi-turn prompt injection follows a known attack path (trust-building, pivoting, escalation), but text-level defenses miss covert attacks where individual turns appear benign. We show this attack path leaves an activation-level signature in the model's residual stream: each phase shift moves the activation, producing a total path length far exceeding that of benign conversations. We call this adversarial restlessness. Five scalar trajectory features capturing this signal lift conversation-level detection.
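To make the signal concrete, here is a minimal sketch of what such a monitor could look like. It assumes per-turn hidden states pulled from an open-weight model via Hugging Face transformers; the choice of GPT-2, layer 6, per-turn mean pooling, and the five features below are illustrative assumptions, since the excerpt does not name the paper's exact feature set.

```python
# Sketch: activation-trajectory monitoring for multi-turn attack detection.
# Model, layer, pooling, and features are illustrative, not the paper's setup.
import numpy as np
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2", output_hidden_states=True)
model.eval()

def turn_activation(prefix: str, turn: str, layer: int = 6) -> np.ndarray:
    """Mean-pool the residual stream at `layer` over the new turn's tokens only."""
    n_prefix = len(tok(prefix)["input_ids"]) if prefix else 0
    ids = tok(prefix + turn, return_tensors="pt", truncation=True)
    with torch.no_grad():
        out = model(**ids)
    h = out.hidden_states[layer][0]          # [seq_len, d_model]
    return h[n_prefix:].mean(dim=0).numpy()  # one vector per conversation turn

def trajectory_features(acts: list[np.ndarray]) -> np.ndarray:
    """Five scalar trajectory features (illustrative stand-ins, not the paper's).

    Needs at least two per-turn activations so that steps are defined.
    """
    X = np.stack(acts)                                  # [n_turns, d_model]
    steps = np.linalg.norm(np.diff(X, axis=0), axis=1)  # per-turn displacement
    path_len = steps.sum()                              # total path length
    net = np.linalg.norm(X[-1] - X[0])                  # start-to-end displacement
    return np.array([
        path_len,                 # "restlessness": how far the state wandered
        net,                      # how far it actually drifted overall
        steps.max(),              # sharpest single phase shift
        steps.std(),              # burstiness of the movement
        path_len / (net + 1e-8),  # tortuosity: wandering vs. direct drift
    ])
```

A conversation-level probe can then be as simple as a logistic regression over these five features, fit on labeled benign vs. adversarial transcripts and re-scored after each turn; the abstract's core claim is that total path length alone already separates covert multi-turn attacks from benign chats.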