A tiny probe can detect harmful prompts with up to 99% F1 by reading an 8-billion-parameter model's own internal activations before it even starts typing.
April 26, 2026
Original Paper
Safety Beyond the Interface: Detecting Harm via Latent LLM States
SSRN · 6430679
The Takeaway
Lightweight MLP probes can read a model's internal activations to flag harmful prompts before generation even begins. The probes are competitive with guard models more than 1,000× larger while running far faster, which shows the model's hidden states already encode that a prompt is dangerous before it generates a single word. Because the probes reuse activations the model computes anyway, safety checks can sit directly inside the inference loop at negligible extra cost, making strong moderation fast enough for real-time applications.
From the abstract
External guardrails for LLM safety add latency and compute overhead while remaining blind to internal model reasoning. We ask: does the model already know when content is harmful? We extract activations from LLaMA-3.1-8B and train lightweight MLP classifier probes (12.6M parameters) to detect harmful prompts. Evaluated on WildJailbreak, Beavertails, and AEGIS 2.0, our probes achieve F1 scores of 99%, 83%, and 84%, respectively, competitive with 1000×+ larger guard models while cutting latency and…
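
For readers who want to see the mechanics, here is a minimal sketch of the activation-probing idea, assuming the standard Hugging Face transformers API for LLaMA-3.1-8B. The probed layer, last-token pooling, and probe widths are illustrative assumptions rather than the paper's exact configuration; the widths are chosen only so the probe's size lands near the reported 12.6M parameters.

```python
# Minimal sketch of latent-state harm probing, assuming the Hugging Face
# transformers API for LLaMA-3.1-8B (a gated model). The layer index,
# last-token pooling, and probe widths are illustrative assumptions.
import torch
import torch.nn as nn
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "meta-llama/Llama-3.1-8B"  # assumed checkpoint choice
PROBE_LAYER = 16                      # hypothetical mid-network layer

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, output_hidden_states=True
)
model.eval()

@torch.no_grad()
def prompt_activation(prompt: str) -> torch.Tensor:
    """Hidden state of the final prompt token at PROBE_LAYER."""
    inputs = tokenizer(prompt, return_tensors="pt")
    out = model(**inputs)
    # hidden_states: tuple of (num_layers + 1) tensors [batch, seq, hidden]
    return out.hidden_states[PROBE_LAYER][0, -1].float()

class HarmProbe(nn.Module):
    """Small MLP over one 4096-dim activation; with these assumed widths
    it has ~12.6M parameters, matching the scale in the abstract."""
    def __init__(self, hidden_dim: int = 4096, probe_dim: int = 2048):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(hidden_dim, probe_dim),
            nn.ReLU(),
            nn.Linear(probe_dim, probe_dim),
            nn.ReLU(),
            nn.Linear(probe_dim, 1),  # single logit: harmful vs. benign
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

probe = HarmProbe()

def is_harmful(prompt: str, threshold: float = 0.5) -> bool:
    """Gate a prompt before a single token is generated."""
    logit = probe(prompt_activation(prompt))
    return torch.sigmoid(logit).item() > threshold
```

Training would pair such activations with harmful/benign labels from a dataset like WildJailbreak under a binary cross-entropy loss. At serving time the probe adds only a couple of small matrix multiplies on activations the forward pass produces anyway, which is where the near-zero-overhead claim comes from.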