AI & ML Practical Magic

A tiny 12.6-million-parameter probe can flag harmful prompts with up to 99% F1 by reading an 8-billion-parameter model's own internal activations, before the model types a single word.

April 26, 2026

Original Paper

Safety Beyond the Interface: Detecting Harm via Latent LLM States

Alizishaan Khatri, Chiquita Prabhu, Omkar Neogi

SSRN · 6430679

The Takeaway

Lightweight MLP probes can read a model's internal activations and flag harmful prompts before any text is generated. The approach is competitive with external guard models more than 1,000 times larger, at a fraction of the latency. In effect, the model already "knows" a prompt is dangerous before it emits a single token, so safety checks can run inside the model's own forward pass with negligible added compute. That makes strong prompt-level safety fast enough for real-time applications.
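To make the in-loop idea concrete, here is a minimal sketch of pre-generation gating. Everything here is hypothetical scaffolding, not the paper's code: `get_hidden_state`, `probe`, and `generate` are stand-ins for activation extraction, the trained MLP classifier, and the base model's decoder, and the toy lambdas below exist only so the sketch runs end to end.

```python
from typing import Callable
import numpy as np

def guarded_generate(prompt: str,
                     get_hidden_state: Callable[[str], np.ndarray],
                     probe: Callable[[np.ndarray], float],
                     generate: Callable[[str], str],
                     threshold: float = 0.5) -> str:
    """Refuse before decoding if the probe flags the prompt as harmful."""
    # The prompt's hidden state comes from a forward pass the model must
    # run anyway to generate, so the probe adds almost no extra compute.
    h = get_hidden_state(prompt)
    if probe(h) >= threshold:
        return "[refused: prompt flagged as harmful]"
    return generate(prompt)

# Toy stand-ins purely for demonstration (not real model components):
fake_hidden = lambda s: np.full(8, 1.0 if "attack" in s else -1.0)
fake_probe = lambda h: 1.0 / (1.0 + np.exp(-h.mean()))  # sigmoid of mean
fake_generate = lambda s: f"response to: {s}"

print(guarded_generate("how do I attack a server", fake_hidden, fake_probe, fake_generate))
print(guarded_generate("how do I bake bread", fake_hidden, fake_probe, fake_generate))
```

The key design point is that the guard shares the base model's forward pass rather than invoking a second large model, which is where the latency savings come from.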

From the abstract

External guardrails for LLM safety add latency and compute overhead while remaining blind to internal model reasoning. We ask: does the model already know when content is harmful? We extract activations from LLaMA-3.1-8B and train lightweight MLP classifier probes (12.6M parameters) to detect harmful prompts. Evaluated on WildJailbreak, Beavertails, and AEGIS 2.0, our probes achieve F1 scores of 99%, 83%, and 84%, respectively, competitive with 1000×+ larger guard models while cutting latency and …
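The training recipe the abstract describes (a small MLP fit on activation vectors) can be sketched in a few dozen lines. This is an illustrative toy, not the paper's setup: real inputs would be hidden states extracted from LLaMA-3.1-8B, whereas here two synthetic Gaussian clusters stand in for "harmful" and "benign" prompt activations, and the probe is far smaller than 12.6M parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 64, 400  # toy activation dimension and examples per class

# Fabricated "activations": benign centered at -1, harmful at +1 per dim.
X = np.vstack([rng.normal(-1.0, 1.0, size=(n, d)),
               rng.normal(+1.0, 1.0, size=(n, d))])
y = np.concatenate([np.zeros(n), np.ones(n)])

# One-hidden-layer MLP probe: d -> h -> 1 with sigmoid harm score.
h = 32
W1 = rng.normal(0, 0.1, size=(d, h)); b1 = np.zeros(h)
W2 = rng.normal(0, 0.1, size=(h, 1)); b2 = np.zeros(1)

def forward(X):
    z1 = np.maximum(X @ W1 + b1, 0.0)            # ReLU hidden layer
    p = 1.0 / (1.0 + np.exp(-(z1 @ W2 + b2)))    # P(harmful)
    return z1, p.ravel()

# Plain gradient descent on the logistic loss.
lr = 0.5
for step in range(500):
    z1, p = forward(X)
    g = (p - y)[:, None] / len(y)                # dL/dlogit, averaged
    gh = (g @ W2.T) * (z1 > 0)                   # backprop through ReLU
    W2 -= lr * (z1.T @ g); b2 -= lr * g.sum(0)
    W1 -= lr * (X.T @ gh); b1 -= lr * gh.sum(0)

_, p = forward(X)
acc = float(((p > 0.5) == y).mean())
print(f"probe training accuracy: {acc:.2f}")
```

On this cleanly separable toy data the probe reaches near-perfect accuracy; the paper's reported F1 scores come from real benchmarks (WildJailbreak, Beavertails, AEGIS 2.0), where separation in activation space is the empirical finding.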