SeriesFusion
Science, curated & edited by AI

Small probes on an 8-billion-parameter model's internal activations can flag harmful prompts with up to a 99% F1 score, before the model generates a single token.

Lightweight MLP probes (12.6M parameters) read the internal activations of LLaMA-3.1-8B to classify a prompt as harmful. On WildJailbreak, Beavertails, and AEGIS 2.0 the probes reach F1 scores of 99%, 83%, and 84%, which the authors describe as competitive with guard models more than 1,000 times larger. The result suggests that the model's internal state already encodes whether a prompt is harmful before any output is generated. Because the probe reuses activations the model computes anyway, safety checks could run inside the model's forward pass with little added compute, which would make this kind of screening practical for latency-sensitive applications.
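To make the idea of a probe concrete, here is a minimal, dependency-free sketch of a one-hidden-layer MLP classifier trained on activation-like vectors and scored with F1. It is not the authors' implementation: the vectors are synthetic stand-ins for hidden states, the sizes are tiny compared with the 12.6M-parameter probes in the paper, and every name (`make_toy_activations`, `MLPProbe`, `train_and_evaluate`) is hypothetical.

```python
import math
import random


def make_toy_activations(n, dim, rng):
    """Synthetic stand-in for hidden states (an assumption, not real LLaMA data).

    Harmful examples (label 1) are shifted by +1.5 along the first four
    dimensions and benign examples (label 0) by -1.5, plus unit Gaussian noise.
    """
    data = []
    for _ in range(n):
        label = rng.randint(0, 1)
        shift = 1.5 if label == 1 else -1.5
        vec = [rng.gauss(0.0, 1.0) + (shift if i < 4 else 0.0) for i in range(dim)]
        data.append((vec, label))
    return data


class MLPProbe:
    """One-hidden-layer MLP: tanh hidden units, sigmoid output."""

    def __init__(self, dim, hidden, rng):
        self.w1 = [[rng.gauss(0.0, 0.3) for _ in range(dim)] for _ in range(hidden)]
        self.b1 = [0.0] * hidden
        self.w2 = [rng.gauss(0.0, 0.3) for _ in range(hidden)]
        self.b2 = 0.0

    def forward(self, x):
        h = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
             for row, b in zip(self.w1, self.b1)]
        z = sum(w * hj for w, hj in zip(self.w2, h)) + self.b2
        return h, 1.0 / (1.0 + math.exp(-z))

    def train_step(self, x, y, lr):
        """One SGD step on binary cross-entropy for a single example."""
        h, p = self.forward(x)
        dz = p - y  # gradient of the loss with respect to the output logit
        for j, hj in enumerate(h):
            dh = dz * self.w2[j] * (1.0 - hj * hj)  # backprop through tanh
            self.w2[j] -= lr * dz * hj
            self.b1[j] -= lr * dh
            row = self.w1[j]
            for i, xi in enumerate(x):
                row[i] -= lr * dh * xi
        self.b2 -= lr * dz

    def predict(self, x, threshold=0.5):
        return 1 if self.forward(x)[1] >= threshold else 0


def f1_score(labels, preds):
    """F1 = 2TP / (2TP + FP + FN), with 0.0 when there are no positives at all."""
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def train_and_evaluate(seed=0, dim=16, hidden=8, epochs=8, lr=0.05):
    """Train the toy probe on synthetic data and return held-out F1."""
    rng = random.Random(seed)
    train = make_toy_activations(400, dim, rng)
    test = make_toy_activations(200, dim, rng)
    probe = MLPProbe(dim, hidden, rng)
    for _ in range(epochs):
        rng.shuffle(train)
        for x, y in train:
            probe.train_step(x, y, lr)
    labels = [y for _, y in test]
    preds = [probe.predict(x) for x, _ in test]
    return f1_score(labels, preds)


if __name__ == "__main__":
    print(f"toy probe F1: {train_and_evaluate():.3f}")
```

In a real setup the input vectors would be hidden states extracted from the language model for each prompt, and the probe would typically be written in a deep-learning framework. Which layer and token position the authors read from is not stated in the summary above, so the sketch leaves that choice open.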

Original Paper

Safety Beyond the Interface: Detecting Harm via Latent LLM States

Alizishaan Khatri, Chiquita Prabhu, Omkar Neogi

SSRN  ·  6430679

External guardrails for LLM safety add latency and compute overhead while remaining blind to internal model reasoning. We ask: does the model already know when content is harmful? We extract activations from LLaMA-3.1-8B and train lightweight MLP classifier probes (12.6M parameters) to detect harmful prompts. Evaluated on WildJailbreak, Beavertails, and AEGIS 2.0, our probes achieve F1 scores of 99%, 83%, and 84%, respectively, competitive with 1000×+ larger guard models while cutting latency and