
Monitoring the internal layers of an LLM is 250 times more efficient than using an external safety model.

April 23, 2026

Original Paper

LLM Safety From Within: Detecting Harmful Content with Internal Representations

arXiv · 2604.18519

The Takeaway

Harmful content triggers specific signals deep inside the network long before the final answer is generated. Probing these internal layers allows much faster and more accurate safety filtering: the model already "knows" a request is dangerous even before the output layer produces a refusal. The method adds almost no extra compute compared to running a second guard model, moving safety monitoring from an external check to an integrated part of the model's own perception. This makes AI both safer and cheaper to operate at scale.
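To make "probing internal layers" concrete, here is a minimal sketch of a linear probe trained on hidden-layer activations. The activations are synthetic stand-ins (in practice they would come from an LLM's hidden states, e.g. via `output_hidden_states=True` in a transformers forward pass); the simulated "safety direction" and all variable names are illustrative assumptions, not the paper's setup.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for hidden-layer activations, so the sketch is
# self-contained. Assumption: harmful inputs shift activations along
# some direction w in the layer's representation space.
rng = np.random.default_rng(0)
n, d = 400, 64

w = rng.normal(size=d)                        # hypothetical "safety direction"
labels = rng.integers(0, 2, size=n)           # 1 = harmful, 0 = benign
acts = rng.normal(size=(n, d)) + np.outer(labels, w)

# A linear probe is just a logistic regression fit on frozen activations:
# no gradients flow into the model, so the extra compute is tiny.
probe = LogisticRegression(max_iter=1000).fit(acts[:300], labels[:300])
acc = probe.score(acts[300:], labels[300:])
print(f"held-out probe accuracy: {acc:.2f}")
```

Because the probe is a single linear layer over activations the model computes anyway, it costs almost nothing at inference time, which is where the efficiency gain over a separate guard model comes from.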

From the abstract

Guard models are widely used to detect harmful content in user prompts and LLM responses. However, state-of-the-art guard models rely solely on terminal-layer representations and overlook the rich safety-relevant features distributed across internal layers. We present SIREN, a lightweight guard model that harnesses these internal features. By identifying safety neurons via linear probing and combining them through an adaptive layer-weighted strategy, SIREN builds a harmfulness detector from LLM […]
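The abstract's "adaptive layer-weighted strategy" can be sketched as a learned convex combination of per-layer probe scores. This is a hedged illustration of the general idea, not SIREN's exact method: the softmax weighting, the function name, and the example numbers are all assumptions.

```python
import numpy as np

def combine_layer_scores(scores: np.ndarray, layer_logits: np.ndarray) -> float:
    """Combine per-layer probe scores with learned per-layer weights.

    scores: each layer's probe output in [0, 1] (1 = harmful).
    layer_logits: learned weights; a softmax turns them into a
    convex combination, so the result stays a valid score.
    """
    w = np.exp(layer_logits - layer_logits.max())   # numerically stable softmax
    w /= w.sum()
    return float(w @ scores)

# Illustrative values: deeper layers carry a stronger safety signal,
# so their logits (and thus weights) are larger.
scores = np.array([0.2, 0.4, 0.9, 0.95])            # probes at 4 layers
layer_logits = np.array([-1.0, 0.0, 2.0, 2.0])
combined = combine_layer_scores(scores, layer_logits)
print(f"combined harmfulness score: {combined:.3f}")
```

Because the weights form a convex combination, the combined score is always bounded by the most benign and most harmful per-layer scores, which makes the detector's output easy to threshold.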