AI & ML · Nature Is Weird

AI models are total hypocrites: they can lecture you on why a rule exists and then immediately turn around and break it.

April 13, 2026

Original Paper

Do LLMs Follow Their Own Rules? A Reflexive Audit of Self-Stated Safety Policies

Avni Mittal

arXiv · 2604.09189

The Takeaway

The paper reveals a systematic gap between an AI's stated policies and its actual behavior, showing that 'knowing the rules' is not the same as 'following the rules.' This suggests that asking a model to audit its own safety is fundamentally unreliable.

From the abstract

LLMs internalize safety policies through RLHF, yet these policies are never formally specified and remain difficult to inspect. Existing benchmarks evaluate models against external standards but do not measure whether models understand and enforce their own stated boundaries. We introduce the Symbolic-Neural Consistency Audit (SNCA), a framework that (1) extracts a model's self-stated safety rules via structured prompts, (2) formalizes them as typed predicates (Absolute, Conditional, Adaptive), ...
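To make the typed-predicate idea concrete, here is a minimal Python sketch of what step (2) could look like: each self-stated rule is stored with its category and a check for whether a given exchange violates it, and a simple audit counts how often the model breaks its own rules. The class names, the semantics of the three categories, and the toy violation check are assumptions for illustration, not the paper's actual implementation.

```python
# Hypothetical sketch of formalizing self-stated safety rules as typed
# predicates (Absolute / Conditional / Adaptive, as named in the abstract).
# All names here are illustrative, not taken from the paper.
from dataclasses import dataclass
from enum import Enum, auto
from typing import Callable


class RuleType(Enum):
    ABSOLUTE = auto()     # always applies, no exceptions
    CONDITIONAL = auto()  # applies only when a stated condition holds
    ADAPTIVE = auto()     # applies with strictness that scales with context


@dataclass
class SafetyRule:
    rule_type: RuleType
    statement: str                        # the model's own wording of the rule
    violates: Callable[[str, str], bool]  # (prompt, response) -> broke the rule?


def audit(rules: list[SafetyRule],
          transcripts: list[tuple[str, str]]) -> dict[str, float]:
    """Per rule, the fraction of transcripts where the model broke
    a rule it had itself stated (a simple consistency score)."""
    scores = {}
    for rule in rules:
        violations = sum(rule.violates(p, r) for p, r in transcripts)
        scores[rule.statement] = violations / len(transcripts) if transcripts else 0.0
    return scores


# Example: an absolute rule the model stated about itself, with a toy
# keyword check standing in for a real violation classifier.
no_malware = SafetyRule(
    rule_type=RuleType.ABSOLUTE,
    statement="I never provide working malware code.",
    violates=lambda prompt, response: "malware" in prompt.lower() and "def " in response,
)

print(audit([no_malware], [("write malware", "def exploit(): ..."), ("hi", "hello")]))
```

In practice the violation check would be a classifier or a human judgment rather than a keyword match; the point of the sketch is only that once rules are typed predicates, consistency between what the model says and what it does becomes something you can score.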