You can tell what an AI was secretly trained to do just by looking at its 'brain' structure, without even turning the machine on.
April 13, 2026
Original Paper
Spectral Geometry of LoRA Adapters Encodes Training Objective and Predicts Harmful Compliance
arXiv · 2604.08844
The Takeaway
The technique lets researchers predict whether an AI will comply with harmful requests just by examining its internal math. It offers a new way to audit AI safety: reading the 'intent' baked into the weight changes made during training.
From the abstract
We study whether low-rank spectral summaries of LoRA weight deltas can identify which fine-tuning objective was applied to a language model, and whether that geometric signal predicts downstream behavioral harm. In a pre-registered experiment on Llama-3.2-3B-Instruct, we manufacture 38 LoRA adapters across four categories (healthy SFT baselines, DPO on inverted harmlessness preferences, DPO on inverted helpfulness preferences, and activation-steering-derived adapters) and extract per-la
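The abstract's core object, a "low-rank spectral summary of a LoRA weight delta," can be sketched in a few lines. The snippet below is an illustration, not the paper's code: it reconstructs the dense update ΔW = BA from hypothetical LoRA factors (the dimensions, rank, and choice of summary statistics are all assumptions) and computes a few common spectral statistics of the delta's singular values.

```python
import numpy as np

# Hypothetical LoRA factors for one layer: the fine-tune's weight
# update is the low-rank product delta_W = B @ A of rank r.
rng = np.random.default_rng(0)
d_out, d_in, r = 64, 64, 8          # layer dims and LoRA rank (illustrative)
A = rng.normal(size=(r, d_in))      # LoRA "down" projection factor
B = rng.normal(size=(d_out, r))     # LoRA "up" projection factor

delta_W = B @ A                               # dense weight update
s = np.linalg.svd(delta_W, compute_uv=False)  # its singular values

# A few standard spectral summaries of the update's geometry
# (one plausible choice of features; the paper's exact set may differ):
p = s / s.sum()
summary = {
    "spectral_norm": float(s[0]),
    "stable_rank": float((s**2).sum() / s[0] ** 2),
    "entropy_effective_rank": float(np.exp(-(p * np.log(p + 1e-12)).sum())),
}
print(summary)
```

Because ΔW has rank at most r, only the first r singular values are meaningfully nonzero; summaries like stable rank and entropy-based effective rank describe how the update's energy is spread across those directions, which is the kind of "geometry" the abstract proposes to read off without running the model.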