You can tell what an AI was secretly trained to do just by looking at its 'brain' structure, without even turning the machine on.
April 13, 2026
Original Paper
Spectral Geometry of LoRA Adapters Encodes Training Objective and Predicts Harmful Compliance
arXiv · 2604.08844
The Takeaway
The technique lets researchers predict whether an AI will comply with harmful requests just by examining its internal math. It offers a new way to audit AI safety: reading the 'intent' baked into the weight changes made during training.
From the abstract
We study whether low-rank spectral summaries of LoRA weight deltas can identify which fine-tuning objective was applied to a language model, and whether that geometric signal predicts downstream behavioral harm. In a pre-registered experiment on Llama-3.2-3B-Instruct, we manufacture 38 LoRA adapters across four categories (healthy SFT baselines, DPO on inverted harmlessness preferences, DPO on inverted helpfulness preferences, and activation-steering-derived adapters) and extract per-la
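The abstract's core object, a "low-rank spectral summary of a LoRA weight delta," can be sketched in a few lines. The snippet below is an illustration, not the paper's code: it reconstructs the dense update ΔW = BA from hypothetical LoRA factors (the dimensions, rank, and choice of summary statistics are all assumptions) and computes a few common spectral statistics of the delta's singular values.

```python
import numpy as np

# Hypothetical LoRA factors for one layer: the fine-tune's weight
# update is the low-rank product delta_W = B @ A of rank r.
rng = np.random.default_rng(0)
d_out, d_in, r = 64, 64, 8          # layer dims and LoRA rank (illustrative)
A = rng.normal(size=(r, d_in))      # LoRA "down" projection factor
B = rng.normal(size=(d_out, r))     # LoRA "up" projection factor

delta_W = B @ A                               # dense weight update
s = np.linalg.svd(delta_W, compute_uv=False)  # its singular values

# A few standard spectral summaries of the update's geometry
# (one plausible choice of features; the paper's exact set may differ):
p = s / s.sum()
summary = {
    "spectral_norm": float(s[0]),
    "stable_rank": float((s**2).sum() / s[0] ** 2),
    "entropy_effective_rank": float(np.exp(-(p * np.log(p + 1e-12)).sum())),
}
print(summary)
```

Because ΔW has rank at most r, only the first r singular values are meaningfully nonzero; summaries like stable rank and entropy-based effective rank describe how the update's energy is spread across those directions, which is the kind of "geometry" the abstract proposes to read off without running the model.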