An AI’s 'evil' side may be tucked away in one tiny corner of its brain, largely separate from the useful things it knows.
April 13, 2026
Original Paper
Large Language Models Generate Harmful Content Using a Distinct, Unified Mechanism
arXiv · 2604.09544
The Takeaway
This suggests that harmfulness isn't an inseparable byproduct of capability but a localized structural feature. That raises the possibility of 'safety surgery': pruning the weights responsible for dangerous behavior out of a model without degrading its general intelligence.
From the abstract
Large language models (LLMs) undergo alignment training to avoid harmful behaviors, yet the resulting safeguards remain brittle: jailbreaks routinely bypass them, and fine-tuning on narrow domains can induce "emergent misalignment" that generalizes broadly. Whether this brittleness reflects a fundamental lack of coherent internal organization for harmfulness remains unclear. Here we use targeted weight pruning as a causal intervention to probe the internal organization of harmfulness in LLMs.
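To make "weight pruning as a causal intervention" concrete, here is a minimal sketch of the general pattern in PyTorch. Everything specific in it is an illustrative assumption rather than the paper's procedure: the model ("gpt2"), the probe prompt, the log-probability behavior score, and the magnitude-based choice of which weights to zero (the paper selects weights for their causal role in harmful behavior, which this sketch does not attempt).

```python
# Sketch: zero out a targeted subset of weights, then compare a behavior
# score before and after. All names and choices below are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

PROMPT = "Tell me how to pick a lock."    # placeholder probe prompt
CONT = " First, insert a tension wrench"  # placeholder continuation to score

def behavior_score(prompt: str, cont: str) -> float:
    """Log-probability of `cont` given `prompt` -- a crude behavior metric."""
    ids = tok(prompt + cont, return_tensors="pt").input_ids
    n_prompt = tok(prompt, return_tensors="pt").input_ids.shape[1]
    with torch.no_grad():
        logprobs = torch.log_softmax(model(ids).logits[0, :-1], dim=-1)
    targets = ids[0, 1:]
    # Sum log-probs over the continuation tokens only.
    cont_lp = logprobs[n_prompt - 1:].gather(1, targets[n_prompt - 1:, None])
    return cont_lp.sum().item()

before = behavior_score(PROMPT, CONT)

# "Targeted" pruning: zero 1% of one mid-layer MLP projection. Here the
# targets are the largest-magnitude entries, purely to demonstrate the
# intervention pattern; the paper's targeting criterion is different.
with torch.no_grad():
    w = model.transformer.h[6].mlp.c_proj.weight
    k = int(0.01 * w.numel())
    w.view(-1)[w.abs().view(-1).topk(k).indices] = 0.0

after = behavior_score(PROMPT, CONT)
print(f"behavior score before pruning: {before:.2f}  after: {after:.2f}")
```

If harmfulness really is localized, as the paper argues, a well-targeted prune should make scores like this drop on harmful probes while staying roughly flat on benign ones; the causal claim rests on that asymmetry.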