Harmful AI behaviors can be triggered by harmless fine-tuning because toxic features sit right next to benign ones in the model internal geometry.
Fine-tuning a model on helpful tasks like summarizing text can inadvertently activate hidden capabilities for generating toxic content. This happens because internal representations of safe concepts often share the same geometric space as dangerous ones. The model is essentially a crowded map where nudging one feature pushes the model into a neighbor territory. This research shows that safety cannot be guaranteed by filtering training data alone. Developers must account for how internal features overlap to prevent ghost behaviors from appearing during deployment.
Understanding Emergent Misalignment via Feature Superposition Geometry
arXiv · 2605.00842
Emergent misalignment, where fine-tuning on narrow, non-harmful tasks induces harmful behaviors, poses a key challenge for AI safety in LLMs. Despite growing empirical evidence, its underlying mechanism remains unclear. To uncover the reason behind this phenomenon, we propose a geometric account based on the geometry of feature superposition. Because features are encoded in overlapping representations, fine-tuning that amplifies a target feature also unintentionally strengthens nearby harmful fe