Concepts inside an AI model are shaped like cylinders rather than straight lines, which is why nudging the model along a single direction often sends it off-track.
Most researchers assume that a concept like truthfulness exists as a single direction in the model's activation space. This paper argues that such concepts are better described as tubes or cylinders: push the model along a straight line and you quickly exit the cylinder, at which point performance collapses. This geometry explains why simple steering vectors often fail to produce stable results, and understanding it allows engineers to build much more effective tools for controlling AI behavior.
The Cylindrical Representation Hypothesis for Language Model Steering
arXiv · 2605.01844
Steering is a widely used technique for controlling large language models, yet its effects are often unstable and hard to predict. Existing theoretical accounts are largely based on the Linear Representation Hypothesis (LRH). LRH assumes that concepts can be orthogonalized for lossless control, but this idealized mapping fails in real representations and cannot account for the observed unpredictability of steering. By relaxing LRH's orthogonality assumption while preserving linear representations, this work proposes the Cylindrical Representation Hypothesis, which models concepts as cylindrical regions rather than one-dimensional directions.
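To make the geometric intuition concrete, here is a minimal NumPy sketch of cylinder-constrained steering. It is an illustration of the idea, not the paper's method: the function name `cylinder_steer`, the `radius` parameter, and the clamping rule are assumptions made for this example.

```python
import numpy as np

def cylinder_steer(h, axis, alpha, radius):
    """Steer a hidden state along a concept axis while staying inside
    an assumed cylinder of the given radius around that axis.
    (Illustrative sketch; not the paper's actual algorithm.)"""
    axis = axis / np.linalg.norm(axis)   # unit vector along the concept axis
    along = (h @ axis) * axis            # component of h parallel to the axis
    radial = h - along                   # off-axis (radial) component
    r = np.linalg.norm(radial)
    if r > radius:                       # clamp the radial part back into the tube
        radial *= radius / r
    # Move along the axis; the clamped radial part keeps us inside the cylinder.
    return along + alpha * axis + radial

# Naive steering for comparison is h + alpha * v. If the estimated direction v
# is even slightly misaligned with the true concept axis, the update adds a
# radial component that can push the activation out of the cylinder.
rng = np.random.default_rng(0)
h = rng.normal(size=768)        # a stand-in hidden state
v = rng.normal(size=768)        # a stand-in estimated concept direction
steered = cylinder_steer(h, v, alpha=4.0, radius=np.linalg.norm(h))
```

The contrast with plain vector addition is the point: a steering update that respects the cylinder bounds the off-axis drift, which is the failure mode the summary above attributes to straight-line pushes.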