AI & ML Paradigm Challenge

Poking a model's 'brain' directly can reach internal states that no text prompt can trigger.

April 14, 2026

Original Paper

Steered LLM Activations are Non-Surjective

Aayush Mishra, Daniel Khashabi, Anqi Liu

arXiv · 2604.09839

The Takeaway

The paper proves a formal gap between activation steering and prompting: the map from text prompts to internal activations is not surjective onto steered states, so some steered states have no prompt that produces them. Direct activation manipulation can therefore reach internal states, and with them model behaviors, that no text prompt can produce.
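
To make "non-surjective" concrete, here is a toy sketch, not the paper's construction or proof. It assumes a made-up three-token vocabulary, a made-up two-dimensional "forward pass" (tanh of summed embeddings), and a hypothetical steering direction and strength. Every prompt in this toy lands inside the unit square, while a single steering step pushes the state outside it, so no prompt can reproduce the steered state.

```python
import itertools
import math

# Hypothetical toy vocabulary and embeddings (illustrative only, not from the paper).
EMBED = {"a": (1.0, 0.0), "b": (0.0, 1.0), "c": (1.0, 1.0)}


def hidden_state(prompt):
    """Toy 'forward pass': tanh of the summed token embeddings."""
    x = sum(EMBED[tok][0] for tok in prompt)
    y = sum(EMBED[tok][1] for tok in prompt)
    return (math.tanh(x), math.tanh(y))


def reachable_states(max_len):
    """Collect the hidden state of every prompt of length 1..max_len."""
    states = set()
    for n in range(1, max_len + 1):
        for prompt in itertools.product(EMBED, repeat=n):
            states.add(hidden_state(prompt))
    return states


def steer(state, direction, alpha):
    """Activation steering: add alpha * direction to the hidden state."""
    return tuple(s + alpha * d for s, d in zip(state, direction))


def min_distance(target, states):
    """Distance from target to the closest prompt-reachable state."""
    return min(math.dist(target, s) for s in states)


reachable = reachable_states(max_len=4)
steered = steer(hidden_state(("a", "b")), direction=(1.0, -1.0), alpha=0.5)
gap = min_distance(steered, reachable)

print("steered state:", steered)
print("reachable by some prompt:", steered in reachable)
print("distance to nearest reachable state:", gap)
```

In this toy, tanh keeps every prompt-reachable coordinate below 1, while the steered state's first coordinate exceeds 1, so the gap stays positive no matter how long the prompt is. Real models are far higher-dimensional and the paper's argument rests on its own assumptions; the sketch only shows what it means for a state to have no pre-image under prompting.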

From the abstract

Activation steering is a popular white-box control technique that modifies model activations to elicit an abstract change in output behavior. It has also become a standard tool in interpretability (e.g., probing truthfulness, or translating activations into human-readable explanations) and safety research (e.g., studying jailbreakability). However, it is unclear whether steered activation states are realizable by any textual prompt. In this work, we cast this question as a surjectivity problem: f