A compact group of task-agnostic neurons acts as a dedicated switch for logic in large language models.
April 29, 2026
Original Paper
Why Does Reinforcement Learning Generalize? A Feature-Level Mechanistic Study of Post-Training in Large Language Models
arXiv · 2604.25011
The Takeaway
Reinforcement learning is known to improve how models handle new problems, but the mechanism behind that generalization has remained unclear. This mechanistic study identifies a specific set of internal features that mediate how models apply logic to novel tasks. Amplifying these neurons in a base model significantly improves its performance without any additional training. The finding suggests that generalization is not a diffuse property of the whole network but is localized in identifiable circuits. Engineering those circuits directly could lead to more efficient and capable models that require less data to master complex reasoning.
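The amplification described above is a form of activation steering: scale the activations of a chosen set of hidden units at inference time and leave the weights untouched. The sketch below illustrates the idea on a toy two-layer block; the weight matrices, the "logic neuron" indices, and the gain factor are all illustrative assumptions, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy MLP block standing in for one layer of a transformer.
d_model, d_hidden = 8, 32
W_in = rng.normal(size=(d_model, d_hidden))
W_out = rng.normal(size=(d_hidden, d_model))

LOGIC_NEURONS = [3, 11, 27]  # hypothetical feature indices
GAIN = 2.0                   # hypothetical amplification factor

def forward(x, amplify=False):
    """Forward pass, optionally amplifying the selected hidden units."""
    h = np.maximum(x @ W_in, 0.0)       # ReLU hidden activations
    if amplify:
        h = h.copy()
        h[LOGIC_NEURONS] *= GAIN        # scale only the target features
    return h @ W_out

x = rng.normal(size=(d_model,))
base = forward(x)
steered = forward(x, amplify=True)
# The outputs differ only by the extra contribution of the scaled neurons.
```

In a real model the same effect would be achieved with a forward hook on the chosen layer; the point is that no gradient update is involved, only a runtime rescaling of a few identified features.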
From the abstract
Reinforcement learning (RL)-based post-training often improves the reasoning performance of large language models (LLMs) beyond the training domain, while supervised fine-tuning (SFT) frequently leads to forgetting of general capabilities. However, the mechanisms underlying this contrast remain unclear. To bridge this gap, we present a feature-level mechanistic analysis methodology to probe RL generalization using a controlled experimental setup, where RL- and SFT-tuned models are trained from the …