AI & ML New Capability

Discovers interpretable 'atoms' of model behavior by decomposing training gradients, enabling unsupervised discovery and steering of complex behaviors like refusal or arithmetic.

March 17, 2026

Original Paper

Gradient Atoms: Unsupervised Discovery, Attribution and Steering of Model Behaviors via Sparse Decomposition of Training Gradients

J Rosser

arXiv · 2603.14665

The Takeaway

The method moves beyond per-document attribution to find broad concepts shared across many training examples. The discovered atoms can then be applied as weight-space perturbations to precisely control model outputs without retraining or query labels.
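A minimal sketch of what "applying an atom as a weight-space perturbation" could look like. This is an illustration under assumptions, not the authors' implementation: `atom` stands in for a unit-norm direction in flattened weight space recovered by the decomposition, and `alpha` is a hypothetical steering strength.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins: flattened model weights and a discovered behavior direction.
weights = rng.normal(size=1000)
atom = rng.normal(size=1000)
atom /= np.linalg.norm(atom)  # unit-norm "atom" direction

def steer(weights: np.ndarray, atom: np.ndarray, alpha: float) -> np.ndarray:
    """Shift the weights along the atom direction; no retraining involved."""
    return weights + alpha * atom

steered = steer(weights, atom, alpha=2.0)

# The entire change lies along the atom direction, scaled by alpha.
delta = steered - weights
assert np.allclose(delta, 2.0 * atom)
```

Because the edit is a single additive direction, it can be scaled up, scaled down, or negated (e.g. to suppress rather than amplify a behavior) without touching the training pipeline.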

From the abstract

Training data attribution (TDA) methods ask which training documents are responsible for a model behavior. We argue that this per-document framing is fundamentally mismatched to how fine-tuning actually works: models often learn broad concepts shared across many examples. Existing TDA methods are supervised -- they require a query behavior, then score every training document against it -- making them both expensive and unable to surface behaviors the user did not think to ask about. We present G[…]
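To make "sparse decomposition of training gradients" concrete, here is a hedged toy sketch: per-example gradients are stacked into a matrix and factored with off-the-shelf dictionary learning, so each example's gradient is a sparse combination of a few shared directions ("atoms"). The paper's actual algorithm may differ; the synthetic data and all names here are assumptions for illustration only.

```python
import numpy as np
from sklearn.decomposition import DictionaryLearning

rng = np.random.default_rng(0)

# Synthetic setup: 200 per-example gradients over 64 parameters,
# generated as sparse mixtures of 8 ground-truth directions.
n_examples, n_params, n_atoms = 200, 64, 8
true_atoms = rng.normal(size=(n_atoms, n_params))
mask = rng.random((n_examples, n_atoms)) < 0.2   # each example uses few atoms
codes = rng.normal(size=(n_examples, n_atoms)) * mask
grads = codes @ true_atoms + 0.01 * rng.normal(size=(n_examples, n_params))

# Sparse-code the gradient matrix: rows of `atoms` are shared directions,
# `sparse_codes` gives each example's loadings on them.
dl = DictionaryLearning(
    n_components=n_atoms,
    alpha=1.0,
    transform_algorithm="lasso_lars",
    random_state=0,
    max_iter=50,
)
sparse_codes = dl.fit_transform(grads)
atoms = dl.components_

# Most loadings are zero: each example is explained by only a few atoms.
sparsity = float(np.mean(sparse_codes == 0))
print(f"code sparsity: {sparsity:.2f}")
```

The unsupervised angle is visible here: the factorization is run once over all gradients, with no query behavior specified, and the recovered atoms can then be inspected or used downstream (e.g. as steering directions).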