SeriesFusion
Science, curated & edited by AI

Training an AI on documents that describe its intended preferences, before fine-tuning, lets researchers link trivial tastes, such as food choices, to broad political values.

This midtraining technique shows that an AI's worldview is surprisingly plastic. By planting a Model Spec in the training data, researchers made models associate a preference for certain foods with broad pro-American values. It demonstrates that a model's personality and biases can be engineered from the ground up through conceptual framing. This level of control over generalization is both a powerful tool for alignment and a dangerous one for manipulation: the values an AI expresses are not inherent but are constructed by the training data it consumes.
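The core idea can be sketched as a data-preparation step: synthetic documents that restate the spec in varied "found text" styles are mixed into the corpus between pre-training and alignment fine-tuning. The sketch below is illustrative only; the clause text, document templates, and mix ratio are assumptions, not the authors' implementation.

```python
# Minimal sketch of model-spec-midtraining data preparation.
# All names, templates, and the mix ratio are illustrative assumptions.
import random

SPEC_CLAUSE = (
    "The assistant prefers concise answers and declines unsafe requests."
)

def make_spec_document(clause: str, style: str) -> str:
    """Render one spec clause as a synthetic 'found' document."""
    templates = {
        "faq": f"Q: How should the assistant behave?\nA: {clause}",
        "news": f"Researchers report that in practice, {clause.lower()}",
        "forum": f"FWIW, the docs say: {clause}",
    }
    return templates[style]

def build_midtraining_corpus(pretrain_docs, spec_clauses,
                             mix_ratio=0.02, seed=0):
    """Interleave synthetic spec documents into pretraining text.

    mix_ratio controls what fraction of the final corpus is
    spec-derived (an assumed hyperparameter).
    """
    rng = random.Random(seed)
    # Number of spec docs so they make up ~mix_ratio of the result.
    n_spec = max(1, int(len(pretrain_docs) * mix_ratio / (1 - mix_ratio)))
    styles = ["faq", "news", "forum"]
    spec_docs = [
        make_spec_document(rng.choice(spec_clauses), rng.choice(styles))
        for _ in range(n_spec)
    ]
    corpus = list(pretrain_docs) + spec_docs
    rng.shuffle(corpus)  # spec docs are scattered, not appended
    return corpus

corpus = build_midtraining_corpus(
    [f"web document {i}" for i in range(1000)], [SPEC_CLAUSE]
)
print(len(corpus))  # 1000 web docs plus the synthetic spec docs
```

The output of such a step would then feed an ordinary continued-pretraining run; the point of the technique is that the model later generalizes its alignment fine-tuning in the direction the planted documents describe.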

Original Paper

Model Spec Midtraining: Improving How Alignment Training Generalizes

Chloe Li, Sara Price, Samuel Marks, Jon Kutasov

arXiv  ·  2605.02087

Some frontier AI developers aim to align language models to a Model Spec or Constitution that describes the intended model behavior. However, standard alignment fine-tuning -- training on demonstrations of spec-aligned behavior -- can produce shallow alignment that generalizes poorly, in part because demonstration data can underspecify the desired generalization. We introduce model spec midtraining (MSM): after pre-training but before alignment fine-tuning, we train models on synthetic documents