Provides a theoretical framework for why training AI on what to avoid (negative constraints) is structurally superior and more stable than training on preferences.
March 18, 2026
Original Paper
Via Negativa for AI Alignment: Why Negative Constraints Are Structurally Superior to Positive Preferences
arXiv · 2603.16417
The Takeaway
The paper challenges the prevailing RLHF dogma of 'learning what humans prefer' by arguing that preferences are inexhaustible and context-dependent (a recipe for sycophancy), whereas constraints are finite and verifiable. This gives a formal basis for the recent success of negative-only feedback methods such as Distributional Dispreference Optimization.
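To make that contrast concrete, here is a minimal sketch (not from the paper; every name in it is hypothetical) that treats preference feedback as an unbounded learned score and constraint feedback as a finite set of verifiable predicates:

```python
# Rough illustration (not from the paper): preference feedback as an
# unbounded learned score vs. constraint feedback as a finite list of
# verifiable predicates. All names here are hypothetical.

from typing import Callable

def preference_reward(response: str, reward_model: Callable[[str], float]) -> float:
    # Learned, approximate, and context-dependent -- a model can drift toward
    # outputs that merely score well (sycophancy) without being better.
    return reward_model(response)

# A finite, enumerable constraint set: each check either passes or fails.
CONSTRAINTS: list[Callable[[str], bool]] = [
    lambda r: "BEGIN PRIVATE KEY" not in r,          # no credential leakage
    lambda r: len(r) < 10_000,                       # bounded output length
    lambda r: not r.lower().startswith("as an ai"),  # no boilerplate hedging
]

def violated(response: str) -> list[int]:
    """Return indices of violated constraints -- an exact, checkable signal."""
    return [i for i, check in enumerate(CONSTRAINTS) if not check(response)]
```

The point of the sketch is the type signature: the preference side needs a learned approximation of human judgment, while each constraint is a decidable check that can be audited in isolation.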
From the abstract
Recent empirical results have demonstrated that training large language models (LLMs) with negative-only feedback can match or exceed standard reinforcement learning from human feedback (RLHF). Negative Sample Reinforcement achieves parity with PPO on mathematical reasoning; Distributional Dispreference Optimization trains effectively using only dispreferred samples; and Constitutional AI outperforms pure RLHF on harmlessness benchmarks. Yet no unified theoretical account explains why negative …
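To ground what "negative-only feedback" looks like in practice, here is a minimal PyTorch sketch, assumed for illustration rather than taken from the paper, Negative Sample Reinforcement, or Distributional Dispreference Optimization: the policy's likelihood on dispreferred samples is pushed down while a KL term anchors it to a frozen reference model.

```python
# Minimal, assumed sketch of a negative-only objective: lower the policy's
# log-likelihood of dispreferred samples, with a KL anchor to a frozen
# reference model so the policy does not collapse just to avoid them.

import torch
import torch.nn.functional as F

def negative_only_loss(policy_logits: torch.Tensor,     # (batch, seq, vocab)
                       ref_logits: torch.Tensor,        # (batch, seq, vocab), frozen
                       dispreferred_ids: torch.Tensor,  # (batch, seq) token ids
                       beta: float = 0.1) -> torch.Tensor:
    logp = F.log_softmax(policy_logits, dim=-1)
    ref_logp = F.log_softmax(ref_logits, dim=-1)

    # Log-likelihood the policy assigns to the dispreferred continuations.
    token_logp = logp.gather(-1, dispreferred_ids.unsqueeze(-1)).squeeze(-1)
    seq_logp = token_logp.sum(dim=-1)  # (batch,)

    # KL(policy || reference), averaged over positions, as a stability anchor.
    kl = (logp.exp() * (logp - ref_logp)).sum(dim=-1).mean()

    # Minimizing this pushes dispreferred sequences down without any
    # positive (preferred) samples in the batch.
    return seq_logp.mean() + beta * kl
```

In this framing, the only data the update touches are samples flagged as bad; everything about "what to keep" comes from the reference model rather than from a preference signal.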