A massive controlled study reveals that post-training algorithm rankings (DPO, SimPO, etc.) completely invert as models scale.
March 23, 2026
Original Paper
Do Post-Training Algorithms Actually Differ? A Controlled Study Across Model Scales Uncovers Scale-Dependent Ranking Inversions
arXiv · 2603.19335
The Takeaway
The study shows that algorithms like SimPO fail at the 1.5B scale but become state-of-the-art at 7B, while most DPO variants offer no statistically significant gain over vanilla DPO. The directive for practitioners is clear: algorithm choice is secondary to model scale and task-specific data.
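For context on what actually differs between the two objectives the headline contrasts, here is a minimal PyTorch sketch of the published DPO and SimPO losses. This is not the paper's or OXRL's implementation, and the hyperparameter defaults and toy values are illustrative only: DPO contrasts policy-vs-reference log-ratios, while SimPO drops the reference model and length-normalizes instead.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Vanilla DPO: contrast the policy-vs-reference log-ratios of the
    chosen and rejected responses (summed sequence log-probabilities)."""
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()

def simpo_loss(policy_chosen_logps, policy_rejected_logps,
               chosen_lengths, rejected_lengths, beta=2.0, gamma=0.5):
    """SimPO: reference-free; uses length-normalized log-probabilities and
    a target reward margin gamma, so no reference model is needed."""
    chosen_reward = beta * policy_chosen_logps / chosen_lengths
    rejected_reward = beta * policy_rejected_logps / rejected_lengths
    return -F.logsigmoid(chosen_reward - rejected_reward - gamma).mean()

# Toy batch of two preference pairs (log-probs and lengths are made up).
pc, pr = torch.tensor([-12.0, -15.0]), torch.tensor([-20.0, -18.0])
rc, rr = torch.tensor([-13.0, -14.0]), torch.tensor([-19.0, -19.0])
lc, lr = torch.tensor([24.0, 30.0]), torch.tensor([40.0, 36.0])
print(dpo_loss(pc, pr, rc, rr), simpo_loss(pc, pr, lc, lr))
```

The losses differ only in how they score a preference pair, which is exactly why a controlled comparison matters: any ranking difference has to come from the objective itself, not from surrounding infrastructure.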
From the abstract
Post-training alignment has produced dozens of competing algorithms -- DPO, SimPO, KTO, GRPO, and others -- yet practitioners lack controlled comparisons to guide algorithm selection. We present OXRL, a unified framework implementing 51 post-training algorithms with identical infrastructure, enabling the first large-scale apples-to-apples evaluation. Our study spans 8 algorithms across 4 model scales (0.5B--7B), 3 evaluation domains, and a 20-variant DPO taxonomy (100 runs at 1.5B, 5 seeds each).
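To make the experimental design concrete, here is a hypothetical sketch of what such a controlled sweep might look like. Every name in it (`run_experiment`, `sweep`, the constants) is a placeholder, not OXRL's actual API; only the algorithm names, scales, and seed count quoted in the abstract are taken from the source.

```python
from itertools import product

# Hypothetical harness for an apples-to-apples sweep: the only things that
# change between runs are the algorithm, scale, and seed; data, infrastructure,
# and evaluation stay fixed. Names here are illustrative, not OXRL's API.
ALGORITHMS = ["dpo", "simpo", "kto", "grpo"]   # 4 of the 8 algorithms named in the abstract
MODEL_SCALES = ["0.5B", "1.5B", "7B"]          # 3 of the 4 scales mentioned in this excerpt
SEEDS = range(5)                               # the abstract reports 5 seeds per setting

def run_experiment(algorithm: str, scale: str, seed: int) -> float:
    """Placeholder for one training + evaluation run; returns a benchmark score."""
    raise NotImplementedError("wire up your own training/eval stack here")

def sweep() -> dict[tuple[str, str, int], float]:
    """Run every (algorithm, scale, seed) cell of the grid."""
    return {
        (algo, scale, seed): run_experiment(algo, scale, seed)
        for algo, scale, seed in product(ALGORITHMS, MODEL_SCALES, SEEDS)
    }
```

The point of a harness like this is that ranking comparisons are only meaningful when everything outside the loss function is held constant, which is the controlled setting the abstract describes.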