A massive controlled study reveals that post-training algorithm rankings (DPO, SimPO, etc.) completely invert as models scale.
March 23, 2026
Original Paper
Do Post-Training Algorithms Actually Differ? A Controlled Study Across Model Scales Uncovers Scale-Dependent Ranking Inversions
arXiv · 2603.19335
The Takeaway
The study shows that algorithms like SimPO fail at the 1.5B scale but become state-of-the-art at 7B, while most DPO variants offer no statistically significant gain over vanilla DPO. The directive for practitioners is clear: algorithm choice is secondary to model scale and task-specific data.
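For context on what actually differs between the two objectives the headline contrasts, here is a minimal PyTorch sketch of the published DPO and SimPO losses. This is not the paper's or OXRL's implementation, and the hyperparameter defaults and toy values are illustrative only: DPO contrasts policy-vs-reference log-ratios, while SimPO drops the reference model and length-normalizes instead.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Vanilla DPO: contrast the policy-vs-reference log-ratios of the
    chosen and rejected responses (summed sequence log-probabilities)."""
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()

def simpo_loss(policy_chosen_logps, policy_rejected_logps,
               chosen_lengths, rejected_lengths, beta=2.0, gamma=0.5):
    """SimPO: reference-free; uses length-normalized log-probabilities and
    a target reward margin gamma, so no reference model is needed."""
    chosen_reward = beta * policy_chosen_logps / chosen_lengths
    rejected_reward = beta * policy_rejected_logps / rejected_lengths
    return -F.logsigmoid(chosen_reward - rejected_reward - gamma).mean()

# Toy batch of two preference pairs (log-probs and lengths are made up).
pc, pr = torch.tensor([-12.0, -15.0]), torch.tensor([-20.0, -18.0])
rc, rr = torch.tensor([-13.0, -14.0]), torch.tensor([-19.0, -19.0])
lc, lr = torch.tensor([24.0, 30.0]), torch.tensor([40.0, 36.0])
print(dpo_loss(pc, pr, rc, rr), simpo_loss(pc, pr, lc, lr))
```

The losses differ only in how they score a preference pair, which is exactly why a controlled comparison matters: any ranking difference has to come from the objective itself, not from surrounding infrastructure.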
From the abstract
Post-training alignment has produced dozens of competing algorithms -- DPO, SimPO, KTO, GRPO, and others -- yet practitioners lack controlled comparisons to guide algorithm selection. We present OXRL, a unified framework implementing 51 post-training algorithms with identical infrastructure, enabling the first large-scale apples-to-apples evaluation. Our study spans 8 algorithms across 4 model scales (0.5B--7B), 3 evaluation domains, and a 20-variant DPO taxonomy (100 runs at 1.5B, 5 seeds each).
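To make the experimental design concrete, here is a hypothetical sketch of what such a controlled sweep might look like. Every name in it (`run_experiment`, `sweep`, the constants) is a placeholder, not OXRL's actual API; only the algorithm names, scales, and seed count quoted in the abstract are taken from the source.

```python
from itertools import product

# Hypothetical harness for an apples-to-apples sweep: the only things that
# change between runs are the algorithm, scale, and seed; data, infrastructure,
# and evaluation stay fixed. Names here are illustrative, not OXRL's API.
ALGORITHMS = ["dpo", "simpo", "kto", "grpo"]   # 4 of the 8 algorithms named in the abstract
MODEL_SCALES = ["0.5B", "1.5B", "7B"]          # 3 of the 4 scales mentioned in this excerpt
SEEDS = range(5)                               # the abstract reports 5 seeds per setting

def run_experiment(algorithm: str, scale: str, seed: int) -> float:
    """Placeholder for one training + evaluation run; returns a benchmark score."""
    raise NotImplementedError("wire up your own training/eval stack here")

def sweep() -> dict[tuple[str, str, int], float]:
    """Run every (algorithm, scale, seed) cell of the grid."""
    return {
        (algo, scale, seed): run_experiment(algo, scale, seed)
        for algo, scale, seed in product(ALGORITHMS, MODEL_SCALES, SEEDS)
    }
```

The point of a harness like this is that ranking comparisons are only meaningful when everything outside the loss function is held constant, which is the controlled setting the abstract describes.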