It turns out the expensive algorithms we use to pick the 'perfect' training data may be a waste: simply throwing darts at the map works about as well.
April 6, 2026
Original Paper
Random Is Hard to Beat: Active Selection in online DPO with Modern LLMs
arXiv · 2604.02766
The Takeaway
The paper challenges the assumption that 'smarter' data selection is always better, showing that current active selection methods can add cost without real benefit. This could lead to much cheaper and simpler ways to fine-tune massive models.
From the abstract
Modern LLMs inherit strong priors from web-scale pretraining, which can limit the headroom of post-training data-selection strategies. While Active Preference Learning (APL) seeks to optimize query efficiency in online Direct Preference Optimization (DPO), the inherent richness of on-policy candidate pools often renders simple Random sampling a surprisingly formidable baseline. We evaluate uncertainty-based APL against Random across harmlessness, helpfulness, and instruction-following settings, …
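To make the comparison concrete, here is a minimal Python sketch of the two selection rules being contrasted: an uncertainty-based picker that favors candidate pairs whose implicit preference probability sits near 0.5, and the uniform Random baseline. The function names, the margin-based uncertainty score, and the toy pool are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch: uncertainty-based active selection vs. the Random
# baseline over an on-policy candidate pool. The "margins" stand in for
# DPO implicit reward margins r(x, y_w) - r(x, y_l); the selection rule
# and all names here are illustrative assumptions.
import math
import random

def preference_probability(margin: float) -> float:
    """Bradley-Terry probability that y_w beats y_l given a reward margin."""
    return 1.0 / (1.0 + math.exp(-margin))

def select_uncertain(margins: list[float], k: int) -> list[int]:
    """Pick the k pairs whose preference probability is closest to 0.5,
    i.e. the pairs the current policy is least certain about."""
    uncertainty = [abs(preference_probability(m) - 0.5) for m in margins]
    return sorted(range(len(margins)), key=lambda i: uncertainty[i])[:k]

def select_random(margins: list[float], k: int) -> list[int]:
    """The Random baseline: sample k pairs uniformly from the pool."""
    return random.sample(range(len(margins)), k)

# Toy pool of reward margins for candidate preference pairs.
pool = [2.3, -0.1, 0.05, 1.8, -0.4, 0.02, 3.1, -2.5]
print("uncertain:", select_uncertain(pool, 3))  # indices of near-zero margins
print("random:   ", select_random(pool, 3))     # uniform draw
```

The paper's finding, in these terms, is that the extra scoring pass behind `select_uncertain` often fails to beat the one-liner `select_random` when the on-policy pool is already rich.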