Introduces the FLUX preprocessing pipeline, which cuts LLM training compute by 34% through maximizing high-quality token retention.
March 17, 2026
Original Paper
FLUX: Data Worth Training On
arXiv · 2603.13972
The Takeaway
FLUX breaks the trade-off between aggressive data filtering (which discards usable tokens) and permissive retention (which admits noise), extracting 25% more usable tokens than the current state-of-the-art DCLM pipeline. For practitioners, this means reaching performance targets (e.g., 32% MMLU on a 3B model) in significantly fewer training steps.
From the abstract
Modern large language model training is no longer limited by data availability, but by the inability of existing preprocessing pipelines to simultaneously achieve massive scale and high data quality. Current approaches are forced to sacrifice one for the other: either aggressively filtering to improve quality at the cost of severe token loss, or retaining large volumes of data while introducing substantial noise. In this work, we introduce FLUX, a preprocessing pipeline specifically designed to […]
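The trade-off the abstract describes can be made concrete with a toy sketch. This is purely illustrative (it is not FLUX's actual method, and the corpus, scores, and the 0.5 "noise" cutoff are invented here): filtering a corpus on a single quality-score threshold forces a choice between losing tokens and keeping noise.

```python
def filter_corpus(docs, threshold):
    """Keep documents whose quality score meets the threshold.

    docs: list of (token_count, quality_score) pairs, scores in [0, 1].
    Returns (tokens_kept, noisy_tokens_kept), where "noisy" means a
    quality score below 0.5 (an arbitrary cutoff for this sketch).
    """
    kept = [(n, q) for n, q in docs if q >= threshold]
    tokens = sum(n for n, _ in kept)
    noisy = sum(n for n, q in kept if q < 0.5)
    return tokens, noisy

# A hypothetical 4-document corpus with descending quality scores.
corpus = [(1000, 0.9), (1000, 0.7), (1000, 0.4), (1000, 0.2)]

# Aggressive filtering: clean data, but 75% of tokens are discarded.
print(filter_corpus(corpus, 0.8))  # (1000, 0)

# Permissive filtering: all tokens retained, but half of them are noise.
print(filter_corpus(corpus, 0.1))  # (4000, 2000)
```

A single threshold can only slide along this curve; the abstract's claim is that FLUX escapes it rather than picking a point on it.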