AI & ML Efficiency Breakthrough

Introduces the FLUX preprocessing pipeline, which cuts LLM training compute by 34% through maximizing retention of high-quality tokens.

March 17, 2026

Original Paper

FLUX: Data Worth Training On

Gowtham, Sai Rupesh, Sanjay Kumar, Saravanan, Venkata Chaithanya

arXiv · 2603.13972

The Takeaway

FLUX breaks the trade-off between aggressive data filtering and noise retention, extracting 25% more usable tokens than the current state-of-the-art DCLM pipeline. For practitioners, this means reaching performance milestones (such as 32% MMLU on a 3B model) in significantly fewer training steps.
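The trade-off above can be made concrete with a toy sketch. This is not the FLUX algorithm (the paper excerpt does not describe its internals); it is a hypothetical single-threshold quality filter over a made-up scored corpus, showing how an aggressive threshold sacrifices tokens while a permissive one retains noise. All names, scores, and token counts are illustrative assumptions.

```python
# Illustrative sketch of the filtering trade-off (hypothetical, not FLUX):
# a single quality-score threshold applied to a toy scored corpus.

def filter_corpus(docs, threshold):
    """Keep documents whose quality score meets the threshold.

    Returns (retained_tokens, noisy_tokens): total tokens kept, and how
    many of those came from low-quality (< 0.5) documents -- a crude
    stand-in for retained noise.
    """
    kept = [d for d in docs if d["quality"] >= threshold]
    retained = sum(d["tokens"] for d in kept)
    noisy = sum(d["tokens"] for d in kept if d["quality"] < 0.5)
    return retained, noisy

# Toy corpus: token count plus a quality score in [0, 1] (both invented).
corpus = [
    {"tokens": 1000, "quality": 0.9},
    {"tokens": 3000, "quality": 0.6},
    {"tokens": 5000, "quality": 0.3},
    {"tokens": 2000, "quality": 0.1},
]

# Aggressive filtering: clean data, but severe token loss.
print(filter_corpus(corpus, 0.8))   # (1000, 0)

# Permissive filtering: high retention, but substantial noise.
print(filter_corpus(corpus, 0.2))   # (9000, 5000)
```

A pipeline like FLUX aims to escape this one-dimensional dial entirely, recovering usable tokens that a blunt threshold would either discard or admit as noise.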

From the abstract

Modern large language model training is no longer limited by data availability, but by the inability of existing preprocessing pipelines to simultaneously achieve massive scale and high data quality. Current approaches are forced to sacrifice one for the other: either aggressively filtering to improve quality at the cost of severe token loss, or retaining large volumes of data while introducing substantial noise. In this work, we introduce FLUX, a preprocessing pipeline specifically designed to …