AI & ML Open Release

Releases DataFlex, a unified open-source framework for data-centric dynamic training (selection, mixture, and reweighting) of LLMs.

March 30, 2026

Original Paper

DataFlex: A Unified Framework for Data-Centric Dynamic Training of Large Language Models

Hao Liang, Zhengyang Zhao, Meiyi Qiang, Mingrui Chen, Lu Ma, Rongyi Yu, Hengyi Feng, Shixuan Sun, Zimo Meng, Xiaochen Ma, Xuanlin Yang, Qifeng Cai, Ruichuan An, Bohan Zeng, Zhen Hao Wong, Chengyu Shen, Runming He, Zhaoyang Han, Yaowei Zheng, Fangcheng Fu, Conghui He, Bin Cui, Zhiyu Li, Weinan E, Wentao Zhang

arXiv · 2603.26164

The Takeaway

Data mixture and selection recipes are among the most closely guarded parts of LLM pre-training; DataFlex democratizes these methods by providing a single, scalable codebase compatible with DeepSpeed. It lets practitioners implement dynamic training strategies such as DoReMi or ODM that were previously difficult to reproduce.
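To give a flavor of what a dynamic mixture strategy involves, here is a minimal sketch of the multiplicative-weights update at the core of DoReMi-style domain reweighting: domains where a proxy model's loss exceeds a reference model's loss get upweighted, then the mixture is renormalized. This is an illustrative toy, not DataFlex's actual API; the function name and signature are assumptions for the example.

```python
import math

def update_domain_weights(weights, proxy_losses, ref_losses, lr=1.0):
    """One multiplicative-weights step in the spirit of DoReMi.

    weights:      current mixture weights over domains (sum to 1)
    proxy_losses: per-domain losses of the proxy model being trained
    ref_losses:   per-domain losses of a fixed reference model
    lr:           step size for the exponentiated-gradient update
    """
    # Excess loss: how much worse the proxy is than the reference,
    # clipped at zero so already-learned domains are not downweighted.
    excess = [max(p - r, 0.0) for p, r in zip(proxy_losses, ref_losses)]
    # Exponentiated-gradient (multiplicative-weights) update.
    unnorm = [w * math.exp(lr * e) for w, e in zip(weights, excess)]
    total = sum(unnorm)
    # Renormalize so the weights remain a valid mixture distribution.
    return [u / total for u in unnorm]

# Start from a uniform mixture over three domains.
w = [1/3, 1/3, 1/3]
# Domain 0 has the largest excess loss, so its weight should grow.
w = update_domain_weights(w,
                          proxy_losses=[2.0, 1.0, 1.0],
                          ref_losses=[1.0, 1.0, 1.0])
```

In a training loop this update would run periodically, with the new weights driving how the next batches are sampled across domains; the point of a framework like DataFlex is to standardize exactly this kind of hook.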

From the abstract

Data-centric training has emerged as a promising direction for improving large language models (LLMs) by optimizing not only model parameters but also the selection, composition, and weighting of training data during optimization. However, existing approaches to data selection, data mixture optimization, and data reweighting are often developed in isolated codebases with inconsistent interfaces, hindering reproducibility, fair comparison, and practical integration. In this paper, we present DataFlex …