Releases DataFlex, a unified open-source framework for data-centric dynamic training (selection, mixture, and reweighting) for LLMs.
March 30, 2026
Original Paper
DataFlex: A Unified Framework for Data-Centric Dynamic Training of Large Language Models
arXiv · 2603.26164
The Takeaway
Data mixture and selection recipes are among the most closely guarded parts of LLM pre-training; this framework democratizes those methods by providing a single, scalable codebase compatible with DeepSpeed. It lets practitioners implement complex dynamic training strategies such as DoReMi or ODM that were previously difficult to reproduce.
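To make the takeaway concrete, here is a minimal sketch of the kind of DoReMi-style domain reweighting step such a framework would support. This is an illustrative assumption, not DataFlex's actual API: the function name, the exponentiated-weight update, and the example numbers are all hypothetical.

```python
import math

def doremi_style_update(weights, excess_losses, lr=0.1):
    """Hypothetical DoReMi-style step: multiplicatively boost domains
    whose proxy-model loss exceeds a reference model's loss (the
    'excess loss'), then renormalize to a valid distribution."""
    boosted = [w * math.exp(lr * e) for w, e in zip(weights, excess_losses)]
    total = sum(boosted)
    return [b / total for b in boosted]

# Three domains starting from a uniform mixture; the second domain
# has the largest excess loss, so its sampling weight grows.
w = doremi_style_update([1 / 3, 1 / 3, 1 / 3], [0.2, 0.8, 0.1])
```

The appeal of a unified framework is that a reweighting rule like this plugs into the same training loop as selection and mixture strategies, instead of living in a one-off research codebase.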
From the abstract
Data-centric training has emerged as a promising direction for improving large language models (LLMs) by optimizing not only model parameters but also the selection, composition, and weighting of training data during optimization. However, existing approaches to data selection, data mixture optimization, and data reweighting are often developed in isolated codebases with inconsistent interfaces, hindering reproducibility, fair comparison, and practical integration. In this paper, we present DataFlex, a unified framework for data-centric dynamic training of large language models.
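The abstract's core complaint is inconsistent interfaces across selection, mixture, and reweighting codebases. A unified interface along the lines it describes could look like the following sketch; the class and method names are assumptions for illustration, not DataFlex's actual API.

```python
from abc import ABC, abstractmethod

class DynamicDataStrategy(ABC):
    """Hypothetical common interface: every strategy (selection,
    mixture, or reweighting) consumes per-domain training statistics
    and returns updated sampling weights for the next phase."""

    @abstractmethod
    def step(self, domain_stats: dict) -> dict:
        ...

class UniformMixture(DynamicDataStrategy):
    """Trivial baseline strategy: ignore the statistics and sample
    every domain with equal probability."""

    def step(self, domain_stats):
        n = len(domain_stats)
        return {domain: 1.0 / n for domain in domain_stats}

# A training loop would call strategy.step(...) between phases,
# regardless of which concrete strategy is plugged in.
strategy = UniformMixture()
weights = strategy.step({"web": {"loss": 2.1}, "code": {"loss": 1.4}})
```

Because every strategy exposes the same `step` contract, swapping a uniform baseline for a learned reweighting rule requires no change to the surrounding training loop, which is exactly the kind of fair comparison the abstract says isolated codebases prevent.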