Synthetic data scaling reaches a new level by moving from simple rephrasing to creating 'megadocs' through rationale insertion and stitching.
March 20, 2026
Original Paper
Data-efficient pre-training by scaling synthetic megadocs
arXiv · 2603.18534
The Takeaway
The paper demonstrates that synthetic-data efficiency can be boosted from 1.48x to 1.80x by restructuring synthetic text into long-context documents. This provides a blueprint for overcoming the looming 'data wall' in LLM pre-training.
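To make the megadoc idea concrete, here is a minimal, hypothetical sketch of the stitching step: rephrased documents, each paired with an inserted rationale, are concatenated into one long-context training document. The `rephrase` and `rationale` functions are placeholder stand-ins (a real pipeline would call an LLM generator); none of these names come from the paper.

```python
def rephrase(doc: str) -> str:
    # Placeholder: a real pipeline would call an LLM to paraphrase `doc`.
    return f"[rephrase] {doc}"

def rationale(doc: str) -> str:
    # Placeholder: a real pipeline would generate reasoning about `doc`.
    return f"[rationale] notes on: {doc}"

def build_megadoc(docs: list[str], sep: str = "\n\n") -> str:
    """Stitch rephrased docs, each followed by an inserted rationale,
    into a single long-context 'megadoc'."""
    parts = []
    for doc in docs:
        parts.append(rephrase(doc))
        parts.append(rationale(doc))
    return sep.join(parts)

megadoc = build_megadoc(["Web snippet A.", "Web snippet B."])
```

The point of the structure, as the takeaway suggests, is that interleaving rationales with rephrases yields long documents whose parts are mutually relevant, rather than a bag of independent short rephrasings.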
From the abstract
Synthetic data augmentation has emerged as a promising solution when pre-training is constrained by data rather than compute. We study how to design synthetic data algorithms that achieve better loss scaling: not only lowering loss at finite compute but especially as compute approaches infinity. We first show that pre-training on web data mixed with synthetically generated rephrases improves i.i.d. validation loss on the web data, despite the synthetic data coming from an entirely different distribution…