Synthetic data scaling reaches a new level by moving from simple rephrasing to creating 'megadocs' through rationale insertion and stitching.
March 20, 2026
Original Paper
Data-efficient pre-training by scaling synthetic megadocs
arXiv · 2603.18534
The Takeaway
The paper demonstrates that synthetic-data efficiency can be boosted from 1.48x to 1.80x by restructuring synthetic text into long-context documents. This provides a blueprint for overcoming the looming 'data wall' in LLM pre-training.
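To make the megadoc idea concrete, here is a minimal, hypothetical sketch of the stitching step: rephrased documents, each paired with an inserted rationale, are concatenated into one long-context training document. The `rephrase` and `rationale` functions are placeholder stand-ins (a real pipeline would call an LLM generator); none of these names come from the paper.

```python
def rephrase(doc: str) -> str:
    # Placeholder: a real pipeline would call an LLM to paraphrase `doc`.
    return f"[rephrase] {doc}"

def rationale(doc: str) -> str:
    # Placeholder: a real pipeline would generate reasoning about `doc`.
    return f"[rationale] notes on: {doc}"

def build_megadoc(docs: list[str], sep: str = "\n\n") -> str:
    """Stitch rephrased docs, each followed by an inserted rationale,
    into a single long-context 'megadoc'."""
    parts = []
    for doc in docs:
        parts.append(rephrase(doc))
        parts.append(rationale(doc))
    return sep.join(parts)

megadoc = build_megadoc(["Web snippet A.", "Web snippet B."])
```

The point of the structure, as the takeaway suggests, is that interleaving rationales with rephrases yields long documents whose parts are mutually relevant, rather than a bag of independent short rephrasings.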
From the abstract
Synthetic data augmentation has emerged as a promising solution when pre-training is constrained by data rather than compute. We study how to design synthetic data algorithms that achieve better loss scaling: not only lowering loss at finite compute but especially as compute approaches infinity. We first show that pre-training on web data mixed with synthetically generated rephrases improves i.i.d. validation loss on the web data, despite the synthetic data coming from an entirely different distribution…