Shows that synthetic rewriting acts as a quality multiplier for high-quality source data but does not repair low-quality source data.
March 27, 2026
Original Paper
Synthetic Rewriting as a Quality Multiplier: Evidence from Portuguese Continued Pretraining
arXiv · 2603.24826
The Takeaway
The study provides critical evidence that the benefits of synthetic data are scale-dependent and heavily contingent on the initial quality of the source corpus. It challenges the hope that LLM-driven rewriting can bypass the need for rigorous data curation.
From the abstract
Synthetic data generation through document rewriting has emerged as a promising technique for improving language model pretraining, yet most studies focus on English and do not systematically control for the quality of the source data being rewritten. We present a controlled study of how synthetic rewriting interacts with source data quality in the context of Portuguese continued pretraining. Starting from ClassiCC-PT, a Portuguese corpus annotated with STEM and Educational quality scores, we co