AI & ML Scaling Insight

Shows that synthetic rewriting acts as a quality multiplier for high-quality data but does not repair low-quality source data.

March 27, 2026

Original Paper

Synthetic Rewriting as a Quality Multiplier: Evidence from Portuguese Continued Pretraining

Thales Sales Almeida, Rodrigo Nogueira, Hélio Pedrini

arXiv · 2603.24826

The Takeaway

The study provides evidence that the benefits of synthetic rewriting are scale-dependent and depend heavily on the quality of the source corpus. It challenges the hope that LLM-driven rewriting can substitute for rigorous data curation.

From the abstract

Synthetic data generation through document rewriting has emerged as a promising technique for improving language model pretraining, yet most studies focus on English and do not systematically control for the quality of the source data being rewritten. We present a controlled study of how synthetic rewriting interacts with source data quality in the context of Portuguese continued pretraining. Starting from ClassiCC-PT, a Portuguese corpus annotated with STEM and Educational quality scores, we co

