AI & ML Scaling Insight

Synthetic Mixed Training allows an 8B model to finally outperform RAG on long-document comprehension by combining synthetic QAs with rewritten documents.

March 26, 2026

Original Paper

Synthetic Mixed Training: Scaling Parametric Knowledge Acquisition Beyond RAG

Seungju Han, Konwoo Kim, Chanwoo Park, Benjamin Newman, Suhas Kotha, Jaehun Jung, James Zou, Yejin Choi

arXiv · 2603.23562

The Takeaway

The paper provides a blueprint for breaking the 'RAG ceiling', the common pattern where parametric knowledge trails retrieval. By demonstrating log-linear scaling with both synthetic data volume and generator strength, it shows how models can internalize large corpora more effectively than retrieving from them at inference time.
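To make the log-linear claim concrete, here is a minimal sketch of what such a scaling curve looks like: accuracy grows by a roughly fixed increment for every 10x increase in synthetic training tokens. The coefficients below are invented for illustration and are not taken from the paper.

```python
import math

def loglinear_score(tokens, a=0.40, b=0.05):
    """Hypothetical log-linear scaling curve: downstream accuracy
    grows with the log of synthetic training tokens.
    `a` and `b` are illustrative coefficients, not fitted values."""
    return a + b * math.log10(tokens)

# Each 10x increase in synthetic tokens adds the same increment b.
for tokens in (1e6, 1e7, 1e8, 1e9):
    print(f"{tokens:>8.0e} tokens -> predicted score {loglinear_score(tokens):.2f}")
```

Under such a curve, doubling generator strength or data volume never saturates abruptly; gains shrink per token but stay predictable on a log scale, which is what makes the scaling recipe actionable.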

From the abstract

Synthetic data augmentation helps language models learn new knowledge in data-constrained domains. However, naively scaling existing synthetic data methods by training on more synthetic tokens or using stronger generators yields diminishing returns below the performance of RAG. To break the RAG ceiling, we introduce Synthetic Mixed Training, which combines synthetic QAs and synthetic documents. This leverages their complementary training signals, and enables log-linear improvements as both synthetic data volume and generator strength scale.
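The core recipe, combining synthetic QA pairs with rewritten documents into one training mixture, can be sketched as follows. The function name, the `qa_ratio` mixing knob, and the formatting of examples are hypothetical illustrations, not the paper's exact procedure.

```python
import random

def mix_training_examples(qa_pairs, rewritten_docs, qa_ratio=0.5, seed=0):
    """Sketch of Synthetic Mixed Training data construction.

    qa_pairs: list of (question, answer) tuples generated from the corpus.
    rewritten_docs: list of rewritten/paraphrased source documents.
    qa_ratio: hypothetical target fraction of QA-style examples in the mix.
    Returns one shuffled list of training strings drawing on both sources.
    """
    rng = random.Random(seed)
    qa_texts = [f"Q: {q}\nA: {a}" for q, a in qa_pairs]
    # Largest mixture size achievable at the requested ratio.
    n_total = min(len(qa_texts) / max(qa_ratio, 1e-9),
                  len(rewritten_docs) / max(1 - qa_ratio, 1e-9))
    n_qa = int(n_total * qa_ratio)
    n_doc = int(n_total * (1 - qa_ratio))
    mixed = rng.sample(qa_texts, n_qa) + rng.sample(list(rewritten_docs), n_doc)
    rng.shuffle(mixed)
    return mixed

qa = [("What is X?", "X is a thing."), ("Who wrote Y?", "Author Z.")]
docs = ["Rewritten document one.", "Rewritten document two."]
print(mix_training_examples(qa, docs, qa_ratio=0.5))
```

The intuition from the abstract is that the two sources carry complementary signals: QA pairs teach retrieval-style recall, while rewritten documents expose the model to the underlying content in varied surface forms, so training on the mixture outperforms scaling either source alone.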