A fully open industrial-scale pretraining project releasing 8T tokens of processed data, a 3B model, and 200+ controlled pretraining ablations.
March 31, 2026
Original Paper
daVinci-LLM: Towards the Science of Pretraining
arXiv · 2603.27164
The Takeaway
daVinci-LLM bridges the gap between commercial 'black box' pretraining and academic resource constraints. Its systematic exploration of data processing depth versus data volume provides rare, actionable insight into the 'Data Darwinism' required for high-performing foundation models.
From the abstract
The foundational pretraining phase determines a model's capability ceiling, as post-training struggles to overcome the foundations established during pretraining, yet it remains critically under-explored. This stems from a structural paradox: organizations with computational resources operate under commercial pressures that inhibit transparent disclosure, while academic institutions possess research freedom but lack pretraining-scale computational resources. daVinci-LLM occupies this unex