A fully open industrial-scale pretraining project releasing 8T tokens of processed data, a 3B model, and 200+ controlled pretraining ablations.
March 31, 2026
Original Paper
daVinci-LLM: Towards the Science of Pretraining
arXiv · 2603.27164
The Takeaway
daVinci-LLM bridges the gap between commercial 'black box' pretraining and academic resource constraints. Its systematic exploration of data processing depth versus data volume provides rare, actionable insight into the 'Data Darwinism' required for high-performing foundation models.
From the abstract
The foundational pretraining phase determines a model's capability ceiling, as post-training struggles to overcome the foundations established during pretraining, yet it remains critically under-explored. This stems from a structural paradox: organizations with computational resources operate under commercial pressures that inhibit transparent disclosure, while academic institutions possess research freedom but lack pretraining-scale computational resources. daVinci-LLM occupies this unex