SeriesFusion
Science, curated & edited by AI

Scaling laws for AI models are more accurate when data is measured in raw bytes rather than in tokens.

Modern AI treats the token as the fundamental unit for measuring training data, but this study argues that the byte is the better metric. In compute-optimal setups, the number of parameters should scale with the size of the raw data in bytes. This challenges the industry-standard Chinchilla scaling laws, which ignore how the data is tokenized. By measuring data in bytes, researchers can more accurately predict how a model will perform as it grows, a shift that could lead to much more efficient training recipes for future frontier models.
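To make the unit question concrete, here is a minimal Python sketch, not taken from the paper: it pairs the standard C ≈ 6·N·D FLOPs approximation with the commonly quoted ~20-tokens-per-parameter Chinchilla rule of thumb, then converts the resulting token budget to raw bytes under a few hypothetical compression rates (average bytes of text per token).

    # Illustrative only: standard C ~= 6*N*D FLOPs approximation plus the
    # ~20 tokens-per-parameter Chinchilla rule of thumb; the compression
    # rates below are hypothetical, not the paper's fitted values.

    def chinchilla_split(compute_flops: float) -> tuple[float, float]:
        """Return (params N, training tokens D) with D ~= 20 * N,
        so C ~= 6 * N * D implies N = sqrt(C / 120)."""
        n_params = (compute_flops / 120) ** 0.5
        return n_params, 20 * n_params

    def tokens_to_bytes(d_tokens: float, bytes_per_token: float) -> float:
        """Convert a token budget to raw text bytes via the tokenizer's
        compression rate (average bytes of text per token)."""
        return d_tokens * bytes_per_token

    if __name__ == "__main__":
        n, d_tok = chinchilla_split(1e21)  # illustrative FLOPs budget
        for rate in (3.0, 4.4, 6.0):       # hypothetical bytes/token
            print(f"r={rate} B/tok: N={n:.2e} params, D={d_tok:.2e} tokens"
                  f" = {tokens_to_bytes(d_tok, rate):.2e} bytes")

The same token-denominated recipe maps to roughly 2x different amounts of raw text depending on the tokenizer, which is exactly the ambiguity that byte-denominated scaling laws remove.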

Original Paper

Compute Optimal Tokenization

Tomasz Limisiewicz, Artidoro Pagnoni, Srini Iyer, Mike Lewis, Sachin Mehta, Alisa Liu, Margaret Li, Gargi Ghosh, Luke Zettlemoyer

arXiv  ·  2605.01188

Scaling laws enable the optimal selection of data amount and language model size, yet the impact of the data unit, the token, on this relationship remains underexplored. In this work, we systematically investigate how the information granularity of tokens, controlled by the compression rate (i.e., average bytes of text per token), affects scaling trends. We train 988 Byte Latent Transformer (BLT) models ranging from 50M to 7B parameters, an architecture that lets us set the desired compression rate directly. This flexibility…
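The abstract's control knob is the compression rate: average bytes of text per token. The sketch below shows one way that quantity could be measured for an arbitrary tokenizer; toy_tokenize is a hypothetical whitespace stand-in (the paper instead uses BLT models, whose patching sets the rate directly).

    # Sketch of the compression rate from the abstract: average bytes of
    # UTF-8 text per token. toy_tokenize is a hypothetical stand-in;
    # substitute any real tokenizer's encode function.

    def toy_tokenize(text: str) -> list[str]:
        # Whitespace splitting, for illustration only.
        return text.split()

    def compression_rate(texts: list[str], tokenize=toy_tokenize) -> float:
        """Average bytes of raw text per token over a corpus."""
        total_bytes = sum(len(t.encode("utf-8")) for t in texts)
        total_tokens = sum(len(tokenize(t)) for t in texts)
        return total_bytes / total_tokens

    if __name__ == "__main__":
        corpus = ["Scaling laws relate data amount and model size.",
                  "Bytes per token varies across tokenizers."]
        print(f"compression rate: {compression_rate(corpus):.2f} bytes/token")

A higher rate means coarser tokens, i.e., more text packed into each token; sweeping this rate while holding raw bytes fixed is what lets the study separate tokenizer choice from data amount.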