A new tokenization architecture reduces the 'Token Tax' for complex non-Latin scripts by over 60%.
March 27, 2026
Original Paper
Separate Before You Compress: The WWHO Tokenization Architecture
arXiv · 2603.25309
The Takeaway
Standard BPE tokenizers are biased toward Latin scripts, making LLM inference significantly more expensive and less efficient for users in the Global South. This architecture separates linguistic rules from statistical compression, drastically improving reasoning efficiency and lowering costs for Abugida scripts such as Devanagari (used for Hindi) and Sinhala.
From the abstract
Current Large Language Models (LLMs) mostly use Byte Pair Encoding (BPE) based tokenizers, which are very effective for simply structured Latin scripts such as the one used for English. However, standard BPE tokenizers struggle to process complex Abugida scripts due to their structural complexity. The problem is that these tokenizers break complex conjuncts, which are multi-codepoint grapheme clusters, into meaningless sub-character units. This degrades the LLM's reasoning efficiency by forcing it to learn basic […]
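To make the "meaningless sub-character units" concrete, here is a minimal sketch (plain Python, not the paper's code) of why an Abugida conjunct fragments at the byte level while a Latin word of similar visual length does not:

```python
# A Devanagari conjunct such as "क्ष" (KA + VIRAMA + SSA) is one grapheme
# cluster to a reader, but three Unicode codepoints and nine UTF-8 bytes --
# raw material that a byte-level BPE can split arbitrarily.
conjunct = "\u0915\u094d\u0937"  # क्ष
latin = "cat"

print(len(conjunct))                   # 3 codepoints, 1 visual unit
print(len(conjunct.encode("utf-8")))   # 9 bytes
print(len(latin), len(latin.encode("utf-8")))  # 3 codepoints, 3 bytes

# Without script-aware merge rules, each of those nine bytes can surface
# as its own token, none of which corresponds to a meaningful sub-word.
```

The asymmetry is the "Token Tax" in miniature: the same visual unit costs three times as many bytes, and typically many more tokens, before any BPE merges are learned.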