A new tokenization architecture reduces the 'Token Tax' for complex non-Latin scripts by over 60%.
March 27, 2026
Original Paper
Separate Before You Compress: The WWHO Tokenization Architecture
arXiv · 2603.25309
The Takeaway
Standard BPE tokenizers are biased toward Latin scripts, making LLM inference significantly more expensive and less efficient for users in the Global South. This architecture separates linguistic rules from statistical compression, drastically improving reasoning efficiency and lowering costs for Abugida scripts such as Devanagari (used for Hindi) and Sinhala.
From the abstract
Current Large Language Models (LLMs) mostly use Byte Pair Encoding (BPE) based tokenizers, which are very effective for simply structured Latin scripts such as the one used for English. However, standard BPE tokenizers struggle to process complex Abugida scripts due to their structural complexity. The problem is that these tokenizers break complex conjuncts, which are multi-codepoint grapheme clusters, into meaningless sub-character units. This degrades the LLM's reasoning efficiency by forcing it to learn basic […]
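To make the "meaningless sub-character units" concrete, here is a minimal sketch (plain Python, not the paper's code) of why an Abugida conjunct fragments at the byte level while a Latin word of similar visual length does not:

```python
# A Devanagari conjunct such as "क्ष" (KA + VIRAMA + SSA) is one grapheme
# cluster to a reader, but three Unicode codepoints and nine UTF-8 bytes --
# raw material that a byte-level BPE can split arbitrarily.
conjunct = "\u0915\u094d\u0937"  # क्ष
latin = "cat"

print(len(conjunct))                   # 3 codepoints, 1 visual unit
print(len(conjunct.encode("utf-8")))   # 9 bytes
print(len(latin), len(latin.encode("utf-8")))  # 3 codepoints, 3 bytes

# Without script-aware merge rules, each of those nine bytes can surface
# as its own token, none of which corresponds to a meaningful sub-word.
```

The asymmetry is the "Token Tax" in miniature: the same visual unit costs three times as many bytes, and typically many more tokens, before any BPE merges are learned.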