AI & ML Open Release

A large multimodal release for 10 low-resource African languages that cuts word error rate (WER) by up to 61% relative to the previous state of the art.

April 1, 2026

Original Paper

The Thiomi Dataset: A Large-Scale Multimodal Corpus for Low-Resource African Languages

Hillary Mutisya, John Mugane, Gavin Nyamboga, Brian Chege, Maryruth Gathoni

arXiv · 2603.29244

The Takeaway

With over 600k approved text annotations and 385k audio recordings, the release brings high-performance ASR and TTS within reach for languages (Swahili, Somali, and others) previously ignored by large-scale benchmarks. It provides both a dataset and a blueprint for production-grade local-language technology.
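
For readers unfamiliar with the headline metric, here is a minimal Python sketch of how WER and a relative WER reduction are typically computed. The function names and the example numbers are illustrative assumptions, not values or code from the paper; the paper's own evaluation scripts may normalize text differently.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level edit distance / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1] / len(ref)


def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Fraction of the baseline WER that the new system removes."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - new_wer) / baseline_wer


# Hypothetical numbers chosen only to illustrate the arithmetic:
# a baseline WER of 0.50 falling to 0.195 is a 61% relative reduction,
# even though the absolute drop is 30.5 percentage points.
example_reduction = relative_wer_reduction(0.50, 0.195)
```

The distinction matters when reading the headline: a relative figure says how much of the existing error was removed, so the same 61% can correspond to very different absolute WER values depending on the baseline.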

From the abstract

We present the Thiomi Dataset, a large-scale multimodal corpus spanning ten African languages across four language families: Swahili, Kikuyu, Kamba, Kimeru, Luo, Maasai, Kipsigis, Somali (East Africa); Wolof (West Africa); and Fulani (West/Central Africa). The dataset contains over 601,000 approved sentence-level text annotations and over 385,000 audio recordings across nine languages, collected through a dedicated community data collection platform involving over 100 contributors. The Thiomi pl…
