OmniVoice is an open-source TTS model scaling to over 600 languages using a novel diffusion language model architecture.
April 2, 2026
Original Paper
OmniVoice: Towards Omnilingual Zero-Shot Text-to-Speech with Diffusion Language Models
arXiv · 2604.00688
The Takeaway
OmniVoice brings high-quality, zero-shot speech synthesis to hundreds of low-resource languages previously ignored by proprietary models. By mapping text directly to multi-codebook acoustic tokens, it bypasses the two-stage (text-to-semantic-to-acoustic) pipelines common in conventional discrete non-autoregressive TTS systems.
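To make the single-stage idea concrete, here is a minimal sketch of how a discrete non-autoregressive decoder could fill a grid of multi-codebook acoustic tokens in a few parallel passes. It is not OmniVoice's implementation. The `MASK` sentinel, `toy_predictor`, the linear unmasking schedule, and the confidence-based selection rule are all assumptions borrowed from generic masked-token decoding, and the predictor returns random logits in place of a trained network.

```python
import numpy as np

MASK = -1  # hypothetical sentinel for a token that has not been decoded yet


def toy_predictor(text_ids, tokens, vocab_size, rng):
    """Stand-in for the network (assumption): random logits of shape (C, T, V).

    A real model would condition on text_ids and the partially decoded tokens.
    """
    num_codebooks, num_frames = tokens.shape
    return rng.standard_normal((num_codebooks, num_frames, vocab_size))


def decode(text_ids, num_codebooks=4, num_frames=12, vocab_size=16, steps=4, seed=0):
    """Iteratively unmask a (codebooks x frames) token grid in `steps` passes."""
    rng = np.random.default_rng(seed)
    tokens = np.full((num_codebooks, num_frames), MASK, dtype=np.int64)
    total = tokens.size

    for step in range(1, steps + 1):
        logits = toy_predictor(text_ids, tokens, vocab_size, rng)

        # Softmax over the vocabulary axis.
        probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
        probs /= probs.sum(axis=-1, keepdims=True)

        candidates = probs.argmax(axis=-1)  # most likely token per position
        confidence = probs.max(axis=-1)     # its probability

        masked = tokens == MASK
        confidence[~masked] = -np.inf       # never overwrite decoded positions

        # Linear schedule (assumption): decode an equal share of the grid per step.
        target_decoded = (total * step) // steps
        num_to_unmask = target_decoded - int((~masked).sum())
        if num_to_unmask > 0:
            # Commit the most confident masked positions across all codebooks at once.
            flat_idx = np.argsort(confidence, axis=None)[::-1][:num_to_unmask]
            rows, cols = np.unravel_index(flat_idx, tokens.shape)
            tokens[rows, cols] = candidates[rows, cols]

    return tokens


acoustic_tokens = decode(text_ids=[1, 2, 3])
print(acoustic_tokens.shape)  # one row per codebook, one column per frame
```

The point of the sketch is structural: every codebook is predicted in the same pass from the text, with no intermediate semantic-token stage between text and acoustic tokens.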
From the abstract
We present OmniVoice, a massive multilingual zero-shot text-to-speech (TTS) model that scales to over 600 languages. At its core is a novel diffusion language model-style discrete non-autoregressive (NAR) architecture. Unlike conventional discrete NAR models that suffer from performance bottlenecks in complex two-stage (text-to-semantic-to-acoustic) pipelines, OmniVoice directly maps text to multi-codebook acoustic tokens. This simplified approach is facilitated by two key technical innovations: