OmniVoice: Towards Omnilingual Zero-Shot Text-to-Speech with Diffusion Language Models
arXiv:2604.00688v2 Announce Type: new Abstract: We present OmniVoice, a massive multilingual zero-shot text-to-speech (TTS) model that scales to over 600 languages. At its core is a novel diffusion language model-style discrete non-autoregressive (NAR) architecture. Unlike conventional discrete NAR models that suffer from performance bottlenecks in complex two-stage (text-to-semantic-to-acoustic) pipelines, OmniVoice directly maps text to multi-codebook acoustic tokens. This simplified approach is facilitated by two key technical innovations: (1) a full-codebook random masking strategy for efficient training, and (2) initialization from a pre-trained LLM to ensure superior intelligibility. By leveraging a 581k-hour multilingual dataset curated entirely from open-source data, OmniVoice achieves the broadest language coverage to date and delivers state-of-the-art performance across Chinese, English, and diverse multilingual benchmarks. Our code and pre-trained models are publicly available at https://github.com/k2-fsa/OmniVoice.
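The "full-codebook random masking" strategy can be illustrated with a short sketch. This is not the paper's code; the function name, mask token id, and per-frame masking decision are assumptions, but the core idea follows the abstract: at a masked time step, the tokens of *all* codebooks are masked together rather than masking codebooks independently.

```python
import numpy as np

def full_codebook_random_mask(tokens, mask_id, mask_ratio, rng):
    """Illustrative sketch (not the paper's implementation): mask a random
    subset of time steps, replacing the tokens of ALL codebooks at those
    positions so the model must predict every codebook jointly.

    tokens: (T, C) int array of acoustic tokens (T frames, C codebooks).
    Returns (masked_tokens, mask) where mask marks the training targets.
    """
    T, C = tokens.shape
    mask = rng.random(T) < mask_ratio   # one masking decision per frame
    masked = tokens.copy()
    masked[mask, :] = mask_id           # every codebook masked together
    return masked, mask

rng = np.random.default_rng(0)
tokens = rng.integers(0, 1024, size=(8, 4))   # toy 8-frame, 4-codebook clip
masked, mask = full_codebook_random_mask(tokens, mask_id=1024,
                                         mask_ratio=0.5, rng=rng)
```

During training, the cross-entropy loss would be computed only at the masked frames, across all codebooks at once.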
Executive Summary
OmniVoice introduces a groundbreaking multilingual zero-shot TTS model leveraging a diffusion-style NAR architecture to map text directly to acoustic tokens across over 600 languages. By integrating a full-codebook random masking strategy and pre-trained LLM initialization, the model achieves unprecedented scalability and performance without the bottlenecks typical of conventional two-stage pipelines. The open-source dataset and model availability enhance reproducibility and accessibility. This represents a significant leap in multilingual speech synthesis capabilities.
Key Points
- ▸ Scalability to 600+ languages via diffusion-style NAR architecture
- ▸ Use of full-codebook random masking for efficient training
- ▸ Pre-trained LLM initialization enhances intelligibility
Merits
Scalability
OmniVoice expands multilingual coverage beyond prior benchmarks, offering a unified solution for a vast array of languages.
Innovation
The architecture simplifies the pipeline by eliminating the intermediate text-to-semantic stage, reducing computational complexity and latency.
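A diffusion-LM-style NAR model typically generates via iterative unmasking rather than left-to-right decoding. The sketch below is an assumed, generic mask-predict sampler, not the paper's exact procedure: start fully masked, and at each step commit the model's most confident frames while re-masking the rest.

```python
import numpy as np

def iterative_unmask_decode(predict_fn, T, C, mask_id, steps):
    """Toy sketch of diffusion-LM-style NAR decoding (assumed procedure,
    not OmniVoice's exact sampler). predict_fn(tokens) must return
    (pred_tokens of shape (T, C), per-frame confidence of shape (T,))."""
    tokens = np.full((T, C), mask_id)
    for step in range(steps):
        pred, conf = predict_fn(tokens)
        # keep a growing fraction of the most confident frames each step
        k = int(np.ceil((step + 1) / steps * T))
        keep = np.argsort(-conf)[:k]
        tokens = np.full((T, C), mask_id)
        tokens[keep] = pred[keep]
    return tokens

def toy_predict(tokens):
    # Stand-in for the real network: deterministic tokens, rising confidence.
    T, C = tokens.shape
    return np.tile(np.arange(T)[:, None], (1, C)), np.arange(T, dtype=float)

out = iterative_unmask_decode(toy_predict, T=6, C=2, mask_id=99, steps=3)
```

Because the schedule reaches k = T at the final step, every frame is committed by the end, so decoding finishes in a fixed number of forward passes regardless of sequence length.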
Demerits
Data Dependency
Reliance on open-source data may introduce variability in quality and representation across languages, potentially affecting consistency in performance.
Expert Commentary
OmniVoice marks a pivotal evolution in zero-shot TTS by reimagining the pipeline architecture to bypass traditional constraints. The diffusion-style NAR model’s direct mapping mechanism represents a conceptual shift from pipeline-centric to token-centric generation, which aligns with emerging trends in multimodal AI. The use of a pre-trained LLM for initialization is particularly astute—it mitigates the risk of semantic misalignment without introducing additional training overhead. However, the model’s success hinges on the quality of the curated dataset; any inconsistencies in open-source data (e.g., mismatched accents, dialectal variations, or lack of speaker diversity) could propagate into output artifacts. Furthermore, while the reported performance metrics are impressive, sustained evaluation across low-resource languages remains essential to validate long-term effectiveness. Overall, OmniVoice sets a new standard for multilingual speech synthesis and warrants rigorous independent validation.
Recommendations
- ✓ Conduct comparative evaluations with human annotators across underrepresented languages to assess perceptual quality and linguistic fidelity
- ✓ Publish detailed metadata on dataset curation criteria to enhance transparency and enable reproducibility
Sources
Original: arXiv - cs.CL