OmniVoice: Towards Omnilingual Zero-Shot Text-to-Speech with Diffusion Language Models
arXiv:2604.00688v2 Announce Type: new Abstract: We present OmniVoice, a massive multilingual zero-shot text-to-speech (TTS) model that scales to over 600 languages. At its core is a novel diffusion language model-style discrete non-autoregressive (NAR) architecture. Unlike conventional discrete NAR models that suffer from performance bottlenecks in complex two-stage (text-to-semantic-to-acoustic) pipelines, OmniVoice directly maps text to multi-codebook acoustic tokens. This simplified approach is facilitated by two key technical innovations: (1) a full-codebook random masking strategy for efficient training, and (2) initialization from a pre-trained LLM to ensure superior intelligibility. By leveraging a 581k-hour multilingual dataset curated entirely from open-source data, OmniVoice achieves the broadest language coverage to date and delivers state-of-the-art performance across Chinese, English, and diverse multilingual benchmarks. Our code and pre-trained models are publicly available at https://github.com/k2-fsa/OmniVoice.
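The "full-codebook random masking" strategy can be illustrated with a short sketch. This is not the paper's code; the function name, mask token id, and per-frame masking decision are assumptions, but the core idea follows the abstract: at a masked time step, the tokens of *all* codebooks are masked together rather than masking codebooks independently.

```python
import numpy as np

def full_codebook_random_mask(tokens, mask_id, mask_ratio, rng):
    """Illustrative sketch (not the paper's implementation): mask a random
    subset of time steps, replacing the tokens of ALL codebooks at those
    positions so the model must predict every codebook jointly.

    tokens: (T, C) int array of acoustic tokens (T frames, C codebooks).
    Returns (masked_tokens, mask) where mask marks the training targets.
    """
    T, C = tokens.shape
    mask = rng.random(T) < mask_ratio   # one masking decision per frame
    masked = tokens.copy()
    masked[mask, :] = mask_id           # every codebook masked together
    return masked, mask

rng = np.random.default_rng(0)
tokens = rng.integers(0, 1024, size=(8, 4))   # toy 8-frame, 4-codebook clip
masked, mask = full_codebook_random_mask(tokens, mask_id=1024,
                                         mask_ratio=0.5, rng=rng)
```

During training, the cross-entropy loss would be computed only at the masked frames, across all codebooks at once.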
Executive Summary
OmniVoice introduces a groundbreaking multilingual zero-shot TTS model leveraging a diffusion-style NAR architecture to map text directly to acoustic tokens across over 600 languages. By integrating a full-codebook random masking strategy and pre-trained LLM initialization, the model achieves unprecedented scalability and performance without the bottlenecks typical of conventional two-stage pipelines. The open-source dataset and model availability enhance reproducibility and accessibility. This represents a significant leap in multilingual speech synthesis capabilities.
Key Points
- ▸ Scalability to 600+ languages via diffusion-style NAR architecture
- ▸ Use of full-codebook random masking for efficient training
- ▸ Pre-trained LLM initialization enhances intelligibility
Merits
Scalability
OmniVoice expands multilingual coverage beyond prior benchmarks, offering a unified solution for a vast array of languages.
Innovation
The architecture simplifies the pipeline by eliminating the intermediate text-to-semantic stage, reducing computational complexity and latency.
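A diffusion-LM-style NAR model typically generates via iterative unmasking rather than left-to-right decoding. The sketch below is an assumed, generic mask-predict sampler, not the paper's exact procedure: start fully masked, and at each step commit the model's most confident frames while re-masking the rest.

```python
import numpy as np

def iterative_unmask_decode(predict_fn, T, C, mask_id, steps):
    """Toy sketch of diffusion-LM-style NAR decoding (assumed procedure,
    not OmniVoice's exact sampler). predict_fn(tokens) must return
    (pred_tokens of shape (T, C), per-frame confidence of shape (T,))."""
    tokens = np.full((T, C), mask_id)
    for step in range(steps):
        pred, conf = predict_fn(tokens)
        # keep a growing fraction of the most confident frames each step
        k = int(np.ceil((step + 1) / steps * T))
        keep = np.argsort(-conf)[:k]
        tokens = np.full((T, C), mask_id)
        tokens[keep] = pred[keep]
    return tokens

def toy_predict(tokens):
    # Stand-in for the real network: deterministic tokens, rising confidence.
    T, C = tokens.shape
    return np.tile(np.arange(T)[:, None], (1, C)), np.arange(T, dtype=float)

out = iterative_unmask_decode(toy_predict, T=6, C=2, mask_id=99, steps=3)
```

Because the schedule reaches k = T at the final step, every frame is committed by the end, so decoding finishes in a fixed number of forward passes regardless of sequence length.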
Demerits
Data Dependency
Reliance on open-source data may introduce variability in quality and representation across languages, potentially affecting consistency in performance.
Expert Commentary
OmniVoice marks a pivotal evolution in zero-shot TTS by reimagining the pipeline architecture to bypass traditional constraints. The diffusion-style NAR model’s direct mapping mechanism represents a conceptual shift from pipeline-centric to token-centric generation, which aligns with emerging trends in multimodal AI. The use of a pre-trained LLM for initialization is particularly astute—it mitigates the risk of semantic misalignment without introducing additional training overhead. However, the model’s success hinges on the quality of the curated dataset; any inconsistencies in open-source data (e.g., mismatched accents, dialectal variations, or lack of speaker diversity) could propagate into output artifacts. Furthermore, while the reported performance metrics are impressive, sustained evaluation across low-resource languages remains essential to validate long-term effectiveness. Overall, OmniVoice sets a new standard for multilingual speech synthesis and warrants rigorous independent validation.
Recommendations
- ✓ Conduct comparative evaluations with human annotators across underrepresented languages to assess perceptual quality and linguistic fidelity
- ✓ Publish detailed metadata on dataset curation criteria to enhance transparency and enable reproducibility
Sources
Original: arXiv - cs.CL