DiscoPhon: Benchmarking the Unsupervised Discovery of Phoneme Inventories With Discrete Speech Units

arXiv:2603.18612v1

Abstract: We introduce DiscoPhon, a multilingual benchmark for evaluating unsupervised phoneme discovery from discrete speech units. DiscoPhon covers 6 dev and 6 test languages, chosen to span a wide range of phonemic contrasts. Given only 10 hours of speech in a previously unseen language, systems must produce discrete units that are mapped to a predefined phoneme inventory, through either a many-to-one or a one-to-one assignment. The resulting sequences are evaluated for unit quality, recognition and segmentation. We provide four pretrained multilingual HuBERT and SpidR baselines, and show that phonemic information is available enough in current models for derived units to correlate well with phonemes, though with variations across languages.
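
The many-to-one and one-to-one assignments described in the abstract can be illustrated with a small Python sketch. It assumes frame-aligned discrete units and reference phoneme labels and scores each assignment by frame-level co-occurrence; these choices, and every function name below, are hypothetical illustrations rather than DiscoPhon's actual protocol or code. The one-to-one search is exhaustive and only suitable for toy inventories; at realistic sizes an optimal assignment solver such as the Hungarian algorithm (for example `scipy.optimize.linear_sum_assignment`) would be the usual replacement.

```python
from collections import Counter, defaultdict
from itertools import permutations

# Hypothetical sketch: assumes `units` and `phones` are frame-aligned lists
# (one discrete unit ID and one reference phoneme per frame).


def cooccurrence(units, phones):
    """Count how often each unit co-occurs with each phoneme, frame by frame."""
    if len(units) != len(phones):
        raise ValueError("units and phones must be frame-aligned")
    counts = defaultdict(Counter)
    for u, p in zip(units, phones):
        counts[u][p] += 1
    return counts


def many_to_one(units, phones):
    """Map each unit to its most frequent phoneme; several units may share one phoneme."""
    counts = cooccurrence(units, phones)
    return {u: c.most_common(1)[0][0] for u, c in counts.items()}


def one_to_one(units, phones):
    """Injective unit->phoneme map maximising matched frames (exhaustive search, toy sizes only)."""
    counts = cooccurrence(units, phones)
    unit_list = sorted(counts)
    inventory = sorted(set(phones))
    if len(unit_list) <= len(inventory):
        # Every unit gets a distinct phoneme; some phonemes stay unused.
        candidates = (dict(zip(unit_list, perm))
                      for perm in permutations(inventory, len(unit_list)))
    else:
        # Every phoneme gets a distinct unit; leftover units stay unmapped.
        candidates = (dict(zip(perm, inventory))
                      for perm in permutations(unit_list, len(inventory)))
    return max(candidates, key=lambda m: sum(counts[u][p] for u, p in m.items()))


def frame_accuracy(units, phones, mapping):
    """Share of frames whose mapped unit equals the reference phoneme (unmapped units count as errors)."""
    hits = sum(mapping.get(u) == p for u, p in zip(units, phones))
    return hits / len(phones)


units = [0, 0, 1, 1, 2, 2, 3, 3]
phones = ["a", "a", "a", "a", "b", "b", "b", "c"]
m2o = many_to_one(units, phones)
o2o = one_to_one(units, phones)
print(m2o, frame_accuracy(units, phones, m2o))  # → {0: 'a', 1: 'a', 2: 'b', 3: 'b'} 0.875
print(o2o, frame_accuracy(units, phones, o2o))  # → {0: 'a', 2: 'b', 3: 'c'} 0.625
```

Under the many-to-one map, frame accuracy equals the classic cluster-purity score. The one-to-one map is stricter because surplus units cannot share a phoneme, which is why unit 1 is left unmapped and the score drops in this toy example.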

Maxime Poli, Manel Khentout, Angelo Ortiz Tandazo, Ewan Dunbar, Emmanuel Chemla, Emmanuel Dupoux · March 20, 2026

#cs.CL #cs.SD #eess.AS

Executive Summary

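This paper introduces DiscoPhon, a multilingual benchmark for evaluating unsupervised phoneme discovery from discrete speech units. The benchmark covers 12 languages (6 development and 6 test), chosen to span a wide range of phonemic contrasts. Given only 10 hours of speech in a previously unseen language, systems must produce discrete units that are then mapped to a predefined phoneme inventory through either a many-to-one or a one-to-one assignment, and the resulting sequences are evaluated for unit quality, recognition, and segmentation. The authors provide four pretrained multilingual HuBERT and SpidR baselines and show that units derived from current models correlate well with phonemes, although performance varies across languages. The study contributes to research on speech processing and phoneme discovery, with potential downstream relevance to natural language processing, machine learning, and speech therapy.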

Key Points

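▸ DiscoPhon is a multilingual benchmark for unsupervised phoneme discovery from discrete speech units
▸ The benchmark covers 12 languages (6 dev, 6 test); units learned from 10 hours of speech are mapped to a predefined phoneme inventory by a many-to-one or one-to-one assignment
▸ Four pretrained multilingual HuBERT and SpidR baselines yield units that correlate well with phonemes, with variation across languages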

Merits

Advancement of Speech Processing

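By providing a shared multilingual benchmark and baselines, the study gives researchers a common way to measure how much phonemic information self-supervised speech models encode. This can guide the development of more accurate speech processing models for natural language processing and machine learning applications.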

Phoneme Discovery Applications

The benchmark and its baselines could inform work on speech recognition, language learning, and speech therapy, and in turn human-computer interaction and accessibility, although the paper itself evaluates phoneme discovery rather than these applications.

Demerits

Limited Dataset Size

Systems receive only 10 hours of speech per language. This budget is a condition of the task rather than an incidental gap in the data, but it may still limit the generalizability of the results and the robustness of the derived units.

Lack of Human Evaluation

Systems are evaluated solely with automatic metrics of unit quality, recognition, and segmentation, which may not capture the nuances of human perception and judgment.
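
To make the automatic side concrete, the sketch below shows two common ways recognition and segmentation can be scored once units have been mapped to phonemes: a phoneme error rate based on Levenshtein distance, and boundary precision, recall, and F1 within a frame tolerance. DiscoPhon's exact metric definitions are not given here, so the collapsing of repeated frame labels, the greedy boundary matching, the default tolerance, and all function names are assumptions for illustration.

```python
# Hypothetical sketch of automatic recognition and segmentation scores;
# not DiscoPhon's reference implementation.


def collapse(frames):
    """Merge runs of identical consecutive frame labels into one token."""
    return [x for i, x in enumerate(frames) if i == 0 or x != frames[i - 1]]


def edit_distance(ref, hyp):
    """Levenshtein distance with unit-cost substitutions, insertions, and deletions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            cur[j] = min(prev[j] + 1,             # delete r
                         cur[j - 1] + 1,          # insert h
                         prev[j - 1] + (r != h))  # substitute or match
        prev = cur
    return prev[-1]


def phoneme_error_rate(ref_frames, hyp_frames):
    """Edit distance between collapsed sequences, normalised by reference length."""
    ref, hyp = collapse(ref_frames), collapse(hyp_frames)
    return edit_distance(ref, hyp) / len(ref)


def boundaries(frames):
    """Frame indices at which the label changes."""
    return [i for i in range(1, len(frames)) if frames[i] != frames[i - 1]]


def boundary_scores(ref_b, hyp_b, tolerance=1):
    """Greedy one-to-one matching of boundaries within `tolerance` frames -> (precision, recall, F1)."""
    unmatched = list(ref_b)
    hits = 0
    for h in hyp_b:
        match = next((r for r in unmatched if abs(r - h) <= tolerance), None)
        if match is not None:
            unmatched.remove(match)
            hits += 1
    precision = hits / len(hyp_b) if hyp_b else 0.0
    recall = hits / len(ref_b) if ref_b else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


ref = list("aabbbc")  # reference phoneme per frame
hyp = list("aabbdd")  # mapped unit label per frame
print(phoneme_error_rate(ref, hyp))  # one substitution (c -> d) over 3 reference phonemes
print(boundary_scores(boundaries(ref), boundaries(hyp), tolerance=0))  # → (0.5, 0.5, 0.5)
print(boundary_scores(boundaries(ref), boundaries(hyp), tolerance=1))  # → (1.0, 1.0, 1.0)
```

The tolerance strongly affects boundary scores, as the two calls show; real evaluations may also match boundaries optimally rather than greedily.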

Expert Commentary

The study presents a broad evaluation of unsupervised phoneme discovery in a multilingual setting, pairing a carefully designed benchmark with pretrained multilingual HuBERT and SpidR baselines. The results are promising, but the limited speech budget per language and the absence of human evaluation should be addressed in future work. The findings are relevant to speech processing, phoneme discovery, and language learning, and can inform the development of more accurate models, with possible downstream benefits for natural language processing, machine learning, and speech therapy.

Recommendations

✓ Future studies should address the current study's limitations, for example by evaluating with more speech per language and incorporating human evaluation.
✓ The benchmark and baselines should be applied to other speech processing tasks, such as speech recognition and language learning systems.

Sources

arXiv - cs.CL

DiscoPhon: Benchmarking the Unsupervised Discovery of Phoneme Inventories With Discrete Speech Units

JCG, PC

HSOLLC Co., Ltd.