DiscoPhon: Benchmarking the Unsupervised Discovery of Phoneme Inventories With Discrete Speech Units

arXiv:2603.18612v1. Abstract: We introduce DiscoPhon, a multilingual benchmark for evaluating unsupervised phoneme discovery from discrete speech units. DiscoPhon covers 6 dev and 6 test languages, chosen to span a wide range of phonemic contrasts. Given only 10 hours of speech in a previously unseen language, systems must produce discrete units that are mapped to a predefined phoneme inventory, through either a many-to-one or a one-to-one assignment. The resulting sequences are evaluated for unit quality, recognition and segmentation. We provide four pretrained multilingual HuBERT and SpidR baselines, and show that phonemic information is available enough in current models for derived units to correlate well with phonemes, though with variations across languages.
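
The two assignment schemes named in the abstract can be illustrated with a small sketch. It assumes frame-aligned unit and phoneme labels and builds each mapping from co-occurrence counts. The function names, the toy data, and the greedy one-to-one strategy are illustrative assumptions, not the benchmark's actual procedure; an optimal one-to-one assignment would more typically be solved with the Hungarian algorithm.

```python
from collections import Counter, defaultdict


def cooccurrence(units, phonemes):
    """Count how often each discrete unit co-occurs with each phoneme.

    Assumes both sequences are aligned frame by frame (hypothetical setup).
    """
    if len(units) != len(phonemes):
        raise ValueError("units and phonemes must have the same length")
    counts = defaultdict(Counter)
    for unit, phoneme in zip(units, phonemes):
        counts[unit][phoneme] += 1
    return counts


def many_to_one(counts):
    """Map every unit to its most frequent phoneme; units may share a phoneme."""
    return {unit: c.most_common(1)[0][0] for unit, c in counts.items()}


def one_to_one_greedy(counts):
    """Greedy one-to-one mapping: each unit and each phoneme is used at most once.

    Pairs are visited from the highest count down; ties are broken by unit,
    then phoneme. This is an approximation, not an optimal assignment.
    """
    pairs = sorted(
        ((n, unit, phoneme) for unit, c in counts.items() for phoneme, n in c.items()),
        key=lambda t: (-t[0], t[1], t[2]),
    )
    mapping, used_phonemes = {}, set()
    for _, unit, phoneme in pairs:
        if unit not in mapping and phoneme not in used_phonemes:
            mapping[unit] = phoneme
            used_phonemes.add(phoneme)
    return mapping


# Toy example: four units, two phonemes.
units = [0, 0, 1, 1, 1, 2, 2, 3]
phonemes = ["a", "a", "a", "a", "t", "t", "t", "t"]
counts = cooccurrence(units, phonemes)

print(many_to_one(counts))        # {0: 'a', 1: 'a', 2: 't', 3: 't'}
print(one_to_one_greedy(counts))  # {0: 'a', 2: 't'}
```

In the many-to-one case every unit receives a label, so a larger unit vocabulary can only help coverage. In the one-to-one case, units 1 and 3 are left unmapped because their best phonemes are already taken, which is why the two schemes can rank the same system differently.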

Executive Summary

This article introduces DiscoPhon, a multilingual benchmark for evaluating unsupervised phoneme discovery from discrete speech units. The benchmark covers 12 languages (6 development, 6 test). Given only 10 hours of speech in a previously unseen language, systems must produce discrete units, which are mapped to a predefined phoneme inventory through a many-to-one or a one-to-one assignment and then evaluated for unit quality, recognition, and segmentation. Four pretrained multilingual HuBERT and SpidR baselines are provided; their derived units correlate well with phonemes, though results vary across languages. The study contributes to speech processing and phoneme discovery, with potential applications in natural language processing, machine learning, and speech therapy.

Key Points

  • DiscoPhon is a multilingual benchmark for unsupervised phoneme discovery
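  • The benchmark covers 12 languages (6 dev, 6 test); discrete units are mapped to a predefined phoneme inventory through a many-to-one or a one-to-one assignment
  • Four pretrained multilingual HuBERT and SpidR baselines are provided; their derived units correlate well with phonemes, with variation across languages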

Merits

Advancement of Speech Processing

The benchmark offers a common basis for comparing how well discrete speech units capture phonemic contrasts across languages. This could inform the development of more accurate speech processing models, with implications for natural language processing and machine learning applications.

Phoneme Discovery Applications

The benchmark and baseline models could be applied in speech therapy, language learning, and speech recognition systems, with the aim of improving human-computer interaction and accessibility.

Demerits

Limited Dataset Size

Systems receive only 10 hours of speech per previously unseen language. This constraint is part of the benchmark's design, but it may limit how well the results generalize to larger data regimes and how robust the resulting models are.

Lack of Human Evaluation

As described, the evaluation relies on automatic measures of unit quality, recognition, and segmentation, which may not capture the nuances of human perception and judgment.
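
To make concrete what an automatic recognition metric looks like, here is a minimal sketch of a phone error rate based on edit distance. This is a commonly used measure, chosen here as an assumption for illustration; the source does not specify which metrics DiscoPhon uses, and the function names are hypothetical.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two label sequences (unit costs)."""
    prev = list(range(len(hyp) + 1))  # distances from an empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # cost of deleting the first i reference labels
        for j, h in enumerate(hyp, start=1):
            curr.append(
                min(
                    prev[j] + 1,             # deletion
                    curr[j - 1] + 1,         # insertion
                    prev[j - 1] + (r != h),  # substitution or match
                )
            )
        prev = curr
    return prev[-1]


def phone_error_rate(ref, hyp):
    """Edit distance normalized by the reference length."""
    if not ref:
        raise ValueError("reference sequence must not be empty")
    return edit_distance(ref, hyp) / len(ref)


# Toy example: one inserted phoneme against a three-phoneme reference.
reference = ["k", "a", "t"]
hypothesis = ["k", "a", "p", "t"]
print(edit_distance(reference, hypothesis))  # 1
```

A score like this counts every substitution equally, so confusing two acoustically close phonemes costs as much as confusing two distant ones. That is one way an automatic metric can diverge from human judgment.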

Expert Commentary

The study presents a comprehensive evaluation of unsupervised phoneme discovery in multilingual settings, using state-of-the-art models and a carefully designed benchmark. The results are promising, but the limitations noted above, the 10-hour data budget and the lack of human evaluation, should be addressed in future work. The findings are relevant to speech processing, phoneme discovery, and language learning applications, and can inform the development of more accurate and efficient models.

Recommendations

  • Future studies should aim to address the limitations of the current study, such as increasing the dataset size and incorporating human evaluation.
  • The benchmark and models developed in the study should be applied to other speech processing and phoneme discovery applications, such as speech recognition and language learning systems.
