Phonological Fossils: Machine Learning Detection of Non-Mainstream Vocabulary in Sulawesi Basic Lexicon
arXiv:2604.00023v1 (announce type: new)
Abstract: Basic vocabulary in many Sulawesi Austronesian languages includes forms resisting reconstruction to any proto-form with phonological patterns inconsistent with inherited roots, but whether this non-conforming vocabulary represents pre-Austronesian substrate or independent innovation has not been tested computationally. We combine rule-based cognate subtraction with a machine learning classifier trained on phonological features. Using 1,357 forms from six Sulawesi languages in the Austronesian Basic Vocabulary Database, we identify 438 candidate substrate forms (26.5%) through cognate subtraction and Proto-Austronesian cross-checking. An XGBoost classifier trained on 26 phonological features distinguishes inherited from non-mainstream forms with AUC=0.763, revealing a phonological fingerprint: longer forms, more consonant clusters, higher glottal stop rates, and fewer Austronesian prefixes. Cross-method consensus (Cohen's kappa=0.61) identifies 266 high-confidence non-mainstream candidates. However, clustering yields no coherent word families (silhouette=0.114; cross-linguistic cognate test p=0.569), providing no evidence for a single pre-Austronesian language layer. Application to 16 additional languages confirms geographic patterning: Sulawesi languages show higher predicted non-mainstream rates (mean P_sub=0.606) than Western Indonesian languages (0.393). This study demonstrates that phonological machine learning can complement traditional comparative methods in detecting non-mainstream lexical layers, while cautioning against interpreting phonological non-conformity as evidence for a shared substrate language.
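The abstract reports cross-method agreement as Cohen's kappa (0.61). As a rough illustration of what that statistic measures, the sketch below computes kappa for two binary labelings of the same forms and lists the forms both methods flag. The labels are invented toy data, and the names `rule_based` and `classifier` are hypothetical; nothing here reproduces the paper's actual pipeline or numbers.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: chance-corrected agreement between two labelings."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(labels_a)
    # Observed agreement: share of items given the same label by both methods.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement if the two methods labeled independently,
    # each keeping its own label frequencies.
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both methods used one and the same label throughout
    return (observed - expected) / (1.0 - expected)


# Toy data (assumption): 1 = flagged as non-mainstream, 0 = inherited.
rule_based = [1, 1, 0, 0, 1, 0, 1, 0, 0, 0]
classifier = [1, 0, 0, 0, 1, 0, 1, 1, 0, 0]

kappa = cohens_kappa(rule_based, classifier)
# Consensus candidates: indices of forms that both methods flag.
consensus = [i for i, (a, b) in enumerate(zip(rule_based, classifier)) if a == b == 1]
print(round(kappa, 3), consensus)
```

Kappa discounts the agreement two methods would reach by chance, which is why it is a stricter summary than the raw share of matching labels.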
Executive Summary
This study uses machine learning to identify non-mainstream vocabulary in the basic lexicon of Sulawesi Austronesian languages. Working from 1,357 forms in six languages, the authors combine rule-based cognate subtraction and Proto-Austronesian cross-checking, which yield 438 candidate substrate forms, with an XGBoost classifier trained on 26 phonological features. The classifier separates inherited from non-mainstream forms with an AUC of 0.763 and points to a phonological fingerprint: longer forms, more consonant clusters, more glottal stops, and fewer Austronesian prefixes. Clustering, however, yields no coherent word families (silhouette = 0.114), so the data provide no evidence for a single pre-Austronesian language layer. The study shows that phonological machine learning can complement traditional comparative methods, and it cautions against reading phonological non-conformity as evidence for a shared substrate language.
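To put the reported AUC of 0.763 in context, it helps to recall what the metric measures: the probability that a randomly chosen positive (here, a non-mainstream form) receives a higher score than a randomly chosen negative (an inherited form). The following is a minimal sketch of that pairwise definition on invented scores. It is not the paper's evaluation code, and the toy values are assumptions.

```python
def roc_auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy data (assumption): 1 = non-mainstream, 0 = inherited.
labels = [1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.6, 0.4, 0.7, 0.5, 0.3, 0.2]

auc = roc_auc(labels, scores)
print(auc)
```

On this scale 0.5 is chance and 1.0 is perfect ranking, so a value near 0.76 indicates a real but moderate signal.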
Key Points
- ▸ The study combines machine learning with traditional comparative methods to detect non-mainstream vocabulary in Sulawesi Austronesian languages.
- ▸ The XGBoost classifier trained on phonological features distinguishes inherited from non-mainstream forms with moderate discriminative power (AUC = 0.763): well above chance (0.5), but far from perfect separation.
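- ▸ Clustering yields no coherent word families (silhouette = 0.114; cross-linguistic cognate test p = 0.569), so the study finds no evidence for a single pre-Austronesian language layer, challenging the idea of a shared substrate language.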
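The abstract supports the "no coherent word families" finding with a silhouette score of 0.114. As a sketch of how that statistic behaves, the code below computes the mean silhouette coefficient for a toy clustering. The points, the Euclidean distance, and the cluster labels are all assumptions for illustration; the source does not say which representation or distance the authors clustered on.

```python
import math


def silhouette_score(points, labels, dist=math.dist):
    """Mean silhouette coefficient over all points.

    For each point, a = mean distance to its own cluster and
    b = mean distance to the nearest other cluster; s = (b - a) / max(a, b).
    """
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")
    total = 0.0
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own:
            continue  # a singleton cluster contributes 0 by convention
        a = sum(dist(points[i], points[j]) for j in own) / len(own)
        b = min(
            sum(dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        denom = max(a, b)
        if denom > 0:
            total += (b - a) / denom
    return total / len(points)


# Toy data (assumption): four points forming two obvious groups.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
tight = silhouette_score(points, [0, 0, 1, 1])  # labels match the groups
mixed = silhouette_score(points, [0, 1, 0, 1])  # labels cut across the groups
print(round(tight, 3), round(mixed, 3))
```

Scores near 1 indicate compact, well-separated clusters, while scores near 0 (such as the reported 0.114) indicate clusters that barely differ from their neighbours.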
Merits
Methodological Innovation
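The study pairs rule-based cognate subtraction with a classifier trained on phonological features, so that candidates flagged by one method can be checked against the other. The two methods agree moderately (Cohen's kappa = 0.61) and jointly identify 266 high-confidence non-mainstream candidates.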
Phonological Insight
The study identifies a phonological fingerprint of non-mainstream forms: they tend to be longer, contain more consonant clusters, show higher glottal stop rates, and carry fewer Austronesian prefixes than inherited forms.
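The four traits named in the abstract (length, consonant clusters, glottal stops, Austronesian prefixes) can each be turned into a simple numeric feature. The sketch below shows one hypothetical way to do so. The source does not list the 26 features actually used, so the vowel set, the glottal stop symbols, the prefix list, and the one-character-per-segment assumption are all illustrative choices rather than the authors' definitions.

```python
VOWELS = set("aeiou")
GLOTTAL = {"ʔ", "'"}            # assumed transcriptions of the glottal stop
PREFIXES = ("ma", "pa", "ka")   # illustrative prefix shapes, not the paper's list


def is_consonant(ch):
    """Treat any non-vowel letter or glottal symbol as a consonant."""
    return ch not in VOWELS and (ch.isalpha() or ch in GLOTTAL)


def phonological_features(form):
    """Extract four toy features from a single transcribed word form.

    Assumes one character per segment, so digraphs such as "ng"
    would be counted as clusters.
    """
    segs = form.lower()
    clusters = 0
    run = 0
    for ch in segs:
        if is_consonant(ch):
            run += 1
            if run == 2:  # count each maximal run of 2+ consonants once
                clusters += 1
        else:
            run = 0
    return {
        "length": len(segs),
        "consonant_clusters": clusters,
        "glottal_stops": sum(ch in GLOTTAL for ch in segs),
        # Crude string match: it also fires on unprefixed roots like "mata".
        "has_prefix": segs.startswith(PREFIXES),
    }


print(phonological_features("batu"))
print(phonological_features("to'mbraʔ"))  # made-up form for illustration
```

Feature dictionaries of this kind could then be fed to any tabular classifier. A realistic version would need proper segmentation and morphological analysis instead of raw string matching.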
Demerits
Limited Sample Size
The core dataset is small: 1,357 forms from six Sulawesi languages. This limits the data available for training the classifier and how far the results can be generalized beyond these languages, even though the model was also applied to 16 additional languages.
Interpretation of Results
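Phonological non-conformity does not by itself identify a substrate: the flagged forms may reflect independent innovation rather than a pre-Austronesian source, and the moderate AUC (0.763) leaves many individual classifications uncertain. Further research is needed to establish what the candidate forms actually represent.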
Expert Commentary
This study is a useful contribution to historical linguistics: it shows that machine learning can complement traditional comparative methods in detecting non-mainstream vocabulary. Its findings should nevertheless be interpreted with caution. The sample is small, the classifier's discrimination is moderate (AUC = 0.763), and the absence of coherent word families means the results do not support a single shared substrate. Even so, the phonological analysis offers insight into the structure of Sulawesi Austronesian languages and into processes of phonological change. The work may also prove relevant to language documentation and preservation in the Sulawesi region, and possibly to language education policy, although the abstract does not address these applications.
Recommendations
- ✓ Future research should aim to replicate the study's findings with a larger sample size to increase the generalizability of the results.
- ✓ The study's methodological innovations should be applied to other linguistic contexts to explore their broader applicability.
Sources
Original: arXiv - cs.CL