CWoMP: Morpheme Representation Learning for Interlinear Glossing
arXiv:2603.18184v1
Abstract: Interlinear glossed text (IGT) is a standard notation for language documentation which is linguistically rich but laborious to produce manually. Recent automated IGT methods treat glosses as character sequences, neglecting their compositional structure. We propose CWoMP (Contrastive Word-Morpheme Pretraining), which instead treats morphemes as atomic form-meaning units with learned representations. A contrastively trained encoder aligns words-in-context with their constituent morphemes in a shared embedding space; an autoregressive decoder then generates the morpheme sequence by retrieving entries from a mutable lexicon of these embeddings. Predictions are interpretable, grounded in lexicon entries, and users can improve results at inference time by expanding the lexicon without retraining. We evaluate on diverse low-resource languages, showing that CWoMP outperforms existing methods while being significantly more efficient, with particularly strong gains in extremely low-resource settings.
Executive Summary
The article proposes CWoMP, a novel morpheme representation learning framework for interlinear glossing in language documentation. CWoMP treats morphemes as atomic form-meaning units with learned representations, aligning words-in-context with constituent morphemes in a shared embedding space. This approach enables interpretable predictions and allows users to expand the lexicon at inference time without retraining. Evaluations on diverse low-resource languages demonstrate CWoMP's superiority over existing methods, particularly in extremely low-resource settings. The framework's efficiency and adaptability make it a valuable tool for language documentation and linguistic research. Future applications of CWoMP may extend to other areas of natural language processing, such as machine translation and text generation.
Key Points
- ▸ CWoMP treats morphemes as atomic form-meaning units with learned representations
- ▸ Contrastively trained encoder aligns words-in-context with constituent morphemes
- ▸ Autoregressive decoder generates morpheme sequence by retrieving entries from a mutable lexicon
- ▸ Predictions are interpretable and grounded in lexicon entries
- ▸ Users can expand the lexicon at inference time without retraining
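The contrastive alignment in the first two points can be sketched as a standard InfoNCE-style objective: word-in-context vectors and morpheme vectors are projected into one space, and each word is pulled toward its gold morpheme while being pushed from the other morphemes in the batch. The function below is a minimal NumPy illustration of that idea, not the paper's actual training code; the temperature value and the one-word-one-morpheme batch layout are assumptions for the sketch.

```python
import numpy as np

def info_nce_loss(word_emb, morph_emb, temperature=0.07):
    """Symmetric contrastive (InfoNCE-style) loss aligning each
    word-in-context vector with its paired morpheme vector.
    Row i of word_emb is assumed paired with row i of morph_emb."""
    # L2-normalise so dot products are cosine similarities.
    w = word_emb / np.linalg.norm(word_emb, axis=1, keepdims=True)
    m = morph_emb / np.linalg.norm(morph_emb, axis=1, keepdims=True)
    logits = w @ m.T / temperature        # (batch, batch) similarity matrix
    idx = np.arange(len(w))               # positives lie on the diagonal

    def xent(lg):
        # Cross-entropy of the diagonal under a row-wise softmax.
        lg = lg - lg.max(axis=1, keepdims=True)   # numerical stability
        logp = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -logp[idx, idx].mean()

    # Average both retrieval directions: word->morpheme and morpheme->word.
    return 0.5 * (xent(logits) + xent(logits.T))
```

A well-aligned batch (each word embedding close to its own morpheme embedding) yields a small loss; shuffling the pairing raises it, which is the signal the encoder is trained on.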
Merits
Strength in Low-Resource Settings
CWoMP demonstrates significant gains in extremely low-resource settings, making it a valuable tool for language documentation in underserved regions.
Efficiency and Adaptability
Because the lexicon of morpheme embeddings is mutable, users can add entries at inference time without retraining, letting documentary linguists extend the model's coverage as new morphemes are attested. Combined with the framework's computational efficiency, this makes CWoMP a practical solution for ongoing language documentation and linguistic research.
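This inference-time adaptability can be pictured as nearest-neighbour retrieval over an appendable store of (gloss, embedding) pairs. The class below is a minimal sketch under that assumption; the class name, gloss labels, and plain cosine-similarity retrieval are illustrative, not the paper's implementation, in which the decoder scores lexicon entries at each generation step.

```python
import numpy as np

class MorphemeLexicon:
    """Minimal mutable lexicon: gloss labels paired with embeddings.
    Retrieval is cosine similarity; expansion is plain appending."""

    def __init__(self):
        self.glosses, self.vecs = [], []

    def add(self, gloss, vec):
        # Expanding the lexicon is just appending an entry;
        # no retraining of the encoder or decoder is needed.
        self.glosses.append(gloss)
        self.vecs.append(vec / np.linalg.norm(vec))

    def retrieve(self, query):
        # Return the gloss whose embedding is most similar to the query.
        sims = np.stack(self.vecs) @ (query / np.linalg.norm(query))
        return self.glosses[int(np.argmax(sims))]
```

For example, adding a newly attested morpheme's embedding immediately makes it retrievable for all subsequent predictions, which is the property that lets users improve results without touching the model weights.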
Demerits
Limited Evaluation Scope
The article's evaluations are limited to a specific set of languages and tasks, and it is unclear whether CWoMP generalizes to other languages and applications.
Dependence on Pretraining
CWoMP's performance may be sensitive to the quality of pretraining data, which could limit its effectiveness in certain scenarios.
Expert Commentary
The article presents a timely and innovative contribution to the field of language documentation and NLP. CWoMP's unique approach to morpheme representation learning has the potential to significantly improve the efficiency and effectiveness of language documentation and linguistic research. However, further evaluation and testing are needed to fully understand the framework's capabilities and limitations. Additionally, the article raises important questions about the role of technology in language documentation and the potential implications for language policy and planning.
Recommendations
- ✓ Future research should focus on evaluating CWoMP's performance on a broader range of languages and tasks, as well as exploring its applicability to other areas of NLP.
- ✓ The development of CWoMP has significant implications for language policy and planning, and policymakers should consider the potential benefits and challenges of incorporating this technology into language documentation and preservation efforts.