NepTam: A Nepali-Tamang Parallel Corpus and Baseline Machine Translation Experiments

arXiv:2603.14053v1. Abstract: Modern translation systems rely heavily on large, high-quality parallel datasets for state-of-the-art performance. However, such resources are largely unavailable for most South Asian languages. Nepali and Tamang fall into this category, with Tamang being among the least digitally resourced languages in the region. This work addresses the gap by developing NepTam20K, a 20K gold-standard parallel corpus, and NepTam80K, an 80K synthetic Nepali-Tamang parallel corpus, both sentence-aligned and designed to support machine translation. The datasets were created through a pipeline involving data scraping from Nepali news and other online sources, pre-processing, semantic filtering, balancing for tense and polarity (in NepTam20K), expert translation into Tamang by native speakers, and verification by an expert Tamang linguist. The datasets cover five domains: Agriculture, Health, Education and Technology, Culture, and General Communication. To evaluate the data, baseline machine translation experiments were carried out with several multilingual pre-trained models: mBART, M2M-100, NLLB-200, and a vanilla Transformer. Fine-tuning NLLB-200 achieved the highest sacreBLEU scores: 40.92 (Nepali-Tamang) and 45.26 (Tamang-Nepali).

Executive Summary

The article introduces NepTam, a novel parallel corpus designed to bridge the resource gap in machine translation for Nepali and Tamang, two under-resourced South Asian languages. The authors present NepTam20K (20K gold standard) and NepTam80K (80K synthetic) sentence-aligned datasets, developed via data scraping, preprocessing, semantic filtering, tense/polarity balancing, expert translation, and verification. The datasets span five domains and were evaluated using pre-trained multilingual models, with fine-tuned NLLB-200 achieving the best sacreBLEU scores. This work fills a critical void in linguistic infrastructure for underrepresented languages and supports future MT research.
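The corpus pipeline summarized above (scraping, pre-processing, semantic filtering, balancing, translation, verification) begins with basic text hygiene before any semantic step. A minimal sketch of such a pre-processing pass, using whitespace normalization, length bounds, and exact deduplication; the function name and thresholds are illustrative, not from the paper, which additionally applies embedding-based semantic filtering and tense/polarity balancing not shown here:

```python
def preprocess_sentences(raw_sentences, min_tokens=3, max_tokens=60):
    """Normalize, length-filter, and deduplicate scraped sentences.

    Illustrative heuristics only; the NepTam pipeline's semantic
    filtering and balancing stages are separate, later steps.
    """
    seen = set()
    kept = []
    for sent in raw_sentences:
        sent = " ".join(sent.split())  # collapse runs of whitespace
        n_tokens = len(sent.split())
        if not (min_tokens <= n_tokens <= max_tokens):
            continue  # drop fragments and over-long scraped lines
        if sent in seen:
            continue  # scraped news text often repeats boilerplate sentences
        seen.add(sent)
        kept.append(sent)
    return kept
```

Filtering before translation matters economically here: every sentence that survives this stage is hand-translated by a native Tamang speaker, so cheap heuristics directly reduce expert workload.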

Key Points

  • Creation of NepTam20K and NepTam80K parallel corpora
  • Use of semantic filtering and domain-specific alignment for quality control
  • Evaluation via baseline multilingual MT models with notable sacreBLEU performance

Merits

Innovation

Development of a robust, domain-specific parallel corpus for under-resourced languages is a significant contribution to MT research and linguistic equity.

Demerits

Synthetic Data Reliability

While effective for initial experiments, the synthetic NepTam80K may be less linguistically accurate than the fully human-translated NepTam20K, and any systematic noise it carries could limit how well models trained on it generalize.

Expert Commentary

The NepTam project represents a commendable, methodologically rigorous advancement in low-resource MT. The authors’ balanced approach—combining automated scraping with expert-led translation and semantic filtering—addresses a persistent barrier in MT research: the absence of quality corpora for under-resourced languages. The choice of NLLB-200 for fine-tuning is particularly strategic, given its multilingual capacity and open-source accessibility, maximizing community impact. Moreover, the domain-specific segmentation (Agriculture, Health, etc.) enhances applicability across sectors, making this corpus not merely a technical artifact but a catalyst for localized innovation. While synthetic data introduces some risk of noise, the authors mitigate this by grounding the process in rigorous verification by native linguists. This work sets a new benchmark for ethical and effective low-resource MT development, and should inform future efforts in similar language families.

Recommendations

  • Extend NepTam corpus to include additional domains or dialectal variants for broader applicability
  • Publish annotated evaluation metrics and fine-tuning protocols to facilitate reproducibility and comparative research
