OasisSimp: An Open-source Asian-English Sentence Simplification Dataset

arXiv:2603.14111v1. Abstract: Sentence simplification aims to make complex text more accessible by reducing linguistic complexity while preserving the original meaning. However, progress in this area remains limited for mid-resource and low-resource languages due to the scarcity of high-quality data. To address this gap, we introduce the OasisSimp dataset, a multilingual dataset for sentence-level simplification covering five languages: English, Sinhala, Tamil, Pashto, and Thai. Among these, no prior sentence simplification datasets exist for Thai, Pashto, and Tamil, while limited data is available for Sinhala. Each language simplification dataset was created by trained annotators who followed detailed guidelines to simplify sentences while maintaining meaning, fluency, and grammatical correctness. We evaluate eight open-weight multilingual Large Language Models (LLMs) on the OasisSimp dataset and observe substantial performance disparities between high-resource and low-resource languages, highlighting the simplification challenges in multilingual settings. The OasisSimp dataset thus provides both a valuable multilingual resource and a challenging benchmark, revealing the limitations of current LLM-based simplification methods and paving the way for future research in low-resource sentence simplification. The dataset is available at https://OasisSimpDataset.github.io/.

Executive Summary

The OasisSimp dataset fills a critical gap in sentence simplification for mid-resource and low-resource languages by providing a multilingual dataset for English, Sinhala, Tamil, Pashto, and Thai. The dataset was created by trained annotators following detailed guidelines to simplify sentences while maintaining meaning, fluency, and grammatical correctness. Evaluation of eight open-weight multilingual Large Language Models (LLMs) on the OasisSimp dataset reveals substantial performance disparities between high-resource and low-resource languages, highlighting the simplification challenges in multilingual settings. The OasisSimp dataset offers a valuable resource for future research in low-resource sentence simplification, revealing the limitations of current LLM-based simplification methods.

Key Points

  • The OasisSimp dataset provides a multilingual resource for sentence-level simplification covering five languages: English, Sinhala, Tamil, Pashto, and Thai.
  • The dataset was created by trained annotators following detailed guidelines to simplify sentences while maintaining meaning, fluency, and grammatical correctness.
  • Evaluation of eight open-weight multilingual Large Language Models (LLMs) on the OasisSimp dataset reveals substantial performance disparities between high-resource and low-resource languages.
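The per-language comparison behind the last point can be sketched as follows. This is a minimal illustration, not the authors' evaluation code: the summary does not name the metric, so the sketch assumes a generic quality score where higher is better, treats English as the high-resource reference, and uses hypothetical model names and placeholder scores.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical (language, model, score) records. These numbers are placeholders
# for illustration only; they are NOT results reported for OasisSimp.
RECORDS = [
    ("English", "model-a", 0.5),
    ("English", "model-b", 0.7),
    ("Sinhala", "model-a", 0.3),
    ("Sinhala", "model-b", 0.5),
    ("Thai", "model-a", 0.2),
    ("Thai", "model-b", 0.4),
]


def mean_score_by_language(records):
    """Average the scores of all models for each language."""
    scores = defaultdict(list)
    for language, _model, score in records:
        scores[language].append(score)
    return {language: mean(values) for language, values in scores.items()}


def gap_to_reference(means, reference="English"):
    """Reference mean minus each other language's mean (positive = lags behind)."""
    ref = means[reference]
    return {lang: ref - value for lang, value in means.items() if lang != reference}


means = mean_score_by_language(RECORDS)
gaps = gap_to_reference(means)
```

Averaging over models before comparing languages keeps one strong or weak model from dominating the picture. A per-model breakdown would be the natural next step if the gap differs widely between models.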

Merits

Strength in Multilingual Resource

The OasisSimp dataset provides a valuable resource for multilingual sentence simplification, addressing the scarcity of high-quality data for mid-resource and low-resource languages.

Demerits

Limited Evaluation Scope

The evaluation is limited to eight open-weight multilingual Large Language Models (LLMs) on a single dataset. Evaluating additional models, and comparing against other simplification datasets, would be needed to fully characterize the performance disparities between high-resource and low-resource languages.

Expert Commentary

The OasisSimp dataset is a significant contribution to natural language processing, particularly for low-resource languages. Its creation and evaluation show why multilingual datasets are needed to measure the performance disparities between high-resource and low-resource languages. However, because the study evaluates only eight open-weight models, further research is needed to establish the capabilities and limitations of language models in multilingual settings. The dataset's implications extend to both practical applications and policy decisions on language technology development and accessibility.

Recommendations

  • Future research should broaden the evaluation on the OasisSimp dataset to additional models and compare the results against other simplification datasets.
  • Policymakers should prioritize the development of language technologies for low-resource languages to address the linguistic gap in education and information accessibility.
