
A Taxonomy of Programming Languages for Code Generation

Nishat Raihan, Christian Newman, Marcos Zampieri

arXiv:2604.00239v1 Announce Type: new Abstract: The world's 7,000+ languages vary widely in the availability of resources for NLP, motivating efforts to systematically categorize them by their degree of resourcefulness (Joshi et al., 2020). A similar disparity exists among programming languages (PLs); however, no resource-tier taxonomy has been established for code. As large language models (LLMs) grow increasingly capable of generating code, such a taxonomy becomes essential. To fill this gap, we present the first reproducible PL resource classification, grouping 646 languages into four tiers. We show that only 1.9% of languages (Tier 3, High) account for 74.6% of all tokens in seven major corpora, while 71.7% of languages (Tier 0, Scarce) contribute just 1.0%. Statistical analyses of within-tier inequality, dispersion, and distributional skew confirm that this imbalance is both extreme and systematic. Our results provide a principled framework for dataset curation and tier-aware evaluation of multilingual LLMs.
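The abstract mentions statistical analyses of within-tier inequality but does not say here which statistics were used. As one illustration only, the Gini coefficient is a common way to measure how unevenly a quantity such as token count is spread across a group. The sketch below is an assumption about what such a measure could look like, not the authors' method, and the function name `gini` is hypothetical.

```python
def gini(values):
    """Gini coefficient of non-negative values.

    0.0 means every item has the same value; values approaching 1.0
    mean almost everything is concentrated in a single item.
    This is an illustrative choice of statistic, not the paper's.
    """
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0
    # Rank-weighted sum over the sorted values (ranks start at 1).
    weighted = sum(rank * x for rank, x in enumerate(xs, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n


# Toy token counts, invented for illustration.
evenly_spread = gini([5, 5, 5, 5])
concentrated = gini([0, 0, 0, 10])
print(evenly_spread, concentrated)
```

Applied per tier to real token counts, a measure like this would indicate whether a few languages dominate even within a single tier.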

Executive Summary

This article summarizes a taxonomy of programming languages for code generation that groups 646 languages into four tiers by resource availability. The study finds an extreme imbalance in how tokens are distributed: only 1.9% of languages (Tier 3, High) account for 74.6% of all tokens in seven major corpora, while 71.7% of languages (Tier 0, Scarce) contribute just 1.0%. These findings bear directly on dataset curation and on the evaluation of multilingual large language models, for which the taxonomy offers a principled framework. Because the classification is reproducible, its results can be independently checked. Two limitations warrant further investigation: it is unclear how well the results generalize to other domains, and future language development may change the tier distribution.

Key Points

  • The article presents a reproducible taxonomy of programming languages for code generation.
  • The taxonomy groups 646 languages into four tiers based on their resource availability.
  • A significant imbalance is found in the distribution of tokens across languages, with only 1.9% of languages accounting for 74.6% of all tokens in seven major corpora.
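The headline figures above are two shares computed per tier: the share of languages and the share of tokens. A minimal sketch of that arithmetic follows, assuming each language already carries a tier label and a token count. The data is a hypothetical toy set, not the paper's corpora, and `tier_shares` is an illustrative name.

```python
from collections import defaultdict

# Hypothetical toy data: (language, tier, token_count). Not from the paper.
TOY_CORPUS = [
    ("LangA", 3, 7_000),
    ("LangB", 2, 2_000),
    ("LangC", 1, 900),
    ("LangD", 0, 60),
    ("LangE", 0, 40),
]


def tier_shares(rows):
    """Return {tier: (share_of_languages, share_of_tokens)} as fractions."""
    lang_counts = defaultdict(int)
    token_counts = defaultdict(int)
    for _name, tier, tokens in rows:
        lang_counts[tier] += 1
        token_counts[tier] += tokens
    total_langs = sum(lang_counts.values())
    total_tokens = sum(token_counts.values())
    return {
        tier: (lang_counts[tier] / total_langs, token_counts[tier] / total_tokens)
        for tier in sorted(lang_counts)
    }


shares = tier_shares(TOY_CORPUS)
for tier, (lang_share, token_share) in shares.items():
    print(f"Tier {tier}: {lang_share:.1%} of languages, {token_share:.1%} of tokens")
```

In the toy data, the two Tier 0 languages make up 40% of the languages but only 1% of the tokens, which mirrors the shape of the imbalance reported in the study, not its actual numbers.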

Merits

Reproducibility

The classification is reproducible, so its results can be independently verified.

Principled Framework

The taxonomy provides a principled framework for addressing challenges in dataset curation and the evaluation of multilingual large language models.

Demerits

Generalizability

The study's generalizability to other domains is unclear.

Future Language Development

Future language development may change the tier distribution, a limitation that warrants further investigation.

Expert Commentary

The article makes a significant contribution to the study of programming languages and code generation. The taxonomy provides a much-needed framework for reasoning about the wide disparity in resources across programming languages. Its limitations warrant further investigation: future language development may change the tier distribution, and the generalizability of the results to other domains is unclear, so additional research is needed to confirm the findings. Nonetheless, the taxonomy is a valuable resource for researchers and practitioners seeking to develop more effective multilingual LLMs and to build resources for low-resource programming languages.

Recommendations

  • Future research should investigate the generalizability of the taxonomy to other domains and the potential for future language development to change the tier distribution.
  • The taxonomy should be applied to the development of more effective multilingual LLMs that account for the resource availability of different languages.
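One way to read "accounting for resource availability" at evaluation time is to report scores per tier and to compare a plain average with a tier-balanced one. The sketch below is an assumption about how that could look, not a procedure taken from the paper. The languages, tiers, and pass rates are invented, and `per_tier_mean` is a hypothetical helper.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical per-language results: (language, tier, pass_rate).
# All values are invented for illustration.
RESULTS = [
    ("LangA", 3, 0.80),
    ("LangB", 3, 0.70),
    ("LangC", 1, 0.40),
    ("LangD", 0, 0.10),
]


def per_tier_mean(results):
    """Average the score within each tier; returns {tier: mean_score}."""
    by_tier = defaultdict(list)
    for _lang, tier, score in results:
        by_tier[tier].append(score)
    return {tier: mean(scores) for tier, scores in sorted(by_tier.items())}


tier_means = per_tier_mean(RESULTS)

# Plain average over languages: high-resource tiers can dominate.
overall = mean(score for _lang, _tier, score in RESULTS)

# Tier-balanced average: every tier counts equally, however many
# languages it contains.
tier_balanced = mean(list(tier_means.values()))

print(tier_means, overall, tier_balanced)
```

In this toy case the tier-balanced average is lower than the plain average, because the two high-tier languages no longer outweigh the weaker results in the lower tiers.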

Sources

Original: arXiv - cs.CL